Gemma 4 Benchmark Results
Gemma 4 delivers state-of-the-art performance across major academic and industry benchmarks, establishing itself as the most capable open-source model family available. The flagship 31B dense model rivals proprietary models from OpenAI, Anthropic, and Google's own Gemini line.
This page provides detailed benchmark scores, methodology explanations, and cross-model comparisons to help you evaluate which Gemma 4 variant best fits your use case.
Core Benchmark Scores
Performance of the Gemma 4 31B flagship model on key benchmarks:
AIME 2026
Mathematical Reasoning
The American Invitational Mathematics Examination tests advanced mathematical reasoning and multi-step problem solving. Gemma 4's 89.2% score demonstrates exceptional ability in competition-level mathematics, including algebra, geometry, number theory, and combinatorics.
LiveCodeBench v6
Code Generation
LiveCodeBench evaluates real-world coding ability across code generation, debugging, refactoring, and understanding tasks in multiple programming languages. The 80.0% score places Gemma 4 among the top coding models available.
GPQA Diamond
Expert Knowledge
Graduate-level question answering spanning physics, chemistry, and biology. Questions are designed by domain experts and verified by PhD-level reviewers. Gemma 4's 84.3% score indicates deep scientific reasoning capabilities.
MMMLU
Multilingual Understanding
Massive Multitask Multilingual Language Understanding evaluates broad knowledge and reasoning across 140+ languages and dozens of academic subjects. The 85.2% score confirms Gemma 4's strength as a truly multilingual model.
Performance Across Model Variants
How each Gemma 4 variant performs relative to the 31B flagship:
| Model | AIME 2026 | LCB v6 | GPQA | MMMLU |
|---|---|---|---|---|
| 31B Dense | 89.2% | 80.0% | 84.3% | 85.2% |
| 26B A4B MoE | ~85% | ~76% | ~80% | ~82% |
| E4B | ~62% | ~55% | ~58% | ~68% |
| E2B | ~45% | ~38% | ~42% | ~55% |
Scores are approximate and may vary based on quantization level and inference configuration. The 31B model represents the peak performance of the Gemma 4 family.
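As a rough illustration (not an official calculation), the table's approximate scores can be expressed as a percentage of the 31B flagship's score on each benchmark. The variant numbers below are the table's "~" estimates, so the resulting ratios are estimates too.

```python
# Relative performance of each Gemma 4 variant vs. the 31B flagship,
# using the (approximate) scores from the table above.
FLAGSHIP = {"AIME 2026": 89.2, "LCB v6": 80.0, "GPQA": 84.3, "MMMLU": 85.2}

VARIANTS = {
    "26B A4B MoE": {"AIME 2026": 85, "LCB v6": 76, "GPQA": 80, "MMMLU": 82},
    "E4B":         {"AIME 2026": 62, "LCB v6": 55, "GPQA": 58, "MMMLU": 68},
    "E2B":         {"AIME 2026": 45, "LCB v6": 38, "GPQA": 42, "MMMLU": 55},
}

def relative_scores(variant: dict) -> dict:
    """Score on each benchmark as a percentage of the flagship's score."""
    return {b: round(100 * s / FLAGSHIP[b], 1) for b, s in variant.items()}

for name, scores in VARIANTS.items():
    print(name, relative_scores(scores))
```

Running this shows the MoE variant landing in the mid-90s percent of flagship performance on every benchmark, consistent with the near-flagship characterization above.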
Key Strengths in Review
Mathematical Reasoning
The 89.2% AIME score is among the highest for any open-source model, demonstrating Gemma 4's exceptional ability to handle complex, multi-step mathematical problems that require deep logical reasoning.
Code Generation Quality
At 80% on LiveCodeBench v6, Gemma 4 produces production-quality code across Python, JavaScript, TypeScript, Go, Rust, and other languages. It excels at understanding complex codebases and generating contextually appropriate solutions.
Multilingual Performance
Unlike many models that excel only in English, Gemma 4 maintains strong performance across 140+ languages. The 85.2% MMMLU score reflects consistent quality across linguistic boundaries.
Efficiency via MoE Architecture
The 26B A4B MoE variant achieves near-flagship performance while activating only 4B parameters per inference, delivering an exceptional performance-per-compute ratio for production deployments.
Benchmark FAQ
What benchmarks does Gemma 4 perform best on?
Gemma 4 31B excels particularly in mathematical reasoning (AIME 2026: 89.2%), scientific knowledge (GPQA Diamond: 84.3%), and code generation (LiveCodeBench v6: 80%). These scores rival or exceed many proprietary models.
How does Gemma 4 compare to Llama 4?
Gemma 4 31B and Llama 4 are both competitive open-source models. Gemma 4 tends to lead in multimodal tasks, multilingual understanding, and mathematical reasoning, while the two models trade leads across other benchmarks.
Are benchmark scores consistent across quantization levels?
There is typically a 1-3% degradation in benchmark performance when using INT8 quantization, and 2-5% with INT4. The exact impact varies by benchmark and model variant. BF16, the model's native precision, provides the best scores.
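A minimal sketch of what those ranges imply, interpreting the quoted 1-3% and 2-5% as absolute percentage-point drops from the BF16 scores (an assumption; the page does not specify absolute vs. relative):

```python
# Illustrative quantized-score ranges derived from the BF16 scores above.
# Assumes the stated degradation is in absolute percentage points.
BF16_SCORES = {"AIME 2026": 89.2, "LCB v6": 80.0, "GPQA": 84.3, "MMMLU": 85.2}
DEGRADATION = {"int8": (1.0, 3.0), "int4": (2.0, 5.0)}  # (min, max) point drop

def quantized_range(score: float, level: str) -> tuple:
    """Expected (worst-case, best-case) score after quantization."""
    min_drop, max_drop = DEGRADATION[level]
    return (round(score - max_drop, 1), round(score - min_drop, 1))

for bench, score in BF16_SCORES.items():
    print(bench,
          "INT8:", quantized_range(score, "int8"),
          "INT4:", quantized_range(score, "int4"))
```

For example, an 80.0% LiveCodeBench score would land somewhere around 77-79% under INT8 and 75-78% under INT4 by this reading.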
Does the MoE model (26B A4B) match the dense model (31B)?
The 26B MoE model achieves approximately 90-95% of the 31B dense model's benchmark scores while requiring significantly less compute per inference. For most practical applications, the quality difference is negligible.
How were these benchmarks measured?
Benchmark scores are based on Google DeepMind's official evaluations using standard evaluation protocols. Independent reproductions by the community on platforms like Hugging Face Open LLM Leaderboard have confirmed similar results.
Is Gemma 4 the best open-source model?
As of April 2026, Gemma 4 31B is among the top open-source models across most major benchmarks. The landscape evolves rapidly, but Gemma 4's combination of multimodal capabilities, long context, and strong reasoning makes it a leading choice.
Experience Gemma 4 Performance
See the benchmark numbers in action. Try Gemma 4 in your browser or deploy it on your own hardware.