Gemma 4 delivers state-of-the-art performance across major academic and industry benchmarks, placing it among the most capable open-source model families available. The flagship 31B dense model rivals proprietary models from OpenAI and Anthropic, as well as Google's own Gemini line.
This page provides detailed benchmark scores, methodology explanations, and cross-model comparisons to help you evaluate which Gemma 4 variant best fits your use case.
Performance of the Gemma 4 31B flagship model on key benchmarks:
The American Invitational Mathematics Examination (AIME) tests advanced mathematical reasoning and multi-step problem solving. Gemma 4 31B's 89.2% score on AIME 2026 demonstrates strong ability in competition-level mathematics, including algebra, geometry, number theory, and combinatorics.
LiveCodeBench evaluates coding ability on competition-style programming problems that are collected over time to limit training-data contamination. It covers code generation along with related tasks such as self-repair and code execution. The 80.0% score on LiveCodeBench v6 places Gemma 4 31B among the top coding models available.
GPQA Diamond is a graduate-level question-answering benchmark spanning physics, chemistry, and biology. Questions are designed by domain experts and verified by PhD-level reviewers. Gemma 4 31B's 84.3% score indicates deep scientific reasoning capabilities.
Massive Multitask Multilingual Language Understanding (MMMLU) evaluates broad knowledge and reasoning across 140+ languages and dozens of academic subjects. The 85.2% score confirms Gemma 4's strength as a multilingual model.
How each Gemma 4 variant performs relative to the 31B flagship:
| Model | AIME 2026 | LCB v6 | GPQA | MMMLU |
|---|---|---|---|---|
| 31B Dense | 89.2% | 80.0% | 84.3% | 85.2% |
| 26B A4B MoE | ~85% | ~76% | ~80% | ~82% |
| E4B | ~62% | ~55% | ~58% | ~68% |
| E2B | ~45% | ~38% | ~42% | ~55% |
Scores marked with ~ are approximate, and all scores may vary with quantization level and inference configuration. The 31B model represents the peak performance of the Gemma 4 family.
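To make the comparison with the flagship concrete, the table can be converted into relative scores. The following is a minimal sketch, not an official evaluation tool: the function name `relative_to_flagship` is hypothetical, and the values marked ~ in the table are treated as point estimates, which is an assumption.

```python
# Scores copied from the comparison table above.
# Values shown with "~" in the table are treated here as point estimates (assumption).
SCORES = {
    "31B Dense":   {"AIME 2026": 89.2, "LCB v6": 80.0, "GPQA": 84.3, "MMMLU": 85.2},
    "26B A4B MoE": {"AIME 2026": 85.0, "LCB v6": 76.0, "GPQA": 80.0, "MMMLU": 82.0},
    "E4B":         {"AIME 2026": 62.0, "LCB v6": 55.0, "GPQA": 58.0, "MMMLU": 68.0},
    "E2B":         {"AIME 2026": 45.0, "LCB v6": 38.0, "GPQA": 42.0, "MMMLU": 55.0},
}


def relative_to_flagship(scores, flagship="31B Dense"):
    """Return each model's score as a percentage of the flagship's score."""
    base = scores[flagship]
    return {
        model: {
            bench: round(100 * value / base[bench], 1)
            for bench, value in row.items()
        }
        for model, row in scores.items()
    }


relative = relative_to_flagship(SCORES)
for model, row in relative.items():
    cells = ", ".join(f"{bench}: {pct}%" for bench, pct in row.items())
    print(f"{model}: {cells}")
```

On these figures, the 26B A4B MoE variant lands at roughly 95 to 96% of the flagship on every benchmark in the table, while E2B ranges from about 48% to 65%.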
The 89.2% AIME score is among the highest for any open-source model, demonstrating Gemma 4's exceptional ability to handle complex, multi-step mathematical problems that require deep logical reasoning.
At 80% on LiveCodeBench v6, Gemma 4 produces production-quality code across Python, JavaScript, TypeScript, Go, Rust, and other languages. It excels at understanding complex codebases and generating contextually appropriate solutions.
Unlike many models that excel only in English, Gemma 4 maintains strong performance across 140+ languages. The 85.2% MMMLU score reflects consistent quality across linguistic boundaries.
The 26B A4B MoE variant achieves near-flagship performance while activating only about 4B of its 26B parameters per token, delivering a strong performance-per-compute ratio for production deployments.
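The compute argument can be sketched with simple arithmetic. The parameter counts below are nominal values read from the model names (26B total, 4B active, 31B dense) and are assumptions for illustration. Active-parameter count is only a rough proxy for per-token compute, and it says nothing about memory, since all MoE weights still need to be loaded.

```python
# Nominal sizes in billions of parameters, read from the model names.
# These are illustrative assumptions, not measured figures.
MOE_TOTAL_B = 26.0
MOE_ACTIVE_B = 4.0
DENSE_TOTAL_B = 31.0


def share(part, whole):
    """Return part / whole as a fraction."""
    return part / whole


# Fraction of the MoE model's own parameters that are active at once.
moe_active_share = share(MOE_ACTIVE_B, MOE_TOTAL_B)

# Active MoE parameters relative to the dense flagship, where all parameters are active.
active_vs_dense = share(MOE_ACTIVE_B, DENSE_TOTAL_B)

print(f"Active share of the MoE model: {moe_active_share:.1%}")
print(f"Active MoE parameters vs. 31B dense: {active_vs_dense:.1%}")
```

Under these assumptions, the MoE variant uses roughly 15% of its own parameters at a time, or about 13% of the dense flagship's parameter count.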
Gemma 4 31B excels particularly in mathematical reasoning (AIME 2026: 89.2%), scientific knowledge (GPQA Diamond: 84.3%), and code generation (LiveCodeBench v6: 80%). These scores rival or exceed many proprietary models.
Gemma 4 31B and Llama 4 are both competitive open-source models. Gemma 4 tends to lead in multimodal tasks, multilingual understanding, and mathematical reasoning, while the two models trade leads on other benchmarks.
There is typically a 1-3% degradation in benchmark performance when using INT8 quantization, and 2-5% with INT4. The exact impact varies by benchmark and model variant. BF16 (unquantized 16-bit weights) provides the best scores.
The 26B MoE model achieves approximately 95% of the 31B dense model's benchmark scores, based on the approximate figures in the comparison table, while requiring significantly less compute per inference. For most practical applications, the quality difference is small.
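The quantization ranges above can be turned into a rough estimate for a given score. This sketch assumes the percentages are relative to the BF16 score rather than absolute percentage points, which the text does not specify. The helper `estimate_quantized_range` is hypothetical, and real results depend on the benchmark and quantization method.

```python
# Degradation ranges in percent, taken from the text: (smallest loss, largest loss).
# Assumption: the loss is relative to the BF16 score, not absolute percentage points.
DEGRADATION_PCT = {
    "BF16": (0.0, 0.0),
    "INT8": (1.0, 3.0),
    "INT4": (2.0, 5.0),
}


def estimate_quantized_range(bf16_score, precision):
    """Return (worst_case, best_case) estimated scores for a precision level."""
    smallest_loss, largest_loss = DEGRADATION_PCT[precision]
    worst_case = bf16_score * (1 - largest_loss / 100)
    best_case = bf16_score * (1 - smallest_loss / 100)
    return round(worst_case, 1), round(best_case, 1)


# Example: the 31B model's AIME 2026 score of 89.2%.
for precision in ("BF16", "INT8", "INT4"):
    worst, best = estimate_quantized_range(89.2, precision)
    print(f"{precision}: {worst}% to {best}%")
```

Under these assumptions, an 89.2% BF16 score would fall to somewhere around 86.5% to 88.3% with INT8 and around 84.7% to 87.4% with INT4.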
Benchmark scores are based on Google DeepMind's official evaluations using standard evaluation protocols. Independent reproductions by the community on platforms like Hugging Face Open LLM Leaderboard have confirmed similar results.
As of April 2026, Gemma 4 31B is among the top open-source models across most major benchmarks. The landscape evolves rapidly, but Gemma 4's combination of multimodal capabilities, long context, and strong reasoning makes it a leading choice.
See the benchmark numbers in action. Try Gemma 4 in your browser or deploy it on your own hardware.