Gemma 4 model weights are available for free from Hugging Face, Kaggle, Ollama, and ModelScope. This guide covers every variant (E2B, E4B, 26B MoE, and 31B Dense) in every format: full-precision SafeTensors, quantized GGUF (Q4 / Q5 / Q8), GPTQ, and MLX, with direct download links and file sizes.
All Gemma 4 models are released under the Apache 2.0 license, which means you can download, use, modify, and redistribute them freely for any purpose — including commercial applications.
The file sizes below are read from unsloth's official Gemma 4 GGUF repositories on Hugging Face; unsloth is the most-downloaded Gemma 4 GGUF publisher. The last column gives the repository path whose file list contains each file.
| Model | Total Params | Q4_K_M | Q5_K_M | Q8_0 | BF16 | Hugging Face Repo |
|---|---|---|---|---|---|---|
| Gemma 4 E2B-it | 5B | 3.11 GB | 3.36 GB | 5.05 GB | 9.31 GB | unsloth/gemma-4-E2B-it-GGUF |
| Gemma 4 E4B-it | 8B | 4.98 GB | 5.48 GB | 8.19 GB | 15.1 GB | unsloth/gemma-4-E4B-it-GGUF |
| Gemma 4 26B-A4B-it | 27B (MoE, 4B active) | 16.9 GB | 21.2 GB | 26.9 GB | — | unsloth/gemma-4-26B-A4B-it-GGUF |
| Gemma 4 31B-it | 33B (Dense) | 18.3 GB | 21.7 GB | 32.6 GB | — | unsloth/gemma-4-31B-it-GGUF |
Sizes verified from unsloth's Hugging Face repos on 2026-04-21. For full-precision SafeTensors, use the official google/gemma-4-E2B, -E4B, -26B-A4B, and -31B repos (add -it for instruction-tuned). 26B-A4B Q4 / Q5 files shipped by unsloth are Unsloth Dynamic (UD) variants in the Q4_K_M / Q5_K_M size tier.
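As a rough illustration of how the table can guide a download choice, the sketch below picks the largest listed quantization whose file fits a given memory budget. The sizes are copied from the table; the `largest_fit` helper and the 2 GB headroom for KV cache and runtime buffers are assumptions made for this example, not measured requirements.

```python
# File sizes in GB, copied from the table above (unsloth GGUF repos).
# Formats the table does not list for a model are simply omitted.
GGUF_SIZES_GB = {
    "E2B-it": {"Q4_K_M": 3.11, "Q5_K_M": 3.36, "Q8_0": 5.05, "BF16": 9.31},
    "E4B-it": {"Q4_K_M": 4.98, "Q5_K_M": 5.48, "Q8_0": 8.19, "BF16": 15.1},
    "26B-A4B-it": {"Q4_K_M": 16.9, "Q5_K_M": 21.2, "Q8_0": 26.9},
    "31B-it": {"Q4_K_M": 18.3, "Q5_K_M": 21.7, "Q8_0": 32.6},
}

# Assumed headroom for KV cache and runtime buffers; not a measured figure.
HEADROOM_GB = 2.0


def largest_fit(model, budget_gb, headroom_gb=HEADROOM_GB):
    """Return the largest listed quantization whose file fits the budget, or None."""
    usable = budget_gb - headroom_gb
    candidates = [
        (size, name)
        for name, size in GGUF_SIZES_GB[model].items()
        if size <= usable
    ]
    # max() compares by file size first, so the biggest fitting file wins.
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    print(largest_fit("31B-it", 24))      # Q5_K_M
    print(largest_fit("26B-A4B-it", 16))  # None
```

Actual memory use depends on context length and runtime, so treat the result as a starting point rather than a guarantee.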
Hugging Face: The primary platform for Gemma 4 model weights. Offers all variants in multiple formats including SafeTensors, GGUF, and GPTQ quantized versions. Supports git-based downloads, the Hugging Face CLI, and direct browser downloads.
Kaggle: Google's data science platform hosts official Gemma 4 model weights. Convenient for users already in the Kaggle ecosystem, with notebook integration for quick experimentation.
Ollama: Pre-packaged Gemma 4 models optimized for local inference with Ollama. One-command download and run. Models are automatically quantized and optimized for your hardware.
ModelScope: China-based model hosting platform with fast download speeds for users in Asia. Mirrors the official Gemma 4 models with full documentation in Chinese.
Understanding the different model file formats available for Gemma 4:
SafeTensors: The default format on Hugging Face. Safe, fast-loading tensors designed to prevent code execution vulnerabilities. Used with Hugging Face Transformers, vLLM, and other Python-based frameworks.
Best for: research, fine-tuning, Python frameworks, vLLM serving
GGUF: The standard format for llama.cpp and Ollama. Supports various quantization levels (Q4, Q5, Q8, etc.) to reduce model size and memory requirements. Optimized for CPU and mixed CPU/GPU inference.
Best for: local inference, Ollama, llama.cpp, KoboldCpp, LM Studio
GPTQ: GPU-optimized quantization format that maintains high accuracy while significantly reducing VRAM requirements. Available through community contributions on Hugging Face.
Best for: GPU inference with reduced VRAM, production serving
MLX: The weight format used by MLX, Apple's machine-learning framework for Apple Silicon (M1/M2/M3/M4). Leverages unified memory architecture for efficient inference on Mac hardware.
Best for: Mac with Apple Silicon, MLX framework
Quantization reduces model size and memory usage at the cost of some accuracy. Here's how different levels compare for Gemma 4:
| Format | Bits | Quality | Notes |
|---|---|---|---|
| BF16 / FP16 (Full Precision) | 16-bit | 100% | Full model quality with no accuracy loss. Requires the most VRAM and disk space. |
| INT8 / Q8 | 8-bit | ~98-99% | Minimal quality loss. Halves VRAM requirements compared to FP16. Recommended for most GPU deployments. |
| Q5_K_M | 5-bit | ~95-97% | Good balance of quality and size. Popular choice for local inference with GGUF format. |
| INT4 / Q4_K_M | 4-bit | ~93-95% | Significant size reduction with acceptable quality for most use cases. Enables running larger models on consumer hardware. |
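The "Bits" column is nominal. A quick way to see what a file actually stores is to divide its size by the parameter count. The sketch below does that arithmetic for the E4B-it row of the size table; it assumes 1 GB means 10^9 bytes and uses the table's rounded 8B parameter figure, so the results are approximate. The helper names are hypothetical.

```python
def bits_per_weight(file_gb, params_billion):
    """Average bits stored per parameter, treating 1 GB as 10**9 bytes."""
    return file_gb * 8 / params_billion


def ratio_to_bf16(file_gb, bf16_gb):
    """File size as a fraction of the BF16 file size."""
    return file_gb / bf16_gb


# E4B-it figures from the size table: 8B total params (rounded).
E4B_PARAMS_B = 8
E4B_SIZES_GB = {"Q4_K_M": 4.98, "Q5_K_M": 5.48, "Q8_0": 8.19, "BF16": 15.1}

if __name__ == "__main__":
    for name, size in E4B_SIZES_GB.items():
        bpw = bits_per_weight(size, E4B_PARAMS_B)
        frac = ratio_to_bf16(size, E4B_SIZES_GB["BF16"])
        print(f"{name}: {bpw:.2f} bits/weight, {frac:.0%} of BF16")
```

On these figures the Q4_K_M file works out to about 5 bits per weight and roughly a third of the BF16 size, while Q8_0 is a little over half. The gap from the nominal bit width is likely because some tensors are kept at higher precision, though that is an inference from the sizes rather than something the table states.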
Install the Hugging Face CLI and download models directly:
pip install huggingface_hub
# Full-precision SafeTensors (official Google repo)
huggingface-cli download google/gemma-4-31B-it
# GGUF quantized (community, unsloth — most downloaded)
huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
  --include "gemma-4-31B-it-Q4_K_M.gguf"
Clone model repositories with Git Large File Storage:
git lfs install
git clone https://huggingface.co/google/gemma-4-31B-it
Pull models directly into Ollama:
# Pull any variant
ollama pull gemma4:e2b
ollama pull gemma4:e4b
ollama pull gemma4:26b
ollama pull gemma4:31b
Hugging Face is the most comprehensive source with all formats and variants. For one-command local setup, use Ollama. For users in China, ModelScope offers faster download speeds.
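For scripted setups, the Hugging Face CLI download shown earlier can also be done from Python with the `huggingface_hub` library's `snapshot_download` function. The sketch below is one possible approach: the `gguf_patterns`, `matches`, and `download_gguf` helpers and the `models` directory are hypothetical names chosen for illustration, and the glob pattern assumes the quantization label appears in the file name, as it does in `gemma-4-31B-it-Q4_K_M.gguf`.

```python
import fnmatch


def gguf_patterns(quant):
    """Glob patterns selecting only the GGUF files for one quantization level."""
    return [f"*{quant}*.gguf"]


def matches(filename, quant):
    """Check locally whether a file name would be selected for this quant."""
    return any(fnmatch.fnmatchcase(filename, p) for p in gguf_patterns(quant))


def download_gguf(repo_id, quant, local_dir="models"):
    """Download the matching GGUF files and return the local directory path."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=gguf_patterns(quant),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Requires `pip install huggingface_hub` and network access.
    path = download_gguf("unsloth/gemma-4-31B-it-GGUF", "Q4_K_M")
    print(path)
```

Filtering by pattern avoids pulling every quantization in the repository, which matters when a single file runs to tens of gigabytes.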
For Ollama or llama.cpp: download GGUF files. For Python/vLLM: use SafeTensors format. For Mac with Apple Silicon: use MLX format. If unsure, start with Ollama which handles format selection automatically.
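Full-precision (BF16) GGUF files are 9.31 GB for E2B and 15.1 GB for E4B. At roughly 2 bytes per parameter, the 26B MoE (27B total parameters) and 31B Dense (33B total parameters) come to about 54 GB and 66 GB. Q4_K_M files are roughly 3x smaller than BF16, for example 3.11 GB for E2B and 4.98 GB for E4B. Ollama's default downloads use optimized quantization.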
No. Gemma 4 models are publicly accessible under the Apache 2.0 license, so you can download them without an account, including through the Hugging Face CLI. Logging in with an account is optional and mainly helps if you run into rate limits on anonymous downloads.
GGUF (GPT-Generated Unified Format) is a binary format designed for efficient local inference with llama.cpp and Ollama. It supports various quantization levels, allowing you to trade accuracy for smaller file sizes and lower memory usage.
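One practical consequence of GGUF being a self-describing binary format is that a downloaded file can be sanity-checked by reading its header. The sketch below assumes the little-endian header layout used by GGUF version 2 and later: the 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata key-value counts. The `read_gguf_header` name is hypothetical, and this is a minimal check, not a full parser.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(data):
    """Parse the fixed-size GGUF header from the first 24 bytes of a file."""
    if len(data) < HEADER_SIZE or data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # "<IQQ": little-endian uint32 version, uint64 tensor count, uint64 KV count.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}


def read_gguf_header_from_file(path):
    """Read only the header bytes, so multi-gigabyte files are not loaded."""
    with open(path, "rb") as f:
        return read_gguf_header(f.read(HEADER_SIZE))
```

A truncated or mislabeled download typically fails the magic check immediately, which is cheaper than discovering the problem when the model is loaded.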
Yes. ModelScope (魔搭社区) mirrors Gemma 4 models with fast download speeds within China. Alternatively, use a mirror or proxy for Hugging Face downloads.
Get Gemma 4 model weights and start deploying. Check our deployment guide for step-by-step setup instructions.