Download Gemma 4 GGUF & Model Weights from Hugging Face

Gemma 4 model weights are available for free from Hugging Face, Kaggle, Ollama, and ModelScope. This guide covers every variant (E2B, E4B, 26B MoE, and 31B Dense) in every format: full-precision SafeTensors, quantized GGUF (Q4 / Q5 / Q8), GPTQ, and MLX. It lists repository paths and file sizes for each.

All Gemma 4 models are released under the Apache 2.0 license, which means you can download, use, modify, and redistribute them freely for any purpose — including commercial applications.

Gemma 4 GGUF Download Sizes on Hugging Face

The file sizes below are read from unsloth's Gemma 4 GGUF repositories on Hugging Face. unsloth is a community publisher and the most-downloaded source of Gemma 4 GGUF files. The last column lists each repository path.

Model	Total Params	Q4_K_M	Q5_K_M	Q8_0	BF16	Hugging Face Repo
Gemma 4 E2B-it	5B	3.11 GB	3.36 GB	5.05 GB	9.31 GB	unsloth/gemma-4-E2B-it-GGUF
Gemma 4 E4B-it	8B	4.98 GB	5.48 GB	8.19 GB	15.1 GB	unsloth/gemma-4-E4B-it-GGUF
Gemma 4 26B-A4B-it	27B (MoE, 4B active)	16.9 GB	21.2 GB	26.9 GB	—	unsloth/gemma-4-26B-A4B-it-GGUF
Gemma 4 31B-it	33B (Dense)	18.3 GB	21.7 GB	32.6 GB	—	unsloth/gemma-4-31B-it-GGUF

Sizes verified from unsloth's Hugging Face repos on 2026-04-21. For full-precision SafeTensors, use the official google/gemma-4-E2B, -E4B, -26B-A4B, and -31B repos (add -it for instruction-tuned). 26B-A4B Q4 / Q5 files shipped by unsloth are Unsloth Dynamic (UD) variants in the Q4_K_M / Q5_K_M size tier.
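As a rough way to read the table, the sketch below picks the largest listed GGUF file that fits a given memory budget, treating file size as a proxy for precision. The sizes are copied from the table above. The `headroom_gb` allowance for context and runtime overhead is an assumption for illustration, not a measured figure, and `pick_quant` is a hypothetical helper name.

```python
# File sizes in GB, copied from the table above (unsloth GGUF repos).
GGUF_SIZES_GB = {
    "E2B-it": {"Q4_K_M": 3.11, "Q5_K_M": 3.36, "Q8_0": 5.05, "BF16": 9.31},
    "E4B-it": {"Q4_K_M": 4.98, "Q5_K_M": 5.48, "Q8_0": 8.19, "BF16": 15.1},
    "26B-A4B-it": {"Q4_K_M": 16.9, "Q5_K_M": 21.2, "Q8_0": 26.9},
    "31B-it": {"Q4_K_M": 18.3, "Q5_K_M": 21.7, "Q8_0": 32.6},
}


def pick_quant(model, memory_gb, headroom_gb=2.0):
    """Return the largest listed file that fits, or None if nothing fits.

    headroom_gb is an assumed allowance for context and runtime overhead.
    """
    budget = memory_gb - headroom_gb
    fitting = {q: s for q, s in GGUF_SIZES_GB[model].items() if s <= budget}
    if not fitting:
        return None
    # Larger file means more bits kept per weight.
    return max(fitting, key=fitting.get)
```

For example, `pick_quant("31B-it", 24)` selects Q5_K_M under these assumptions, while `pick_quant("31B-it", 16)` returns `None` because even the Q4_K_M file exceeds the budget.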

Official Download Sources

Hugging Face

The primary platform for Gemma 4 model weights. Offers all variants in multiple formats including SafeTensors, GGUF, and GPTQ quantized versions. Supports git-based downloads, the Hugging Face CLI, and direct browser downloads.

• All model variants and sizes
• Multiple quantization formats
• Git LFS and CLI downloads
• Community-contributed quantizations
• Model cards with documentation

Kaggle

Google's data science platform hosts official Gemma 4 model weights. Convenient for users already in the Kaggle ecosystem, with notebook integration for quick experimentation.

• Official Google distribution
• Notebook integration
• Version tracking
• Direct download

Ollama Library

Pre-packaged Gemma 4 models for local inference with Ollama. A single command downloads and runs a model. Library models ship already quantized, and Ollama decides at run time how much of the model to place on the GPU.

• One-command install
• Auto-optimized for your hardware
• All variants available
• Automatic updates

ModelScope (魔搭社区)

China-based model hosting platform with fast download speeds for users in Asia. Mirrors the official Gemma 4 models with full documentation in Chinese.

• Fast downloads in China/Asia
• Chinese documentation
• Git-based downloads
• Community models

Model Format Guide

Understanding the different model file formats available for Gemma 4:

SafeTensors (.safetensors)

The default format on Hugging Face. Safe, fast-loading tensors designed to prevent code execution vulnerabilities. Used with Hugging Face Transformers, vLLM, and other Python-based frameworks.

Best for: research, fine-tuning, Python frameworks, vLLM serving
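One reason the format avoids code execution is that a .safetensors file is plain data: an 8-byte little-endian length, a JSON header describing each tensor, then raw bytes. The sketch below reads that header with the standard library only, which can be handy for checking a download without loading the weights. The function names are hypothetical, and this is not a replacement for the safetensors library.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as a little-endian unsigned 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def list_tensors(path):
    """Yield (name, dtype, shape) per tensor, skipping file-level metadata."""
    for name, info in read_safetensors_header(path).items():
        if name == "__metadata__":
            continue
        yield name, info["dtype"], info["shape"]
```

Nothing in the header is executed, only parsed as JSON, which is the property the paragraph above describes.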

GGUF (.gguf)

The standard format for llama.cpp and Ollama. Supports various quantization levels (Q4, Q5, Q8, etc.) to reduce model size and memory requirements. Optimized for CPU and mixed CPU/GPU inference.

Best for: local inference, Ollama, llama.cpp, KoboldCpp, LM Studio
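A quick sanity check after downloading a .gguf file is to confirm that it starts with the 4-byte magic `GGUF` followed by a version number. The sketch below does only that. It assumes the common little-endian layout, and `gguf_version` is a hypothetical helper name. A failed check often means an error page or a partial file was saved instead of the model.

```python
import struct

GGUF_MAGIC = b"GGUF"


def gguf_version(path):
    """Return the GGUF version number, or None if the file is not GGUF."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8 or head[:4] != GGUF_MAGIC:
        return None
    # Bytes 4..8: format version as a little-endian unsigned 32-bit int.
    (version,) = struct.unpack("<I", head[4:8])
    return version
```

This checks only the first eight bytes, so it cannot detect a file that was truncated later in the download.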

GPTQ

GPU-optimized quantization format that maintains high accuracy while significantly reducing VRAM requirements. Available through community contributions on Hugging Face.

Best for: GPU inference with reduced VRAM, production serving

MLX Format

Apple's native ML format optimized for Apple Silicon (M1/M2/M3/M4). Leverages unified memory architecture for efficient inference on Mac hardware.

Best for: Mac with Apple Silicon, MLX framework

Quantization Guide

Quantization reduces model size and memory usage at the cost of some accuracy. The table below compares the common levels for Gemma 4; the quality figures are approximate and relative to full precision.

Format	Bits	Quality	Notes
BF16 / FP16 (Full Precision)	16-bit	100%	Full model quality with no accuracy loss. Requires the most VRAM and disk space.
INT8 / Q8	8-bit	~98-99%	Minimal quality loss. Halves VRAM requirements compared to FP16. Recommended for most GPU deployments.
Q5_K_M	5-bit	~95-97%	Good balance of quality and size. Popular choice for local inference with GGUF format.
INT4 / Q4_K_M	4-bit	~93-95%	Significant size reduction with acceptable quality for most use cases. Enables running larger models on consumer hardware.
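The arithmetic behind these size differences can be sketched in a few lines: a nominal size is parameters times bits divided by eight, and the average bits per weight of a real file can be recovered by inverting that. The helper names are hypothetical. The figures assume 1 GB = 10^9 bytes and ignore file metadata, so treat the results as estimates.

```python
def nominal_size_gb(params_billion, bits_per_weight):
    """Estimated size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def effective_bits_per_weight(file_gb, params_billion):
    """Average bits stored per parameter, derived from a real file size."""
    return file_gb * 8 / params_billion


# 31B-it figures from the download-size table: 33B total parameters,
# 18.3 GB for the Q4_K_M file.
print(round(nominal_size_gb(33, 16), 1))              # 16-bit estimate in GB
print(round(effective_bits_per_weight(18.3, 33), 2))  # bits per weight at Q4_K_M
```

The Q4_K_M file works out to a little over 4 bits per weight, which fits the fact that mixed schemes such as the K-quants keep some tensors at higher precision than the nominal bit width.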

Download via Command Line

Hugging Face CLI

Install the Hugging Face CLI and download models directly:

pip install huggingface_hub

# Full-precision SafeTensors (official Google repo)
huggingface-cli download google/gemma-4-31B-it

# GGUF quantized (community, unsloth — most downloaded)
huggingface-cli download unsloth/gemma-4-31B-it-GGUF \
  --include "gemma-4-31B-it-Q4_K_M.gguf"
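The same single-file download can be done from Python with `hf_hub_download` from the `huggingface_hub` package installed above. This is a minimal sketch: `gguf_filename` and `download_gguf` are hypothetical helpers, and the file-name pattern is an assumption generalized from the one example in the CLI command. Other repos may name files differently, for instance the Unsloth Dynamic variants, so check the repo's file list first.

```python
def gguf_filename(model, quant):
    """Build a file name following the pattern of the CLI example (assumed)."""
    return f"{model}-{quant}.gguf"


def download_gguf(model="gemma-4-31B-it", quant="Q4_K_M", publisher="unsloth"):
    """Download one GGUF file into the local cache and return its path."""
    # Imported lazily so gguf_filename works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=f"{publisher}/{model}-GGUF",
        filename=gguf_filename(model, quant),
    )
```

Calling `download_gguf()` with the defaults requests the same file as the CLI command above and returns the path of the cached copy.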

Git LFS

Clone model repositories with Git Large File Storage:

git lfs install
git clone https://huggingface.co/google/gemma-4-31B-it
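A full clone fetches every file in the repository. If only the weights and configuration are needed, one alternative is `snapshot_download` from `huggingface_hub` with `allow_patterns`. The sketch below is illustrative: the pattern list is an assumption and a given repo may need other files (a tokenizer model file, for example), and `matches_any` is a hypothetical helper that only previews locally which names the glob patterns would select.

```python
from fnmatch import fnmatch

# Assumed patterns: weight shards plus the small JSON config/tokenizer files.
WEIGHT_PATTERNS = ["*.safetensors", "*.json"]


def matches_any(filename, patterns=WEIGHT_PATTERNS):
    """Local preview of whether a file name matches any glob pattern."""
    return any(fnmatch(filename, p) for p in patterns)


def download_weights(repo_id="google/gemma-4-31B-it", patterns=WEIGHT_PATTERNS):
    """Download only the matching files and return the local folder path."""
    # Imported lazily so matches_any works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, allow_patterns=patterns)
```

Unlike `git clone`, this stores files in the Hugging Face cache and does not create a Git working tree.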

Ollama CLI

Pull models directly into Ollama:

# Pull any variant
ollama pull gemma4:e2b
ollama pull gemma4:e4b
ollama pull gemma4:26b
ollama pull gemma4:31b
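To pull several variants in one go from a script, the commands above can be wrapped with `subprocess`. This is a small sketch assuming the `ollama` binary is on the PATH. The tags are copied from the commands above, and the function names are hypothetical.

```python
import subprocess

# Tags copied from the pull commands above.
GEMMA4_TAGS = ["gemma4:e2b", "gemma4:e4b", "gemma4:26b", "gemma4:31b"]


def pull_command(tag):
    """Return the argument list for pulling one tag."""
    return ["ollama", "pull", tag]


def pull_all(tags=GEMMA4_TAGS):
    """Pull each tag in turn; raises CalledProcessError on the first failure."""
    for tag in tags:
        subprocess.run(pull_command(tag), check=True)
```

Passing a shorter list, such as `pull_all(["gemma4:e2b"])`, limits the download to the variants that fit your disk.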

Download FAQ

Where is the best place to download Gemma 4?

Hugging Face is the most comprehensive source with all formats and variants. For one-command local setup, use Ollama. For users in China, ModelScope offers faster download speeds.

What format should I download?

For Ollama or llama.cpp, download GGUF files. For Python or vLLM, use SafeTensors. For a Mac with Apple Silicon, use MLX. If unsure, start with Ollama, which handles format selection automatically.

How large are Gemma 4 model files?

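Sizes depend on the format. In the GGUF size table above, the BF16 files are 9.31 GB for E2B and 15.1 GB for E4B. No BF16 GGUF size is listed for the two larger models; at roughly 2 bytes per parameter, 16-bit weights come to about 54 GB for 26B-A4B (27B total parameters) and about 66 GB for 31B (33B total parameters). Q4_K_M files are roughly a third of the 16-bit size or less: 3.11 GB, 4.98 GB, 16.9 GB, and 18.3 GB respectively. Ollama's default downloads are already quantized.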

Do I need a Hugging Face account to download?

No. Gemma 4 models are publicly accessible under the Apache 2.0 license, and both browser downloads and the Hugging Face CLI work without an account. Logging in with an access token is optional and may raise your rate limits.

What is a GGUF file?

GGUF (GPT-Generated Unified Format) is a binary format designed for efficient local inference with llama.cpp and Ollama. It supports various quantization levels, allowing you to trade accuracy for smaller file sizes and lower memory usage.

Can I download Gemma 4 in China?

Yes. ModelScope (魔搭社区) mirrors Gemma 4 models with fast download speeds within China. Alternatively, use a mirror or proxy for Hugging Face downloads.

Download and Deploy

Get Gemma 4 model weights and start deploying. Check our deployment guide for step-by-step setup instructions.
