One of Gemma 4's greatest strengths is its range of model sizes — from the ultra-compact E2B that runs on a smartphone to the flagship 31B that requires a high-end GPU. This guide breaks down the exact hardware requirements for each variant so you can choose the right model for your setup.
Hardware needs depend on three factors: model variant, quantization level, and context length. Lower-precision quantization (fewer bits per weight) and shorter context reduce requirements significantly, making Gemma 4 accessible on a wide range of hardware.
| Model | Parameters | VRAM (FP16) | VRAM (INT8) | VRAM (INT4) | Disk Space |
|---|---|---|---|---|---|
| E2B | 2B | 4 GB | 2.5 GB | 1.5 GB | ~1.5–4 GB |
| E4B | 4B | 8 GB | 5 GB | 3 GB | ~3–8 GB |
| 26B MoE | 26B | 52 GB | 28 GB | 16 GB | ~15–52 GB |
| 31B Dense | 31B | 62 GB | 33 GB | 18 GB | ~18–62 GB |
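The FP16 column follows directly from the parameter count, since each parameter takes 2 bytes at 16 bits. The sketch below reproduces that arithmetic. The function name `weight_memory_gb` and the convention 1 GB = 10^9 bytes are assumptions made for illustration. The INT8 and INT4 figures in the table are somewhat higher than this raw estimate, presumably because they include runtime overhead.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Raw memory for the weights alone, in GB (assuming 1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billion * 1e9 params) * bytes_per_weight / 1e9 bytes-per-GB
    return params_billion * bytes_per_weight


# Parameter counts (in billions) as listed in the table above.
MODELS = {"E2B": 2, "E4B": 4, "26B MoE": 26, "31B Dense": 31}

if __name__ == "__main__":
    for name, params in MODELS.items():
        fp16 = weight_memory_gb(params, 16)
        int8 = weight_memory_gb(params, 8)
        int4 = weight_memory_gb(params, 4)
        print(f"{name}: FP16 {fp16} GB, INT8 {int8} GB, INT4 {int4} GB (weights only)")
```

Treat the result as a lower bound: the KV cache, activations, and framework buffers come on top of the weights.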
E2B
VRAM: 1.5–4 GB
RAM: 4 GB system RAM minimum
Disk: ~1.5 GB (quantized) / ~4 GB (FP16)
GPU: No dedicated GPU required. Runs on CPU, mobile NPU, or integrated GPU.
Devices: Smartphones (iOS/Android), Raspberry Pi 5, tablets, edge appliances
The E2B model is designed specifically for resource-constrained environments. It runs efficiently on mobile NPUs and even CPU-only configurations. Ideal for on-device inference where privacy and latency are priorities.
E4B
VRAM: 3–8 GB
RAM: 8 GB system RAM minimum
Disk: ~3 GB (quantized) / ~8 GB (FP16)
GPU: Any GPU with 4GB+ VRAM, or CPU-only with sufficient RAM
Devices: Laptops, desktops, Mac with Apple Silicon (M1+), low-end cloud instances
E4B is the sweet spot for most personal use. A quantized build runs well on a MacBook Air M1 with 8GB unified memory. On Windows/Linux, an RTX 3060 (12GB) handles it easily. CPU inference is feasible but slower.
26B MoE
VRAM: 16–52 GB
RAM: 32 GB system RAM recommended
Disk: ~15 GB (quantized) / ~52 GB (FP16)
GPU: RTX 4090 (24GB), RTX A5000, A100 (40/80GB), or Apple M2 Ultra+
Devices: High-end desktops, workstations, cloud GPU instances (A100, L4, H100)
Although the model has 26B total parameters, the MoE architecture activates only 4B of them per token. That reduces compute per token, not memory: all 26B parameters must still be loaded. INT4 quantization brings VRAM usage to ~16GB, making it accessible on an RTX 4090. For FP16 (~52GB), you'll need an 80GB GPU or a multi-GPU setup.
31B Dense
VRAM: 18–62 GB
RAM: 64 GB system RAM recommended
Disk: ~18 GB (quantized) / ~62 GB (FP16)
GPU: RTX 4090 (24GB for INT4), A100 (40/80GB), H100, or Apple M2 Ultra+
Devices: Workstations, servers, cloud GPU instances, multi-GPU setups
The flagship model requires serious hardware for full precision but is accessible at INT4 quantization on a single RTX 4090. For production serving at scale, A100 or H100 GPUs are recommended. Apple Silicon Macs with 64GB+ unified memory can run it via MLX.
Which GPU should you get for Gemma 4?
Entry-level for Gemma 4. Handles E4B at INT4 comfortably.
Can run the 26B MoE model at INT4 quantization.
The sweet spot. Runs all models at INT4, and 26B at INT8.
Professional/cloud GPU. Full FP16 for all models on 80GB variant.
Unified memory. Great with MLX framework.
Massive unified memory handles even 31B at FP16.
Longer context windows require additional memory beyond the model weights. The KV cache grows linearly with context length:
| Context | E4B | 26B MoE | 31B Dense |
|---|---|---|---|
| 8K | +0.2 GB | +0.5 GB | +0.6 GB |
| 32K | +0.8 GB | +2.0 GB | +2.4 GB |
| 128K | +3.2 GB | +8.0 GB | +9.6 GB |
| 256K | N/A | +16 GB | +19.2 GB |
These are approximate additional VRAM requirements on top of the base model. Actual usage depends on batch size and implementation.
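Because the table scales linearly, intermediate context lengths can be estimated from the 8K row. The following is a minimal sketch under that assumption. The names `KV_GB_AT_8K`, `MAX_CONTEXT_K`, and `kv_cache_gb` are illustrative, and the per-model context limits are read off the table (E4B is listed as N/A at 256K), not taken from an official specification.

```python
# Additional VRAM at 8K context, copied from the table above (GB).
KV_GB_AT_8K = {"E4B": 0.2, "26B MoE": 0.5, "31B Dense": 0.6}

# Largest context (in K tokens) that the table lists for each model.
MAX_CONTEXT_K = {"E4B": 128, "26B MoE": 256, "31B Dense": 256}


def kv_cache_gb(model: str, context_k: int) -> float:
    """Estimate extra VRAM for the KV cache by linear scaling from the 8K row."""
    if context_k > MAX_CONTEXT_K[model]:
        raise ValueError(f"{model} is not listed beyond {MAX_CONTEXT_K[model]}K context")
    return round(KV_GB_AT_8K[model] * context_k / 8, 2)


if __name__ == "__main__":
    # 64K is not in the table; linear scaling places it between the 32K and 128K rows.
    print(kv_cache_gb("26B MoE", 64))
```

As the note above says, batch size and implementation details change the real number, so this is useful for sizing rather than as a guarantee.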
Gemma 4 can run without a GPU: all variants support CPU inference via Ollama or llama.cpp. E2B and E4B run at reasonable speeds on modern CPUs. Larger models will be slow but functional. Ensure sufficient system RAM, roughly 2x the model file size.
At INT4 quantization: E2B needs ~1.5GB, E4B ~3GB, 26B MoE ~16GB, 31B Dense ~18GB. At FP16 (full precision): E2B ~4GB, E4B ~8GB, 26B ~52GB, 31B ~62GB. Most users should use INT4 or INT8 quantization.
The 31B Dense model runs on a single RTX 4090 at INT4 quantization (~18GB VRAM), which fits within the card's 24GB. For higher precision you'll need more VRAM: consider an A100 80GB or a multi-GPU setup.
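A quick way to sanity-check a configuration like this is to add the base VRAM figure to the KV cache overhead for your context length and compare the sum with the card's memory. The sketch below does that with the numbers quoted in this guide. `BASE_VRAM_GB` and `fits` are hypothetical names, and the check ignores activations and framework overhead.

```python
# Base VRAM per model and quantization level, as quoted in this guide (GB).
BASE_VRAM_GB = {
    "E2B": {"FP16": 4, "INT8": 2.5, "INT4": 1.5},
    "E4B": {"FP16": 8, "INT8": 5, "INT4": 3},
    "26B MoE": {"FP16": 52, "INT8": 28, "INT4": 16},
    "31B Dense": {"FP16": 62, "INT8": 33, "INT4": 18},
}


def fits(model: str, quant: str, gpu_vram_gb: float, kv_cache_gb: float = 0.0) -> bool:
    """Return True if model weights plus KV cache fit in the given VRAM."""
    return BASE_VRAM_GB[model][quant] + kv_cache_gb <= gpu_vram_gb


if __name__ == "__main__":
    # 31B Dense at INT4 on a 24 GB card, with the guide's KV cache figures
    # for 32K (+2.4 GB) and 128K (+9.6 GB) context.
    print(fits("31B Dense", "INT4", 24, kv_cache_gb=2.4))
    print(fits("31B Dense", "INT4", 24, kv_cache_gb=9.6))
```

By these figures the 31B model at INT4 leaves about 6GB of headroom on a 24GB card, so very long contexts may not fit even though the weights do.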
Apple Silicon Macs with unified memory are excellent for Gemma 4. An M1/M2 with 16GB runs E4B well. M3 Max (36-48GB) handles the 26B MoE. M2/M3 Ultra (64GB+) can run the 31B model. Use MLX or Ollama for best performance.
INT8 quantization typically preserves 98-99% of quality. INT4 preserves 93-95%. For most practical use cases, INT4 is perfectly acceptable. Only research or evaluation tasks requiring exact reproducibility benefit from FP16.
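Gemma 4 can be split across multiple GPUs. vLLM supports tensor parallelism, llama.cpp can split a model by layers or rows across GPUs, and other frameworks offer similar options. This lets you run the 31B model at higher precision, for example INT8 (~33GB), by splitting it across 2x RTX 4090s (48GB total) or similar configurations.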
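As a rough check on the multi-GPU arithmetic, the sketch below divides a model's VRAM figure evenly across cards. It assumes a perfectly even split and ignores the KV cache, activations, and communication buffers, so real deployments need extra headroom. The function names are illustrative and do not belong to any framework's API.

```python
import math


def per_gpu_gb(total_vram_gb: float, num_gpus: int) -> float:
    """VRAM each GPU must hold if the model is split evenly."""
    return total_vram_gb / num_gpus


def min_gpus(total_vram_gb: float, gpu_vram_gb: float) -> int:
    """Smallest number of identical GPUs whose combined VRAM covers the model."""
    return math.ceil(total_vram_gb / gpu_vram_gb)


if __name__ == "__main__":
    # 31B Dense figures from this guide: INT8 ~33 GB, FP16 ~62 GB; RTX 4090 = 24 GB.
    print(per_gpu_gb(33, 2))   # share per card for INT8 on two cards
    print(min_gpus(33, 24))    # cards needed for INT8
    print(min_gpus(62, 24))    # cards needed for FP16
```

By this estimate two 24GB cards cover the 31B model at INT8, while FP16 would need at least three.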
Now that you know the requirements, set up Gemma 4 on your hardware.