Gemma 4 Hardware Requirements

One of Gemma 4's greatest strengths is its range of model sizes, from the ultra-compact E2B that runs on a smartphone to the flagship 31B that requires a high-end GPU. This guide breaks down the approximate hardware requirements for each variant so you can choose the right model for your setup.

Hardware needs depend on three factors: model variant, quantization level, and context length. Lower-precision quantization (for example, INT4 instead of FP16) and shorter context reduce requirements significantly, making Gemma 4 accessible on a wide range of hardware.

Quick Reference: Minimum Requirements

Model	Parameters	VRAM (FP16)	VRAM (INT8)	VRAM (INT4)	Disk Space
E2B	2B	4 GB	2.5 GB	1.5 GB	~1.5–4 GB
E4B	4B	8 GB	5 GB	3 GB	~3–8 GB
26B MoE	26B	52 GB	28 GB	16 GB	~15–52 GB
31B Dense	31B	62 GB	33 GB	18 GB	~18–62 GB
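The FP16 column follows directly from the parameter count: each parameter takes 2 bytes at 16 bits. The sketch below is a rough illustration of that arithmetic, assuming 1 GB = 10^9 bytes; the function name `weight_memory_gb` is hypothetical. The table's INT8 and INT4 figures sit somewhat above the raw numbers this produces. The guide does not say why, so treat the raw figure as a lower bound rather than a replacement for the table.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Raw weight storage in GB: parameter count times bytes per parameter."""
    return params_billion * bits_per_weight / 8


# Parameter counts (in billions) from the quick reference table.
MODELS = {"E2B": 2, "E4B": 4, "26B MoE": 26, "31B Dense": 31}

for name, params in MODELS.items():
    fp16 = weight_memory_gb(params, 16)
    int8 = weight_memory_gb(params, 8)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: FP16 {fp16:.1f} GB, INT8 {int8:.1f} GB, INT4 {int4:.1f} GB (raw weights only)")
```

For the 31B Dense model this gives 62 GB at FP16, matching the table, and 15.5 GB at INT4 against the table's 18 GB.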

Detailed Requirements by Model

Gemma 4 E2B — Edge & Mobile

VRAM: 1.5–4 GB
RAM: 4 GB system RAM minimum
Disk: ~1.5 GB (quantized) / ~4 GB (FP16)
GPU: No dedicated GPU required. Runs on CPU, mobile NPU, or integrated GPU.
Devices: Smartphones (iOS/Android), Raspberry Pi 5, tablets, edge appliances

The E2B model is designed specifically for resource-constrained environments. It runs efficiently on mobile NPUs and even CPU-only configurations. Ideal for on-device inference where privacy and latency are priorities.

Gemma 4 E4B — Laptop & Desktop

VRAM: 3–8 GB
RAM: 8 GB system RAM minimum
Disk: ~3 GB (quantized) / ~8 GB (FP16)
GPU: Any GPU with 4GB+ VRAM, or CPU-only with sufficient RAM
Devices: Laptops, desktops, Mac with Apple Silicon (M1+), low-end cloud instances

The sweet spot for most personal use. A quantized build runs well on a MacBook Air M1 with 8GB unified memory. On Windows/Linux, an RTX 3060 (12GB) handles it easily. CPU inference is feasible but slower.

Gemma 4 26B A4B (MoE) — Desktop GPU

VRAM: 16–52 GB
RAM: 32 GB system RAM recommended
Disk: ~15 GB (quantized) / ~52 GB (FP16)
GPU: RTX 4090 (24GB), RTX A5000, A100 (40/80GB), or Apple M2 Ultra+
Devices: High-end desktops, workstations, cloud GPU instances (A100, L4, H100)

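Despite having 26B total parameters, the MoE architecture activates only about 4B parameters per token, which reduces compute per token. All 26B parameters must still be held in memory, so VRAM needs follow the total size rather than the active size. INT4 quantization brings VRAM usage to ~16GB, making it accessible on an RTX 4090. For FP16 (~52GB), you'll need an 80GB-class GPU or a multi-GPU setup.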

Gemma 4 31B Dense — Workstation & Server

VRAM: 18–62 GB
RAM: 64 GB system RAM recommended
Disk: ~18 GB (quantized) / ~62 GB (FP16)
GPU: RTX 4090 (24GB for INT4), A100 (40/80GB), H100, or Apple M2 Ultra+
Devices: Workstations, servers, cloud GPU instances, multi-GPU setups

The flagship model requires serious hardware for full precision but is accessible at INT4 quantization on a single RTX 4090. For production serving at scale, A100 or H100 GPUs are recommended. Apple Silicon Macs with 64GB+ unified memory can run it via MLX.

Recommended GPUs

Which GPU should you get for Gemma 4?

NVIDIA RTX 4060 (8GB)

E2B, E4B

Entry-level for Gemma 4. Handles E4B at INT4 comfortably.

NVIDIA RTX 4070 Ti Super (16GB)

E2B, E4B, 26B (INT4)

Can run the 26B MoE model at INT4 quantization, though its ~16 GB footprint leaves little or no headroom for context on a 16GB card.

NVIDIA RTX 4090 (24GB)

All models (quantized)

The sweet spot. Runs all models at INT4. INT8 does not fit for the larger models: the 26B needs ~28 GB and the 31B ~33 GB, both above this card's 24 GB.

NVIDIA A100 (40/80GB)

All models (all precisions)

Professional/cloud GPU. Full FP16 for all models on 80GB variant.

Apple M3 Max (36/48GB)

E2B, E4B, 26B (INT4/INT8)

Unified memory. Great with MLX framework.

Apple M2/M3 Ultra (64-192GB)

All models (all precisions)

Massive unified memory handles even 31B at FP16 (~62 GB) on configurations with comfortably more than 64GB.

Context Length Impact on Memory

Longer context windows require additional memory beyond the model weights. The KV cache grows linearly with context length:

Context	E4B	26B MoE	31B Dense
8K	+0.2 GB	+0.5 GB	+0.6 GB
32K	+0.8 GB	+2.0 GB	+2.4 GB
128K	+3.2 GB	+8.0 GB	+9.6 GB
256K	N/A	+16 GB	+19.2 GB

These are approximate additional VRAM requirements on top of the base model. Actual usage depends on batch size and implementation.
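Because the table scales linearly, the figures for other context lengths can be estimated from the 8K row. The sketch below does exactly that and nothing more; the names `KV_GB_PER_8K`, `MAX_CONTEXT_K`, and `kv_cache_gb` are hypothetical, and the numbers are the approximate table values, not measurements.

```python
# Approximate KV cache cost per 8K tokens, in GB (the 8K row of the table above).
KV_GB_PER_8K = {"E4B": 0.2, "26B MoE": 0.5, "31B Dense": 0.6}

# Longest context the table lists for each model, in K tokens (E4B is N/A at 256K).
MAX_CONTEXT_K = {"E4B": 128, "26B MoE": 256, "31B Dense": 256}


def kv_cache_gb(model: str, context_k: float) -> float:
    """Estimate extra VRAM for the KV cache by scaling the 8K figure linearly."""
    if context_k > MAX_CONTEXT_K[model]:
        raise ValueError(f"{model} is not listed beyond {MAX_CONTEXT_K[model]}K context")
    return KV_GB_PER_8K[model] * context_k / 8


for context_k in (8, 32, 64, 128):
    extra = kv_cache_gb("31B Dense", context_k)
    print(f"31B Dense at {context_k}K context: +{extra:.1f} GB")
```

This is useful for in-between values such as 64K, which the table does not list. Batch size and implementation details can move the real number in either direction.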

Hardware FAQ

Can I run Gemma 4 without a GPU?

Yes. All Gemma 4 variants support CPU inference via Ollama or llama.cpp. E2B and E4B run at reasonable speeds on modern CPUs. Larger models will be slow but functional. Ensure sufficient system RAM — roughly 2x the model file size.

How much VRAM do I need for Gemma 4?

At INT4 quantization: E2B needs ~1.5GB, E4B ~3GB, 26B MoE ~16GB, 31B Dense ~18GB. At FP16 (full precision): E2B ~4GB, E4B ~8GB, 26B ~52GB, 31B ~62GB. Most users should use INT4 or INT8 quantization.

Can I run Gemma 4 31B on an RTX 4090?

Yes, at INT4 quantization (~18GB VRAM). The RTX 4090's 24GB is sufficient for this. For higher precision, you'll need more VRAM — consider A100 80GB or multi-GPU setups.
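Context length matters here too, since the KV cache is added on top of the ~18GB of weights. The following sketch combines the two figures from this guide's tables to check what fits in 24GB. The 1 GB `HEADROOM_GB` allowance is an assumption for illustration, not a measured value, and the function names are hypothetical.

```python
WEIGHTS_GB = 18.0    # 31B Dense at INT4 (quick reference table)
KV_GB_PER_8K = 0.6   # 31B Dense KV cache per 8K tokens (context table)
GPU_VRAM_GB = 24.0   # RTX 4090
HEADROOM_GB = 1.0    # assumed allowance for activations and runtime overhead


def total_vram_gb(context_k: float) -> float:
    """Weights plus linearly scaled KV cache, in GB."""
    return WEIGHTS_GB + KV_GB_PER_8K * context_k / 8


def fits(context_k: float) -> bool:
    """True if weights, KV cache, and the assumed headroom fit in GPU memory."""
    return total_vram_gb(context_k) + HEADROOM_GB <= GPU_VRAM_GB


for context_k in (8, 32, 128):
    verdict = "fits" if fits(context_k) else "does not fit"
    print(f"{context_k}K context: ~{total_vram_gb(context_k):.1f} GB, {verdict}")
```

Under these assumptions, 8K and 32K contexts fit on the card, while 128K (about 27.6 GB in total) does not.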

What about Mac with Apple Silicon?

Apple Silicon Macs with unified memory are excellent for Gemma 4. An M1/M2 with 16GB runs E4B well. M3 Max (36-48GB) handles the 26B MoE. M2/M3 Ultra (64GB+) can run the 31B model. Use MLX or Ollama for best performance.

Does quantization affect quality?

As a rough guide, INT8 quantization typically preserves 98-99% of quality and INT4 preserves 93-95%, though the exact loss varies by task and quantization method. For most practical use cases, INT4 is perfectly acceptable. FP16 mainly benefits research or evaluation tasks that require exact reproducibility.

Can I split Gemma 4 across multiple GPUs?

Yes. vLLM supports tensor parallelism, and llama.cpp and other frameworks can also split a model across multiple GPUs. This lets you run the 31B model at higher precision, for example INT8 (~33GB), by splitting it across 2x RTX 4090s (48GB total) or similar configurations. FP16 (~62GB) needs more than 48GB of combined VRAM.
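To see which precisions a given multi-GPU setup can hold, compare the pooled VRAM with the 31B figures from the quick reference table. This is a simplified sketch: it assumes the weights split evenly and ignores KV cache and per-GPU overhead, so real capacity will be lower. The function name `precisions_that_fit` is hypothetical.

```python
# 31B Dense base VRAM by precision, in GB (quick reference table).
VRAM_31B_GB = {"FP16": 62, "INT8": 33, "INT4": 18}


def precisions_that_fit(num_gpus: int, vram_per_gpu_gb: float) -> list[str]:
    """Precisions whose base VRAM fits in the combined memory of all GPUs."""
    pooled_gb = num_gpus * vram_per_gpu_gb
    return [name for name, gb in VRAM_31B_GB.items() if gb <= pooled_gb]


print("1x 24GB:", precisions_that_fit(1, 24))
print("2x 24GB:", precisions_that_fit(2, 24))
print("1x 80GB:", precisions_that_fit(1, 80))
```

With two 24GB cards the pooled 48GB covers INT8 and INT4 but not the 62 GB needed for FP16.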
