Gemma 4

Deployment Guide

Run Gemma 4 locally on your own hardware. Multiple deployment options from one-click installers to production-grade serving frameworks.

Ollama

The simplest way to run Gemma 4 locally. One command to download and serve any variant with automatic hardware optimization.

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh

Run Model

# Gemma 4 31B (Dense) - strongest performance
ollama run gemma4:31b

# Gemma 4 26B (MoE) - efficiency-focused
ollama run gemma4:26b

# Gemma 4 E4B - mobile/lightweight
ollama run gemma4:e4b

# Gemma 4 E2B - edge devices
ollama run gemma4:e2b
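Beyond the interactive CLI, a running Ollama instance also exposes a local REST API on port 11434, which is handy for scripting. A minimal sketch using only the standard library (the model tag matches the commands above; the prompt text is just an example):

```python
import json
import urllib.request
import urllib.error

# Request payload for Ollama's local REST API (default port 11434).
# "gemma4:31b" is the tag pulled above; swap in any variant you have.
payload = {
    "model": "gemma4:31b",
    "prompt": "Explain mixture-of-experts in one sentence.",
    "stream": False,  # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
except urllib.error.URLError:
    # No local Ollama server; start one with `ollama serve`.
    print("Ollama server not reachable on localhost:11434")
```

With `stream` left at its default of true, the endpoint instead emits one JSON object per generated token.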

LM Studio

Desktop application with a visual interface for downloading, configuring, and chatting with Gemma 4 models. Great for beginners.

  1. Download LM Studio from lmstudio.ai
  2. Search for "Gemma 4" in the model browser
  3. Select a quantized version matching your VRAM
  4. Click Download and wait for completion
  5. Start chatting in the built-in interface

vLLM

High-throughput production serving engine with PagedAttention, continuous batching, and OpenAI-compatible API endpoints.

pip install vllm
vllm serve google/gemma-4-31b --max-model-len 32768
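Once `vllm serve` is up, it exposes an OpenAI-compatible API on port 8000 by default, so any OpenAI client works against it. A standard-library sketch of a chat completion request (the model name must match whatever was passed to `vllm serve`):

```python
import json
import urllib.request
import urllib.error

# vLLM's OpenAI-compatible endpoint (default port 8000).
payload = {
    "model": "google/gemma-4-31b",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["message"]["content"])
except urllib.error.URLError:
    print("vLLM server not reachable on localhost:8000")
```

Because the endpoint follows the OpenAI schema, you can also point the official `openai` Python client at `base_url="http://localhost:8000/v1"` instead of hand-rolling requests.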

llama.cpp

Optimized C++ inference engine supporting GGUF quantized models. Run Gemma 4 on CPU or mixed CPU/GPU configurations.

# Build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build

# Run with GGUF model
./build/bin/llama-cli -m gemma-4-31b-Q4_K_M.gguf -p "Hello"

MLX

Apple's machine-learning framework, built natively for Apple Silicon. Optimized for M-series chips with unified memory, delivering excellent performance on Mac hardware.

pip install mlx-lm
mlx_lm.generate --model google/gemma-4-31b --prompt "Hello"

VRAM Requirements

Estimated VRAM usage for each model variant at different quantization levels.

Model     | BF16  | INT8   | INT4
E2B       | 4 GB  | 2.5 GB | 1.5 GB
E4B       | 8 GB  | 5 GB   | 3 GB
26B MoE   | 52 GB | 28 GB  | 16 GB
31B Dense | 62 GB | 33 GB  | 18 GB
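The BF16 column follows directly from a weights-only rule of thumb: memory in GB ≈ parameters (billions) × bits per weight ÷ 8. A minimal sketch of that arithmetic (the INT8/INT4 figures in the table run somewhat above the raw result because of KV cache, activations, and quantization scales):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Raw weight storage in GB: params (billions) * bits per weight / 8."""
    return params_billion * bits_per_weight / 8

# 31B dense at BF16 (16 bits/weight): 31 * 16 / 8 = 62 GB, matching the table.
print(weight_memory_gb(31, 16))  # 62.0

# At INT8 the raw weights are 31 GB; the table's 33 GB includes runtime overhead.
print(weight_memory_gb(31, 8))   # 31.0
```

The same formula reproduces the other BF16 entries (e.g. 2B × 16 ÷ 8 = 4 GB for E2B), so treat the lower-precision columns as the formula's result plus a few GB of headroom.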

Download Models

Get Gemma 4 model weights from official sources.