Gemma 4

Run Gemma 4 with Ollama

Ollama is one of the fastest and simplest ways to run Gemma 4 on your own hardware. With a single command, you can download and start chatting with any Gemma 4 model variant — no Python environment, no complex setup, no manual GPU configuration required.

Ollama automatically detects your hardware (CPU, GPU, memory) and optimizes the model configuration for best performance. It supports macOS, Linux, and Windows, and provides an OpenAI-compatible API for easy integration into your applications.

Step 1: Install Ollama

macOS

Download from ollama.com or install via Homebrew:

# Homebrew
brew install ollama

# Or download from https://ollama.com/download/mac

Linux

One-line install script:

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com or use winget:

# winget
winget install Ollama.Ollama

# Or download from https://ollama.com/download/windows

Verify installation:

ollama --version

Step 2: Choose Your Gemma 4 Model

All Gemma 4 variants are available in the Ollama library. Choose based on your hardware and needs:

gemma4:e2b (~1.5 GB download, 2 GB VRAM)
Ultra-lightweight for edge devices and basic tasks

gemma4:e4b (~3 GB download, 4 GB VRAM)
Best balance of quality and resource usage

gemma4:26b (~15 GB download, 16 GB VRAM)
MoE architecture — large model quality at small model cost

gemma4:31b (~18 GB download, 24 GB VRAM)
Maximum quality — flagship dense model
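When scripting deployments, the table above reduces to a few lines of Python. This is a toy helper of ours (the function and its thresholds are not part of Ollama — the GB figures are simply the VRAM requirements quoted above):

```python
# VRAM thresholds (GB) as quoted in the table above -- these are the
# figures from this guide, not values reported by Ollama itself.
VARIANTS = [
    ("gemma4:e2b", 2),
    ("gemma4:e4b", 4),
    ("gemma4:26b", 16),
    ("gemma4:31b", 24),
]

def pick_variant(vram_gb: float) -> str:
    """Return the largest Gemma 4 tag whose quoted VRAM requirement
    fits in vram_gb, falling back to the smallest variant."""
    fitting = [tag for tag, need in VARIANTS if need <= vram_gb]
    return fitting[-1] if fitting else VARIANTS[0][0]
```

For example, pick_variant(16) returns "gemma4:26b".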

Step 3: Run Gemma 4

Start an interactive chat session:

# Start interactive chat with Gemma 4 E4B
ollama run gemma4:e4b

# Or the flagship 31B model
ollama run gemma4:31b

Run a single prompt:

ollama run gemma4:e4b "Explain quantum computing in simple terms"

Use with images (multimodal):

# In an interactive chat, include an image file path in your prompt
ollama run gemma4:e4b
>>> What do you see in this image? ./photo.jpg
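Over the REST API, images are passed as base64-encoded strings in an "images" array on the /api/generate endpoint. A minimal payload-building sketch in Python (the model tag is the one used throughout this guide; "stream": false asks for a single response object):

```python
import base64
import json

def build_vision_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    """Build a JSON body for Ollama's /api/generate endpoint.

    Images go in the "images" list as base64 strings, alongside the
    plain-text prompt.
    """
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one JSON object instead of a stream
    })
```

POST the resulting string to http://localhost:11434/api/generate with Content-Type: application/json, passing the raw bytes of your image file (e.g. open("photo.jpg", "rb").read()).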

Using the Ollama API

Ollama provides an OpenAI-compatible REST API at localhost:11434, making it easy to integrate Gemma 4 into your applications:

Chat completion:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:e4b",
    "messages": [
      {"role": "user", "content": "Hello, Gemma 4!"}
    ]
  }'
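The same chat request can be issued from Python with nothing but the standard library. A minimal sketch, assuming Ollama is serving on the default port (the official openai client also works if you point its base_url at http://localhost:11434/v1):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_payload(model: str, user_content: str) -> dict:
    """Assemble an OpenAI-style chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_content}],
    }

def chat(model: str, user_content: str) -> str:
    """POST the request to a locally running Ollama server and return
    the assistant's reply. Requires `ollama serve` to be running."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_chat_payload(model, user_content)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling chat("gemma4:e4b", "Hello, Gemma 4!") performs a real HTTP request, so make sure the Ollama service is up first.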

Text generation:

curl http://localhost:11434/api/generate \
  -d '{
    "model": "gemma4:e4b",
    "prompt": "Write a Python function to sort a list"
  }'
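Note that /api/generate streams its answer by default: one JSON object per line, each carrying a "response" text fragment, with "done": true on the last. A small helper of ours to reassemble the text from those lines (a sketch of the documented stream shape, not part of any Ollama client):

```python
import json

def join_stream(lines):
    """Reassemble the generated text from /api/generate's streaming
    output: newline-delimited JSON objects, each with a "response"
    fragment, ending with an object where "done" is true."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

You can feed it the lines of the open HTTP response directly, or pass "stream": false in the request body to receive a single JSON object instead.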

Advanced Configuration

Custom Modelfile

Create a custom Modelfile to adjust model parameters like temperature, context length, and the system prompt:

FROM gemma4:e4b

PARAMETER temperature 0.7
PARAMETER num_ctx 32768

SYSTEM """
You are a helpful coding assistant. Always provide code examples.
"""

Then build a named model from it (the name is up to you) and run it:

ollama create gemma4-coder -f Modelfile
ollama run gemma4-coder

GPU Configuration

Ollama auto-detects GPUs, but you can control how many layers are offloaded with the num_gpu parameter, set either in a Modelfile or in an interactive session:

# In an interactive session: offload 35 layers to the GPU
>>> /set parameter num_gpu 35

# CPU-only mode
>>> /set parameter num_gpu 0

Context Length

Increase the default context window for longer conversations, either with a Modelfile PARAMETER or in an interactive session:

# In an interactive session
>>> /set parameter num_ctx 65536
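The context window can also be overridden per request: runtime parameters such as num_ctx (and temperature, num_gpu, and so on) go in the "options" object of an API call. A minimal payload-building sketch:

```python
import json

def build_generate_payload(model: str, prompt: str, num_ctx: int = 65536) -> str:
    """Build an /api/generate body that overrides the context window
    for this request only, via the "options" object."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "options": {"num_ctx": num_ctx},
    })
```

POST the resulting string to http://localhost:11434/api/generate as in the examples above.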

Troubleshooting

Model download is slow

Ollama downloads from ollama.com CDN. If slow, check your internet connection or try a VPN. Large models (26B, 31B) may take 10-30 minutes depending on bandwidth.

Out of memory error

Try a smaller model variant or a quantized version. Use 'ollama run gemma4:e4b' instead of the 31B model. On systems with limited RAM, close other applications before running.

Slow inference speed

Ensure Ollama is using your GPU: check with 'ollama ps'. On Mac, Ollama uses Metal GPU acceleration automatically. On Linux/Windows, ensure NVIDIA or AMD GPU drivers are properly installed.

API connection refused

Make sure the Ollama service is running: 'ollama serve'. The default API endpoint is http://localhost:11434. Check firewall settings if accessing from another machine.
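A quick liveness probe can save debugging time: /api/tags lists the installed models and answers immediately when the server is up. A small stdlib-only sketch (the function name is ours):

```python
import json
import urllib.error
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434",
                 timeout: float = 2.0) -> bool:
    """Return True if an Ollama server answers on base_url.

    /api/tags is a cheap endpoint that lists installed models;
    any connection failure or non-JSON reply counts as "down".
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags",
                                    timeout=timeout) as resp:
            json.load(resp)  # valid JSON means a healthy server
        return True
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

If ollama_is_up() returns False, nothing is listening on port 11434, which usually means 'ollama serve' is not running.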

Ollama + Gemma 4 FAQ

What is the best Gemma 4 model to run with Ollama?

For most users, gemma4:e4b offers the best balance of quality and performance. If you have a GPU with 16GB+ VRAM, gemma4:26b provides near-flagship quality with efficient MoE inference. The gemma4:31b model requires 24GB+ VRAM but delivers maximum performance.

Can I run Gemma 4 on Ollama without a GPU?

Yes. Ollama supports CPU-only inference for all Gemma 4 variants. The E2B and E4B models run reasonably fast on CPU. Larger models will be significantly slower without GPU acceleration but still functional.

How do I update Gemma 4 in Ollama?

Run 'ollama pull gemma4:e4b' (or your preferred variant) to download the latest version. Ollama will only download the differences if you already have a previous version installed.

Can I use Ollama Gemma 4 with other tools?

Yes. Ollama's OpenAI-compatible API works with most AI tools and frameworks including LangChain, LlamaIndex, Open WebUI, Continue.dev, and many others. Just point them to http://localhost:11434.

Does Ollama support Gemma 4 multimodal features?

Yes. Ollama supports Gemma 4's multimodal capabilities. You can pass images by including an image file path in your prompt in the interactive chat, or by sending base64-encoded images in the API's "images" field.

How much disk space does Gemma 4 require in Ollama?

Disk space depends on the variant: E2B (~1.5GB), E4B (~3GB), 26B MoE (~15GB), 31B Dense (~18GB). These are for the default quantization. Models are stored in ~/.ollama/models on macOS/Linux.


Ready to Run Gemma 4?

Install Ollama and start chatting with Gemma 4 in minutes. Or explore other deployment options.