Run Gemma 4 with MLX on Mac

MLX is Apple's machine learning framework, purpose-built for Apple Silicon (M1, M2, M3, M4). It uses the unified memory architecture of Apple chips to deliver strong inference performance, and for models that fit in memory it can rival discrete-GPU setups.

Gemma 4 works well with MLX, which makes any Apple Silicon Mac a capable AI workstation. This guide covers installation, running the Gemma 4 variants, and optimizing performance on your Mac.

Why MLX for Gemma 4?

Unified Memory Advantage

Apple Silicon's unified memory architecture means there is no separate pool of GPU VRAM: the CPU and GPU share system memory, and most of it is available to the model. A Mac with 64GB RAM can load and run models that would require a $1,500+ GPU on a PC.

Native Optimization

MLX is built by Apple specifically for Apple Silicon, using Metal compute shaders and optimized memory access patterns. It consistently delivers better tokens-per-second than generic CPU inference.

Simple Setup

Install with pip, download a model, and start generating. No CUDA drivers, no Docker containers, no complex environment setup required.

Energy Efficiency

Apple Silicon's efficiency means you can run Gemma 4 for hours on battery. Ideal for developers who want local AI without being tethered to a wall outlet.

Installation

Install mlx-lm, the MLX package for running and fine-tuning language models:

pip install mlx-lm

# Verify installation
python -c "import mlx_lm; print('MLX-LM ready')"
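If the install or the import fails, one common cause is that Python is not running natively on Apple Silicon. The following is a minimal sketch of a platform check using only the standard library; the function names are illustrative and not part of mlx-lm.

```python
import platform


def is_apple_silicon() -> bool:
    """Return True when this Python process runs natively on an Apple Silicon Mac."""
    # platform.machine() reports "arm64" for native Apple Silicon Python builds.
    # A Python running under Rosetta reports "x86_64" even on an M-series Mac.
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def describe_platform() -> str:
    """Summarize the platform in one line for troubleshooting."""
    status = "looks compatible" if is_apple_silicon() else "not native Apple Silicon"
    return f"{platform.system()} {platform.machine()}: {status}"


print(describe_platform())
```

If this reports x86_64 on an M-series Mac, the interpreter is probably an Intel build running under Rosetta, and installing a native arm64 Python is the likely fix.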

Running Gemma 4 Models

Which Gemma 4 models run well on which Macs:

MacBook Air/Pro (8GB)

E2B, E4B (INT4)

E4B at INT4 quantization fits comfortably. Leave room for OS and apps.

MacBook Pro (16-18GB)

E2B, E4B, 26B MoE (INT4)

26B MoE at INT4 (~16GB) is at the edge of this tier: it leaves little or no headroom once macOS takes its share, so expect memory pressure. E4B is the sweet spot.

MacBook Pro / Mac Studio (36-48GB)

All models (INT4/INT8)

Comfortable for 26B at INT8. 31B at INT4 fits with room to spare.

Mac Studio / Mac Pro (64-192GB)

All models (all precisions)

Can run 31B at FP16 on configurations with enough memory for roughly 62GB of weights plus macOS overhead. The ultimate Gemma 4 workstation.
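The tiers above follow from simple arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by eight. The sketch below works that out for the 31B model. It is a back-of-the-envelope estimate, not a measurement: real quantized files are somewhat larger because they also store quantization metadata, and the KV cache and activations come on top.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Weights-only estimate for the 31B model at each precision named in this guide.
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"31B at {label}: ~{weight_memory_gb(31, bits):.1f} GB of weights")
```

The same function can be applied to any other variant once you know its total parameter count.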

MLX Commands

Text Generation

mlx_lm.generate \
  --model mlx-community/gemma-4-e4b-it-4bit \
  --prompt "Explain quantum computing in simple terms" \
  --max-tokens 512
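The same generation can be driven from Python. The sketch below uses mlx-lm's load and generate functions together with the tokenizer's chat template; the helper names build_chat_messages and generate_reply are illustrative, and the model name is simply the one used in the command above.

```python
def build_chat_messages(user_prompt: str) -> list[dict]:
    """Wrap a plain prompt in the message structure that chat templates expect."""
    return [{"role": "user", "content": user_prompt}]


def generate_reply(
    user_prompt: str,
    model_id: str = "mlx-community/gemma-4-e4b-it-4bit",  # name taken from the command above
    max_tokens: int = 512,
) -> str:
    """Load the model and return one completion. Requires mlx-lm on Apple Silicon."""
    # Imported lazily so build_chat_messages stays usable on machines without MLX.
    from mlx_lm import generate, load

    model, tokenizer = load(model_id)
    # Instruction-tuned models expect the prompt to be formatted by the chat template.
    prompt = tokenizer.apply_chat_template(
        build_chat_messages(user_prompt), add_generation_prompt=True
    )
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
```

Calling generate_reply("Explain quantum computing in simple terms") should return the generated text as a string. Each call reloads the model, so for repeated use you would load once and reuse the model and tokenizer.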

Interactive Chat

mlx_lm.chat --model mlx-community/gemma-4-e4b-it-4bit

Start API Server

mlx_lm.server \
  --model mlx-community/gemma-4-e4b-it-4bit \
  --port 8080

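# Then call the OpenAI-compatible API. The "model" value below assumes the
# server expects the same model name it was started with:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/gemma-4-e4b-it-4bit", "messages": [{"role": "user", "content": "Hello!"}]}'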
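For scripts, the same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the server above is running on port 8080 and returns the usual OpenAI-style response shape (choices, then message, then content). The function names are illustrative, and reusing the server's model name in the request is an assumption.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str, max_tokens: int = 256) -> dict:
    """Build the JSON payload for an OpenAI-style chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(
    prompt: str,
    model: str = "mlx-community/gemma-4-e4b-it-4bit",  # assumed to match the served model
    base_url: str = "http://localhost:8080",
) -> str:
    """POST one prompt to the local server and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Hello!") should return the model's reply. Because the endpoint follows the OpenAI format, existing OpenAI client libraries pointed at the local base URL are another option.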

Quantize a Model

mlx_lm.convert \
  --hf-path google/gemma-4-e4b-it \
  --mlx-path ./gemma-4-e4b-4bit \
  -q --q-bits 4

Performance Tips

Close Memory-Heavy Apps

Safari, Chrome, and Docker can consume significant RAM. Close them before running larger models to maximize available memory for MLX.

Use Quantized Models

Use INT4 or INT8 quantized models on machines with ≤32GB RAM. Relative to FP16, INT8 roughly halves weight memory and INT4 cuts it to about a quarter, usually with only a small loss in quality.

Adjust Context Length

Longer context windows consume more memory. If you're running low on RAM, reduce the max context length to free up memory for the model weights.
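To see why context length matters, it helps to estimate the key-value cache, which grows linearly with the number of tokens held in context. The sketch below uses the standard formula for a plain attention cache. The layer count, head count, and head dimension are hypothetical placeholders, not Gemma 4's actual configuration, and models that use sliding-window attention on some layers need less than this estimate.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Estimate KV cache size for a standard attention stack (FP16 by default)."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical architecture values for illustration only, NOT Gemma 4's real config.
HYPOTHETICAL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for context in (2048, 8192, 32768):
    gib = kv_cache_bytes(context_tokens=context, **HYPOTHETICAL) / 2**30
    print(f"{context:>6} tokens: {gib:.2f} GiB of KV cache")
```

Recent mlx-lm releases appear to expose a --max-kv-size option on mlx_lm.generate for capping the cache; check mlx_lm.generate --help to confirm it is available in your version.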

Monitor Memory Pressure

Use Activity Monitor to watch memory pressure. If the graph turns yellow or red, the system is compressing memory or swapping to disk and inference will slow dramatically. Consider a smaller model or more aggressive quantization.

MLX + Gemma 4 FAQ

Can I run Gemma 4 on an Intel Mac?

MLX requires Apple Silicon (M1 or later). For Intel Macs, use Ollama or llama.cpp instead, which support CPU inference on any Mac.

How fast is Gemma 4 on MLX?

Performance varies by model and hardware: E4B on M3 Pro achieves ~30-40 tokens/second. 26B MoE on M3 Max gets ~15-20 tok/s. 31B on M2 Ultra delivers ~10-15 tok/s. These speeds are excellent for interactive use.

MLX vs Ollama on Mac — which is better?

Both are excellent on Mac. Ollama is simpler (one-command setup) and runs its API server by default. MLX offers more control, better memory efficiency, and often slightly faster inference, and mlx-lm can also serve an OpenAI-compatible API when started with mlx_lm.server. For most users, start with Ollama; switch to MLX for maximum performance.

Where do MLX models come from?

MLX-format models are available on Hugging Face, often uploaded by the mlx-community organization. You can also convert SafeTensors models with an architecture that mlx-lm supports to MLX format using mlx-lm's conversion tools.

Can I fine-tune Gemma 4 with MLX?

Yes. mlx-lm supports LoRA fine-tuning on Apple Silicon. This lets you customize Gemma 4 for your domain directly on your Mac without needing a separate GPU server.
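LoRA training needs a dataset on disk. The sketch below writes a tiny chat-format dataset as train.jsonl and valid.jsonl, which is the layout mlx-lm's LoRA tooling is understood to read from a data directory. The directory name, the helper function, and the example pairs are placeholders, so check the mlx-lm documentation for the formats your version accepts.

```python
import json
from pathlib import Path


def write_chat_jsonl(path: Path, examples: list[tuple[str, str]]) -> int:
    """Write (question, answer) pairs as one chat-format JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for question, answer in examples:
            record = {
                "messages": [
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": answer},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(examples)


# Placeholder examples; replace with your own domain data.
examples = [
    ("What does MLX run on?", "Apple Silicon Macs."),
    ("Which package runs language models with MLX?", "mlx-lm."),
]

data_dir = Path("lora_data")  # hypothetical directory name
data_dir.mkdir(exist_ok=True)
write_chat_jsonl(data_dir / "train.jsonl", examples)
write_chat_jsonl(data_dir / "valid.jsonl", examples[:1])
```

With files like these in place, training is typically started with a command along the lines of mlx_lm.lora --model <model> --train --data lora_data; run mlx_lm.lora --help for the options in your installed version.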

How much RAM do I need for Gemma 4 on Mac?

Minimum 8GB for E4B at INT4. 16GB for comfortable E4B use and, with little headroom, 26B at INT4. 36-48GB for 31B at INT4. For 31B at FP16 the weights alone are about 62GB, so plan for more than 64GB. Remember that macOS itself uses 3-5GB, so plan accordingly.
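A quick way to apply this budget to your own machine is to read the installed RAM and subtract a reserve for macOS. The sketch below does that with the standard library. The 5GB reserve is the upper end of the range quoted above, the model sizes are weights-only estimates for the 31B model (parameters times bits per weight, divided by eight), and the function names are illustrative.

```python
import os


def total_ram_gb() -> float:
    """Physical memory in GB, read via POSIX sysconf."""
    # These sysconf names are assumed to be available on current macOS and Linux.
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1e9


def fits_in_ram(model_gb: float, ram_gb: float, os_reserve_gb: float = 5.0) -> bool:
    """True when the model fits after setting aside memory for the OS."""
    return model_gb <= ram_gb - os_reserve_gb


ram = total_ram_gb()
for name, size_gb in [("31B INT4 (~15.5 GB)", 15.5), ("31B FP16 (~62 GB)", 62.0)]:
    verdict = "fits" if fits_in_ram(size_gb, ram) else "does not fit"
    print(f"{name}: {verdict} in {ram:.0f} GB with 5 GB reserved")
```

This is deliberately conservative about the OS but ignores the KV cache and other running apps, so treat a borderline "fits" as a warning rather than a guarantee.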

Start Running Gemma 4 on Your Mac

Your Mac is ready for AI. Install MLX, download Gemma 4, and start generating.
