Run Gemma 4 with KoboldCpp

KoboldCpp is a user-friendly, cross-platform inference engine based on llama.cpp with a built-in web interface. It's one of the easiest ways to run Gemma 4 GGUF models locally — especially popular among creative writing, roleplay, and interactive fiction communities.

Unlike command-line tools, KoboldCpp provides a graphical launcher and a browser-based chat UI out of the box. It supports CPU, CUDA (NVIDIA), ROCm (AMD), Vulkan, and Metal (Apple) acceleration, making it work on virtually any hardware.

Step 1: Download KoboldCpp

Get the latest release from GitHub:

Step 2: Get Gemma 4 GGUF Files

Step 3: Launch KoboldCpp

GUI Launcher

Double-click KoboldCpp to open the launcher. Select your GGUF file, configure GPU layers, and click Launch.

Command Line

Or launch from the terminal with more control:

koboldcpp --model gemma-4-e4b-it-Q4_K_M.gguf --gpulayers 33 --contextsize 4096

Recommended Settings

Start with 4096. Increase if you need longer conversations. Higher values use more RAM.

Set to the maximum your GPU can handle. More layers = faster inference. 0 = CPU only.

For CPU inference. Leave 1 core for system overhead.

Default works well. Increase for faster prompt processing if you have RAM to spare.

API Integration

KoboldCpp exposes both the Kobold API and an OpenAI-compatible API. Use with SillyTavern, Agnaistic, or any compatible frontend:

curl http://localhost:5001/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a Python function to sort a list",
    "max_length": 200,
    "temperature": 0.7
  }'

curl http://localhost:5001/api/v1/model

KoboldCpp + Gemma 4 FAQ

What is KoboldCpp?

KoboldCpp is an open-source, cross-platform inference engine with a built-in web UI. It's based on llama.cpp and supports GGUF models. Popular for creative writing, roleplay, and local AI chat.

Which Gemma 4 model works best with KoboldCpp?

For most users, gemma-4-e4b-it-Q4_K_M.gguf (~3GB) offers the best balance. If you have a GPU with 24GB+ VRAM, the 31B Q4 model provides flagship quality.

Can I use KoboldCpp with SillyTavern?

Yes. KoboldCpp is one of the most popular backends for SillyTavern. Connect via the Kobold API at localhost:5001 or the OpenAI-compatible endpoint.

KoboldCpp vs Ollama — which should I use?

Ollama is simpler for quick setup and API-first usage. KoboldCpp excels with its built-in UI, advanced sampler settings, and compatibility with chat frontends like SillyTavern. Choose based on your workflow.

Does KoboldCpp support Gemma 4 multimodal?

KoboldCpp primarily focuses on text generation. For multimodal features (image/video/audio input), use Ollama or vLLM instead.

How do I get faster inference?

Maximize GPU layer offloading. Use a quantized model (Q4_K_M or Q5_K_M). Enable CUDA/Metal/Vulkan in the launcher. Reduce context size if not needed.

Get Started with KoboldCpp

Download KoboldCpp, grab a Gemma 4 GGUF file, and start chatting in minutes.

Download GGUF Models Try Ollama Instead All Deploy Options