KoboldCpp is a user-friendly, cross-platform inference engine based on llama.cpp with a built-in web interface. It's one of the easiest ways to run Gemma 4 GGUF models locally — especially popular among creative writing, roleplay, and interactive fiction communities.
Unlike command-line tools, KoboldCpp provides a graphical launcher and a browser-based chat UI out of the box. It supports CPU, CUDA (NVIDIA), ROCm (AMD), Vulkan, and Metal (Apple) acceleration, making it work on virtually any hardware.
Get the latest release from GitHub:
pages.koboldcpp.koboldcppPage.install.windows.desc
pages.koboldcpp.koboldcppPage.install.mac.desc
pages.koboldcpp.koboldcppPage.install.linux.desc
pages.koboldcpp.koboldcppPage.download.subtitle
pages.koboldcpp.koboldcppPage.download.items.0.desc
pages.koboldcpp.koboldcppPage.download.items.1.desc
pages.koboldcpp.koboldcppPage.download.items.2.desc
pages.koboldcpp.koboldcppPage.download.items.3.desc
Double-click KoboldCpp to open the launcher. Select your GGUF file, configure GPU layers, and click Launch.
Or launch from the terminal with more control:
koboldcpp --model gemma-4-e4b-it-Q4_K_M.gguf --gpulayers 33 --contextsize 4096Start with 4096. Increase if you need longer conversations. Higher values use more RAM.
Set to the maximum your GPU can handle. More layers = faster inference. 0 = CPU only.
For CPU inference. Leave 1 core for system overhead.
Default works well. Increase for faster prompt processing if you have RAM to spare.
KoboldCpp exposes both the Kobold API and an OpenAI-compatible API. Use with SillyTavern, Agnaistic, or any compatible frontend:
curl http://localhost:5001/api/v1/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "Write a Python function to sort a list",
"max_length": 200,
"temperature": 0.7
}'curl http://localhost:5001/api/v1/modelKoboldCpp is an open-source, cross-platform inference engine with a built-in web UI. It's based on llama.cpp and supports GGUF models. Popular for creative writing, roleplay, and local AI chat.
For most users, gemma-4-e4b-it-Q4_K_M.gguf (~3GB) offers the best balance. If you have a GPU with 24GB+ VRAM, the 31B Q4 model provides flagship quality.
Yes. KoboldCpp is one of the most popular backends for SillyTavern. Connect via the Kobold API at localhost:5001 or the OpenAI-compatible endpoint.
Ollama is simpler for quick setup and API-first usage. KoboldCpp excels with its built-in UI, advanced sampler settings, and compatibility with chat frontends like SillyTavern. Choose based on your workflow.
KoboldCpp primarily focuses on text generation. For multimodal features (image/video/audio input), use Ollama or vLLM instead.
Maximize GPU layer offloading. Use a quantized model (Q4_K_M or Q5_K_M). Enable CUDA/Metal/Vulkan in the launcher. Reduce context size if not needed.
pages.koboldcpp.koboldcppPage.faq.items.6.a
pages.koboldcpp.koboldcppPage.faq.items.7.a
pages.koboldcpp.koboldcppPage.faq.items.8.a
pages.koboldcpp.koboldcppPage.faq.items.9.a
Download KoboldCpp, grab a Gemma 4 GGUF file, and start chatting in minutes.