How to Use llama.cpp
Building llama.cpp, downloading GGUF models, running llama-cli and llama-server, quantization tradeoffs, Metal/CUDA.
llama.cpp is the C++ inference engine that most of the local-LLM ecosystem — Ollama, LM Studio, Jan, GPT4All — is built on. Using it directly gives you the fastest path to running GGUF weights on CPUs, Apple Silicon, and GPUs with minimal overhead.
What llama.cpp is
llama.cpp is Georgi Gerganov’s single-repo C/C++ implementation of Llama-family inference. It supports dozens of model architectures (Llama 2/3, Mistral, Qwen, Phi, Gemma, DeepSeek, and more), quantizes them to GGUF, and runs on CPU, CUDA, Metal, Vulkan, ROCm, and SYCL. The project ships a CLI (llama-cli), a server (llama-server), and bindings for Python, Go, Rust, and Node.
Every other “easy” local-LLM tool eventually bottoms out here. Knowing llama.cpp directly means you can skip the wrappers when they get in your way.
Building from source
Clone the repo and build with CMake. The default build is CPU-only; pass flags for your accelerator:
```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON      # NVIDIA
# cmake -B build -DGGML_METAL=ON   # Apple Silicon (on by default)
# cmake -B build -DGGML_VULKAN=ON  # AMD / Intel / cross-GPU
cmake --build build --config Release -j
```
The binaries land under build/bin/. On macOS you can also install via brew install llama.cpp for a Metal-enabled prebuilt binary.
Getting a GGUF model
Pull a pre-quantized GGUF from Hugging Face. The bartowski and TheBloke accounts publish high-quality conversions for most popular base models:
```shell
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models
```
If you have raw Hugging Face weights, convert them yourself with convert_hf_to_gguf.py and quantize with the llama-quantize binary.
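A rough sketch of that two-step flow, assuming a checkout of the llama.cpp repo and an already-downloaded Hugging Face model directory (the paths here are illustrative):

```shell
# Step 1: convert HF weights to an unquantized f16 GGUF
# (convert_hf_to_gguf.py lives in the llama.cpp repo root)
python convert_hf_to_gguf.py /path/to/hf-model \
  --outfile ./models/model-f16.gguf --outtype f16

# Step 2: quantize the f16 GGUF down to Q4_K_M
./build/bin/llama-quantize \
  ./models/model-f16.gguf ./models/model-Q4_K_M.gguf Q4_K_M
```

Keeping the f16 intermediate around lets you re-quantize to other levels later without redoing the conversion.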
Running inference
Single-shot prompt from the CLI:
```shell
./build/bin/llama-cli \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p "Write a haiku about distributed systems." \
  -n 128 -ngl 99
```
-ngl 99 offloads all layers to the GPU. For an OpenAI-compatible server, use llama-server:
```shell
./build/bin/llama-server \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99 -c 8192
```
The server exposes /v1/chat/completions, /v1/embeddings, and a built-in web UI at the root URL.
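Because the API is OpenAI-compatible, any OpenAI client works against it. A minimal curl request, assuming the server command above is running on localhost:8080 (llama-server serves the loaded model regardless of the "model" field, which most clients require anyway):

```shell
# Chat completion against the local server; response is OpenAI-style JSON
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32
      }'
```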
Picking quantization and context
The standard quantization grid is Q2_K through Q8_0, with K_M and K_S variants. For most 7B–13B models, Q4_K_M is the right default. For code and reasoning, bump to Q5_K_M or Q6_K if memory allows — Q4 noticeably hurts math and code accuracy.
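A quick way to sanity-check disk and memory needs is to multiply parameter count by bits per weight. The bits-per-weight figures below are approximations (actual values vary slightly by model and llama.cpp version):

```shell
# Rough GGUF file-size estimate: parameters × bits-per-weight / 8
# bpw values are approximate, taken from typical llama-quantize output
python3 -c "
for name, bpw in [('Q4_K_M', 4.85), ('Q5_K_M', 5.67), ('Q6_K', 6.56), ('Q8_0', 8.50)]:
    print(f'8B @ {name}: ~{8e9 * bpw / 8 / 1e9:.1f} GB')
"
```

Add a gigabyte or two of headroom for the KV cache and compute buffers when deciding whether a file fits in VRAM.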
The -c flag sets context size. Do not crank it past what you need — KV cache grows linearly with context and eats VRAM fast. Use --flash-attn to cut the overhead when supported.
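To make "linearly" concrete, here is a back-of-the-envelope KV-cache estimate for Llama 3.1 8B (32 layers, 8 KV heads via GQA, head dimension 128, f16 cache, which is the default):

```shell
# KV cache bytes per token = 2 (K and V) × layers × kv_heads × head_dim × bytes/elem
python3 - <<'EOF'
layers, kv_heads, head_dim = 32, 8, 128   # Llama 3.1 8B
bytes_per_elem = 2                        # f16 cache entries
ctx = 8192
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(f"{per_token // 1024} KiB per token, "
      f"{per_token * ctx / 2**30:.2f} GiB at ctx={ctx}")
EOF
```

So an 8K context costs about 1 GiB on top of the weights for this model, and a 128K context would cost roughly 16 GiB, which is why oversizing -c is expensive.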
When to reach past llama.cpp
llama.cpp is unbeatable for single-user inference on commodity hardware, and its server is fine for small internal tools. For high-concurrency production serving with continuous batching and paged attention, switch to vLLM or SGLang. For training or fine-tuning, use PyTorch + transformers or Unsloth — llama.cpp is an inference engine, not a trainer.