Free Tool Arena


How to Use llama.cpp

Building llama.cpp, downloading GGUF models, running llama-cli and llama-server, quantization tradeoffs, Metal/CUDA.

Updated April 2026 · 6 min read

llama.cpp is the C++ inference engine that most of the local-LLM ecosystem — Ollama, LM Studio, Jan, GPT4All — is built on. Using it directly gives you the fastest path to running GGUF weights on CPUs, Apple Silicon, and GPUs with minimal overhead.


What llama.cpp is

llama.cpp is Georgi Gerganov’s single-repo C/C++ implementation of LLM inference. It supports dozens of model architectures (Llama 2/3, Mistral, Qwen, Phi, Gemma, DeepSeek, and more), stores them in the GGUF format at a range of quantization levels, and runs on CPU, CUDA, Metal, Vulkan, ROCm, and SYCL. The project ships a CLI (llama-cli), an HTTP server (llama-server), and bindings for Python, Go, Rust, and Node.

Every other “easy” local-LLM tool eventually bottoms out here. Knowing llama.cpp directly means you can skip the wrappers when they get in your way.

Building from source

Clone the repo and build with CMake. The default build is CPU-only; pass flags for your accelerator:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON    # NVIDIA
# cmake -B build -DGGML_METAL=ON   # Apple Silicon (on by default)
# cmake -B build -DGGML_VULKAN=ON  # AMD / Intel / cross-GPU
cmake --build build --config Release -j

The binaries land under build/bin/. On macOS you can also install via brew install llama.cpp for a Metal-enabled prebuilt.

Getting a GGUF model

Pull a pre-quantized GGUF from Hugging Face. The bartowski account publishes high-quality conversions of most popular base models, and TheBloke’s older uploads cover many earlier releases:

huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

If you have raw Hugging Face weights, convert them yourself with convert_hf_to_gguf.py and quantize with the llama-quantize binary.
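As a minimal sketch of that two-step route (the input directory name here is hypothetical, and flag spellings are worth checking against `--help` in your checkout):

```shell
# Step 1: convert raw Hugging Face weights (hypothetical local path) to an f16 GGUF
pip install -r requirements.txt    # dependencies for the convert script
python convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct \
  --outfile ./models/llama-3.1-8b-f16.gguf --outtype f16

# Step 2: quantize the f16 file down to Q4_K_M
./build/bin/llama-quantize \
  ./models/llama-3.1-8b-f16.gguf \
  ./models/llama-3.1-8b-Q4_K_M.gguf Q4_K_M
```

Keeping the f16 intermediate around is handy if you later want to re-quantize to a different level without redoing the conversion.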

Running inference

Single-shot prompt from the CLI:

./build/bin/llama-cli \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p "Write a haiku about distributed systems." \
  -n 128 -ngl 99

-ngl 99 offloads all layers to the GPU. For an OpenAI-compatible server, use llama-server:

./build/bin/llama-server \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99 -c 8192

The server exposes /v1/chat/completions, /v1/embeddings, and a built-in web UI at the root URL.
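Any OpenAI-style client can talk to that endpoint. A minimal sketch with curl, using the host and port from the example above (the request body is echoed first so you can inspect it; the curl line is commented out because it needs the server running, and on current builds llama-server answers for whichever model it loaded, so a "model" field is optional):

```shell
# Request body for /v1/chat/completions
body='{"messages":[{"role":"user","content":"Say hello in five words."}],"max_tokens":32}'
printf '%s\n' "$body"

# Send it (uncomment once llama-server from above is running):
# curl -s http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$body"
```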

Picking quantization and context

The standard quantization grid is Q2_K through Q8_0, with K_M and K_S variants. For most 7B–13B models, Q4_K_M is the right default. For code and reasoning, bump to Q5_K_M or Q6_K if memory allows — Q4 noticeably hurts math and code accuracy.
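To check whether a given quant fits in memory before downloading, multiply parameter count by bits per weight. The bpw figures below are rough assumptions (Q8_0 is exactly 8.5 bpw by format definition; the K-quant numbers are approximations), and real files carry some metadata overhead on top:

```shell
# Rough file/memory size: params * bits-per-weight / 8
# bpw values are approximate; llama-quantize prints its own exact size table
params=8030000000          # Llama 3.1 8B has ~8.03B parameters
for q in "Q4_K_M 4.8" "Q5_K_M 5.7" "Q8_0 8.5"; do
  set -- $q
  awk -v p="$params" -v b="$2" -v n="$1" \
    'BEGIN { printf "%s: %.1f GB\n", n, p*b/8/1e9 }'
done
```

So on a 16 GB machine, an 8B model is comfortable even at Q8_0, while Q4_K_M leaves ample room for context.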

The -c flag sets context size. Do not crank it past what you need — KV cache grows linearly with context and eats VRAM fast. Use --flash-attn to cut the overhead when supported.
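The linear growth is easy to quantify: an fp16 KV cache takes 2 (K plus V) × layers × KV heads × head dim × context × 2 bytes. Plugging in Llama 3.1 8B's published shape (32 layers, 8 KV heads under grouped-query attention, head dim 128; double-check these for your model):

```shell
# fp16 KV cache size for Llama 3.1 8B at the -c 8192 used above
layers=32; kv_heads=8; head_dim=128; ctx=8192; bytes_per_elem=2
kv_bytes=$(( 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem ))
echo "$kv_bytes bytes"                       # 1073741824, exactly 1 GiB
echo "$(( kv_bytes / 1024 / 1024 )) MiB"     # 1024 MiB
```

At -c 16384 that doubles to 2 GiB, and a 32k context costs 4 GiB for the cache alone, before any model weights.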

When to reach past llama.cpp

llama.cpp is unbeatable for single-user inference on commodity hardware, and its server is fine for small internal tools. For high-concurrency production serving with continuous batching and paged attention, switch to vLLM or SGLang. For training or fine-tuning, use PyTorch + transformers or Unsloth — llama.cpp is an inference engine, not a trainer.

