AI & LLMs · Guide · AI & Prompt Tools
How to Use llama.cpp
Build llama.cpp from source, download GGUF files, and serve models via CLI or REST API with Metal or CUDA acceleration in your browser.
llama.cpp is the C++ inference engine that most of the local-LLM ecosystem — Ollama, LM Studio, Jan, GPT4All — is built on. Using it directly gives you the fastest path to running GGUF weights on CPUs, Apple Silicon, and GPUs with minimal overhead.
Advertisement
What llama.cpp is
llama.cpp is Georgi Gerganov’s single-repo C/C++ implementation of Llama-family inference. It supports dozens of model architectures (Llama 2/3, Mistral, Qwen, Phi, Gemma, DeepSeek, and more), quantizes them to GGUF, and runs on CPU, CUDA, Metal, Vulkan, ROCm, and SYCL. The project ships a CLI (llama-cli), a server (llama-server), and bindings for Python, Go, Rust, and Node.
Every other “easy” local-LLM tool eventually bottoms out here. Knowing llama.cpp directly means you can skip the wrappers when they get in your way.
Building from source
Clone the repo and build with CMake. The default build is CPU-only; pass flags for your accelerator:
git clone https://github.com/ggml-org/llama.cpp cd llama.cpp cmake -B build -DGGML_CUDA=ON # NVIDIA # cmake -B build -DGGML_METAL=ON # Apple Silicon (on by default) # cmake -B build -DGGML_VULKAN=ON # AMD / Intel / cross-GPU cmake --build build --config Release -j
The binaries land under build/bin/. On macOS you can also install via brew install llama.cpp for a Metal-enabled prebuilt.
Getting a GGUF model
Pull a pre-quantized GGUF from Hugging Face. The bartowski and TheBloke accounts publish high-quality conversions for most popular base models:
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \ Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \ --local-dir ./models
If you have raw Hugging Face weights, convert them yourself with convert_hf_to_gguf.py and quantize with the llama-quantize binary.
Running inference
Single-shot prompt from the CLI:
./build/bin/llama-cli \ -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \ -p "Write a haiku about distributed systems." \ -n 128 -ngl 99
-ngl 99 offloads all layers to the GPU. For an OpenAI-compatible server, use llama-server:
./build/bin/llama-server \ -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \ --host 0.0.0.0 --port 8080 -ngl 99 -c 8192
The server exposes /v1/chat/completions, /v1/embeddings, and a built-in web UI at the root URL.
Picking quantization and context
The standard quantization grid is Q2_K through Q8_0, with K_M and K_S variants. For most 7B–13B models, Q4_K_M is the right default. For code and reasoning, bump to Q5_K_M or Q6_K if memory allows — Q4 noticeably hurts math and code accuracy.
The -c flag sets context size. Do not crank it past what you need — KV cache grows linearly with context and eats VRAM fast. Use --flash-attn to cut the overhead when supported.
When to reach past llama.cpp
llama.cpp is unbeatable for single-user inference on commodity hardware, and its server is fine for small internal tools. For high-concurrency production serving with continuous batching and paged attention, switch to vLLM or SGLang. For training or fine-tuning, use PyTorch + transformers or Unsloth — llama.cpp is an inference engine, not a trainer.
Use these while you read
Tools that pair with this guide
- LLM Context Window CalculatorCheck if your tokens fit GPT-4o, Claude, Gemini, Llama, or Mistral context windows — see headroom and percent used. Free, instant, browser-only.AI & Prompt Tools
- AI Prompt GeneratorTurn a vague idea into a structured prompt. Pick role, task, context, constraints, and output format. Works with ChatGPT, Claude, and Gemini.AI & Prompt Tools
- AI Token CounterEstimate tokens, characters, words, and approximate API cost for GPT-4o, GPT-4, Claude, and Gemini — before you hit send.AI & Prompt Tools
- AI Prompt LibraryBrowse a curated catalog of prompt templates for writing, coding, marketing, and research. One click to copy.AI & Prompt Tools
Advertisement
Continue reading
- AI & LLMsGitHub Copilot Pricing and ComparisonCompare free vs paid GitHub Copilot tiers and analyze it against ChatGPT, Cursor, and Tabnine. Find the best value plan instantly with this free online guide.
- AI & LLMsGitHub Copilot Features and CapabilitiesTest what Copilot really does — code accuracy, scope limits, debugging, web dev, legacy code, tests, docs, team customization. Free guide, no sign-up.
- AI & LLMsGitHub Copilot Security and Data HandlingAudit where your code goes, who sees it, training-data policy, network needs, and what happens when Copilot suggests broken code. Free, no sign-up.
- AI & LLMsAI Fluency SkillsThe 8 sub-skills of AI fluency: prompt structure, model selection, tool use, quality calibration, iteration, context management, cost awareness, privacy.
- AI & LLMsAnthropic Skills ExplainedSkills as Anthropic's answer to Custom GPTs — markdown-defined, version-controlled in git, work in terminal. Anatomy + Skills vs Custom GPTs.
- AI & LLMsKimi K2 vs DeepSeek V3Two open-weight Chinese flagships. Kimi K2 = 1M context, DeepSeek V3.2 = top-tier reasoning + coding. Pick by use case.