How to Use Ollama
Installing Ollama, pulling models (llama3.1, qwen, mistral), running via CLI and REST API, GPU acceleration, and Modelfiles.
Ollama packages the heaviest part of running a local LLM — weights, runtime, quantization — into a single binary with a one-command install. This guide walks through installing it, pulling a model, and running real prompts against the local API.
What Ollama actually is
Ollama is a local model server written in Go that wraps llama.cpp under the hood. It downloads quantized GGUF weights, spins up an HTTP server on localhost:11434, and exposes a CLI plus an OpenAI-compatible API. You talk to it the same way you talk to OpenAI — just point your client at the local endpoint instead of api.openai.com.
The big win is that the model, runtime, template, and parameters are bundled into a single named artifact (similar to a Docker image). You pull llama3.1:8b and you get a working model, not a folder of files you have to stitch together.
Installing Ollama
On macOS or Linux, a single curl command gets you the binary:
curl -fsSL https://ollama.com/install.sh | sh
On Windows, grab the installer from ollama.com. On Linux servers, the install script also registers a systemd unit so the daemon survives reboots. Verify the install:
ollama --version
systemctl status ollama   # Linux only
Pulling and running your first model
Pick a model based on your RAM. For a 16GB laptop, Llama 3.1 8B quantized to Q4 is the sweet spot. For 8GB machines, drop to Phi-3 Mini or Qwen 2.5 3B. For 32GB+, Mistral Small or Llama 3.1 70B (heavily quantized) become viable.
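These RAM guidelines follow from simple arithmetic: quantized weights take roughly params × bits ÷ 8 bytes, plus headroom for the KV cache and runtime buffers. A back-of-envelope sketch (the 20% overhead factor and the ~4.5 effective bits for Q4_K_M are rough assumptions, not Ollama figures):

```python
def approx_ram_gb(params_billions: float, bits_per_weight: float,
                  overhead: float = 1.2) -> float:
    """Rough resident-memory estimate for a quantized model:
    weights (params * bits / 8) plus ~20% for KV cache and buffers.
    Real usage varies with context length and runtime."""
    return params_billions * bits_per_weight / 8 * overhead

# Llama 3.1 8B at Q4_K_M (~4.5 effective bits/weight):
print(round(approx_ram_gb(8, 4.5), 1))  # ~5.4 GB -- comfortable on 16 GB
```

The same arithmetic explains why a 70B model needs heavy quantization even on a 32GB machine: at 4 bits it already wants ~35GB for weights alone.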
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain CRDTs in two sentences."
The first run streams tokens to your terminal. Subsequent runs reuse the loaded model from memory until it idles out (five minutes by default).
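You can stretch that idle timeout per request: the native API accepts a keep_alive field (a duration string like "30m", or -1 to keep the model resident until the server stops). A minimal sketch using only the standard library — the build_payload helper is ours, not part of Ollama:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "llama3.1:8b",
                  keep_alive: str = "30m") -> dict:
    """Request body for /api/generate; keep_alive extends the
    default five-minute unload timer for this model."""
    return {"model": model, "prompt": prompt,
            "stream": False, "keep_alive": keep_alive}

def generate(prompt: str, **kwargs) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Setting the OLLAMA_KEEP_ALIVE environment variable on the server changes the default for every request instead.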
Using the HTTP API
Every model you pull is reachable over HTTP. The native endpoint is /api/generate, and there is also an OpenAI-compatible /v1/chat/completions for drop-in SDK usage:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Summarize the CAP theorem.",
"stream": false
}'
With the OpenAI SDK, just swap the base URL and use any string for the API key:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
r = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "hi"}],
)
Picking the right quantization
GGUF models ship in multiple quantizations. Q4_K_M is a good default — roughly 4 bits per weight with minimal quality loss. Q8_0 is near-lossless but doubles memory. Q2_K is aggressive and visibly degrades output on reasoning tasks. Ollama’s default tags usually point at a sane Q4 variant, but you can pin explicitly:
ollama pull llama3.1:8b-instruct-q8_0
Also tune context size with OLLAMA_CONTEXT_LENGTH or a Modelfile — the default of 2048 tokens is small, and models like Llama 3.1 support 128k natively.
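A Modelfile makes the context setting persistent by baking it into a named model. A minimal sketch — the model name long-ctx and the system prompt are arbitrary choices, not defaults:

```
FROM llama3.1:8b
PARAMETER num_ctx 32768
SYSTEM You are a concise technical assistant.
```

Build and run it with ollama create long-ctx -f Modelfile and then ollama run long-ctx; the larger context applies to every request against that model.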
When Ollama is the wrong choice
Ollama is fine for prototypes, but if you need GPU-saturating throughput for a production inference workload you will outgrow it — move to vLLM, TGI, or SGLang for batched serving. Ollama also does not handle multi-GPU tensor parallelism well. For personal daily use, offline coding assistance, privacy-first RAG prototypes, and CI-friendly test fixtures, Ollama is the path of least resistance.