How to Use Ollama
Installing Ollama, pulling models (llama3.1, qwen, mistral), running via CLI and REST API, GPU acceleration, and Modelfiles.
Ollama packages the heaviest part of running a local LLM — weights, runtime, quantization — into a single binary with a one-command install. This guide walks through installing it, pulling a model, and running real prompts against the local API.
What Ollama actually is
Ollama is a local model server written in Go that wraps llama.cpp under the hood. It downloads quantized GGUF weights, spins up an HTTP server on localhost:11434, and exposes a CLI plus an OpenAI-compatible API. You talk to it the same way you talk to OpenAI — just point your client at the local endpoint instead of api.openai.com.
The big win is that the model, runtime, template, and parameters are bundled into a single named artifact (similar to a Docker image). You pull llama3.1:8b and you get a working model, not a folder of files you have to stitch together.
Installing Ollama
On macOS or Linux, a single curl command gets you the binary:
curl -fsSL https://ollama.com/install.sh | sh
On Windows, grab the installer from ollama.com. On Linux servers, the install script also registers a systemd unit so the daemon survives reboots. Verify the install:
ollama --version
systemctl status ollama   # Linux only
Pulling and running your first model
Pick a model based on your RAM. For a 16GB laptop, Llama 3.1 8B quantized to Q4 is the sweet spot. For 8GB machines, drop to Phi-3 Mini or Qwen 2.5 3B. For 32GB+, Mistral Small or Llama 3.1 70B (heavily quantized) become viable.
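These RAM guidelines follow from simple arithmetic: quantized weights take roughly params × bits ÷ 8 bytes, plus headroom for the KV cache and runtime buffers. A back-of-envelope sketch (the 20% overhead factor and the ~4.5 effective bits for Q4_K_M are rough assumptions, not Ollama figures):

```python
def approx_ram_gb(params_billions: float, bits_per_weight: float,
                  overhead: float = 1.2) -> float:
    """Rough resident-memory estimate for a quantized model:
    weights (params * bits / 8) plus ~20% for KV cache and buffers.
    Real usage varies with context length and runtime."""
    return params_billions * bits_per_weight / 8 * overhead

# Llama 3.1 8B at Q4_K_M (~4.5 effective bits/weight):
print(round(approx_ram_gb(8, 4.5), 1))  # ~5.4 GB -- comfortable on 16 GB
```

The same arithmetic explains why a 70B model needs heavy quantization even on a 32GB machine: at 4 bits it already wants ~35GB for weights alone.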
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain CRDTs in two sentences."
The first run streams tokens to your terminal. Subsequent runs reuse the loaded model from memory until it idles out (five minutes by default).
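You can stretch that idle timeout per request: the native API accepts a keep_alive field (a duration string like "30m", or -1 to keep the model resident until the server stops). A minimal sketch using only the standard library — the build_payload helper is ours, not part of Ollama:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "llama3.1:8b",
                  keep_alive: str = "30m") -> dict:
    """Request body for /api/generate; keep_alive extends the
    default five-minute unload timer for this model."""
    return {"model": model, "prompt": prompt,
            "stream": False, "keep_alive": keep_alive}

def generate(prompt: str, **kwargs) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Setting the OLLAMA_KEEP_ALIVE environment variable on the server changes the default for every request instead.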
Using the HTTP API
Every model you pull is reachable over HTTP. The native endpoint is /api/generate, and there is also an OpenAI-compatible /v1/chat/completions for drop-in SDK usage:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Summarize the CAP theorem.",
"stream": false
}'
With the OpenAI SDK, just swap the base URL and use any string for the API key:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
r = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "hi"}],
)
Picking the right quantization
GGUF models ship in multiple quantizations. Q4_K_M is a good default — roughly 4 bits per weight with minimal quality loss. Q8_0 is near-lossless but doubles memory. Q2_K is aggressive and visibly degrades output on reasoning tasks. Ollama’s default tags usually point at a sane Q4 variant, but you can pin explicitly:
ollama pull llama3.1:8b-instruct-q8_0
Also tune context size with OLLAMA_CONTEXT_LENGTH or a Modelfile — the default of 2048 tokens is small, and models like Llama 3.1 support 128k natively.
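A Modelfile makes the context setting persistent by baking it into a named model. A minimal sketch — the model name long-ctx and the system prompt are arbitrary choices, not defaults:

```
FROM llama3.1:8b
PARAMETER num_ctx 32768
SYSTEM You are a concise technical assistant.
```

Build and run it with ollama create long-ctx -f Modelfile and then ollama run long-ctx; the larger context applies to every request against that model.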
When Ollama is the wrong choice
Ollama is fine for prototypes, but if you need GPU-saturating throughput for a production inference workload you will outgrow it — move to vLLM, TGI, or SGLang for batched serving. Ollama also does not handle multi-GPU tensor parallelism well. For personal daily use, offline coding assistance, privacy-first RAG prototypes, and CI-friendly test fixtures, Ollama is the path of least resistance.