
How to Deploy Llama Locally

30-minute setup: Ollama install, pick a Llama model that fits your hardware, expose an OpenAI-compatible API. Speed expectations by hardware.

Updated May 2026 · 6 min read

Running Llama 3.3 or Llama 4 locally costs nothing per query, keeps your data on your machine, and works offline. In 2026 the path is simpler than it sounds; here's the 30-minute setup.


Step 1: install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer at ollama.com
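The installer starts the Ollama server in the background on most systems. A quick sanity check, assuming the default port 11434:

# confirm the CLI is on your PATH
ollama --version

# confirm the local server is answering (returns installed models as JSON)
curl http://localhost:11434/api/tags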

Step 2: pull a Llama model that fits your machine

  • 16 GB RAM: ollama run llama3.2:3b — fast, useful, surprisingly capable.
  • 32 GB RAM: ollama run llama3.1:8b (Llama 3.3 ships only at 70B, and Llama 4 Scout needs well over 32 GB).
  • 64 GB RAM: ollama run llama3.3:70b-q4_K_M — the flagship, slow but excellent.
  • 192 GB unified (Mac Studio Ultra): llama4:scout runs comfortably here; the full MoE flagship, llama4:maverick, only fits at aggressive quantization.
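Not sure what fits? Most default tags are 4-bit quants, and as a rough rule a model needs its file size plus a few GB of headroom in memory. Two commands show what you're actually holding:

# disk size of each installed model
ollama list

# models currently loaded, with memory in use
ollama ps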

Step 3: chat or expose API

Type at the >>> prompt for chat. To expose an OpenAI-compatible API on your LAN:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Point Cursor / Continue.dev at http://your-ip:11434/v1.
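Any OpenAI-style client can talk to this endpoint. A minimal smoke test with curl, where your-ip and the model tag are placeholders for your own server and an installed model:

curl http://your-ip:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'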

Speed expectations

  • Llama 3.1 8B on a 4090 / M-series Mac: 60-90 tokens/sec.
  • Llama 3.3 70B Q4 on Mac Studio M2 Ultra: 12-16 tokens/sec.
  • Llama 3.3 70B Q4 on RTX 4090 + 64 GB DDR5 (offload): 8-12 tokens/sec.
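These are ballpark figures; quant level, context length, and thermals all move them. To measure your own box, run a prompt with --verbose, which prints timing stats, including an eval rate in tokens/sec, after each reply:

# model tag is an example; use any model you have pulled
ollama run llama3.1:8b --verbose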

For multi-machine pooling, see our guide to setting up a hyperspace pod. For current open-weight releases, see the open-source LLM tracker.
