How to Deploy Llama Locally
A 30-minute setup: install Ollama, pick a Llama model that fits your hardware, and expose an OpenAI-compatible API. Plus speed expectations by hardware.
Updated May 2026 · 6 min read
Running Llama 3.3 or Llama 4 locally has zero marginal cost, gives you full privacy, and works offline. The path is simpler in 2026 than it sounds — here’s the 30-minute setup.
Step 1: install Ollama
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download installer at ollama.com
Step 2: pull a Llama model that fits your machine
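Not sure which tier you fall into? Check your total RAM first (standard OS commands, nothing Ollama-specific):
# Linux: show total and available memory
free -h
# macOS: show physical memory in bytes
sysctl hw.memsize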
- 16 GB RAM: ollama run llama3.2:3b — fast, useful, surprisingly capable.
- 32 GB RAM: ollama run llama3.3:8b or llama4:scout.
- 64 GB RAM: ollama run llama3.3:70b-q4_K_M — the flagship, slow but excellent.
- 192 GB unified (Mac Studio Ultra): llama4:maverick — full MoE flagship.
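Whichever tier you land in, the workflow is the same. A minimal sketch (the 3B tag is just an example; substitute whichever model you chose above):
# Download a model without opening a chat session
ollama pull llama3.2:3b
# List what's on disk, with sizes
ollama list
# Delete a model you no longer need
ollama rm llama3.2:3b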
Step 3: chat or expose the API
Type at the >>> prompt to chat. To expose an OpenAI-compatible API on your LAN:
OLLAMA_HOST=0.0.0.0:11434 ollama serve
Then point Cursor / Continue.dev at http://your-ip:11434/v1.
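To sanity-check the endpoint before wiring up an editor, a plain curl works. This assumes the llama3.2:3b example model from Step 2; swap in whatever you pulled, and replace your-ip as above:
curl http://your-ip:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Reply with one short sentence."}]
  }'
A JSON response with a choices array means the server is reachable and the model loads.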
Speed expectations
- Llama 3.3 8B on a 4090 / M-series Mac: 60-90 tokens/sec.
- Llama 3.3 70B Q4 on Mac Studio M2 Ultra: 12-16 tokens/sec.
- Llama 3.3 70B Q4 on RTX 4090 + 64 GB DDR5 (offload): 8-12 tokens/sec.
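To see where your own machine lands against the numbers above, Ollama can print per-response timing stats; the eval rate line is generation speed in tokens/sec (model tag again just an example):
# --verbose prints load time, prompt eval rate, and eval rate after each reply
ollama run llama3.2:3b --verbose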
For multi-machine pooling, see how to set up a hyperspace pod. For current open-weight releases, see the open-source LLM tracker.