VRAM
Definition
VRAM (Video RAM) is the memory on your GPU. It determines which AI models you can run locally — the model weights, KV cache, and activations all need to fit. It is the single most relevant hardware spec for local AI.
What it means
Approximate VRAM needs at Q4 quantization:

- 7B model: ~6 GB
- 13B: ~10 GB
- 32B: ~22 GB
- 70B: ~42 GB

Add 1-5 GB for the KV cache, depending on context length. Among consumer hardware in 2026: the RTX 4090 has 24 GB, the RTX 5090 has 32 GB, and Apple Silicon offers 16-192 GB of unified memory, though at lower bandwidth. Models too big for one GPU can be split: tensor parallelism spreads the model across multiple GPUs in one machine (fast), while pipeline parallelism spans multiple machines (slower).
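These figures follow from a simple rule of thumb: weight memory ≈ parameter count × bits per weight ÷ 8, plus a KV cache that grows linearly with context length. Here is a minimal sketch of that arithmetic in Python; the 4.8 bits/weight average and the 70B-class shape (80 layers, 8 KV heads, head dim 128) are illustrative assumptions, not figures from this page:

```python
def weight_gb(params_b: float, bits_per_weight: float = 4.8) -> float:
    """Weight memory in GB: parameters x (bits per weight / 8 bits per byte).
    4.8 bits/weight is an assumed effective average for Q4-class quants
    (scales and a few higher-precision layers raise it above 4.0)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 tensors (K and V) x layers x kv_heads x head_dim
    x context tokens x bytes per element (2 for FP16)."""
    return (2 * layers * kv_heads * head_dim
            * context_tokens * bytes_per_elem / 1e9)

# Assumed shape for a 70B-class model with grouped-query attention,
# decoded at an 8k-token context:
total = weight_gb(70) + kv_cache_gb(layers=80, kv_heads=8,
                                    head_dim=128, context_tokens=8192)
print(f"~{total:.0f} GB")  # ~45 GB: ~42 GB of weights plus ~2.7 GB of KV cache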
Why it matters
If you're buying hardware for local AI, VRAM is the single most impactful number. An RTX 4090 (24 GB) versus an RTX 4080 (16 GB) is the difference between running 32B models and topping out at 13B. A Mac Studio with 192 GB of unified memory hosts 70B+ models that no single consumer Nvidia GPU can fit.
Frequently asked questions
Can I split a model across GPUs?
Yes. Within one machine, tensor parallelism splits the model across GPUs (vLLM, TGI); across machines, pipeline parallelism chains hosts together (llama.cpp RPC, exo, or Hyperspace pods). See the sketch below.
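With vLLM, tensor parallelism is a constructor argument. A minimal sketch, assuming two GPUs in one machine; the model ID is a placeholder for one whose quantized weights fit in the combined VRAM:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards each weight matrix across two GPUs
# in this machine; the model ID below is a placeholder.
llm = LLM(model="your-org/your-model", tensor_parallel_size=2)

out = llm.generate(["What is VRAM?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```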
Apple Silicon vs Nvidia?
A Mac Studio with 192 GB of unified memory hosts huge models that no single Nvidia card fits. Nvidia GPUs have much higher memory bandwidth, so they generate tokens faster, but each card caps out at far less VRAM. Different tradeoffs.
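The bandwidth tradeoff can be put in numbers with a standard back-of-envelope rule: single-stream decoding is memory-bound, so generation speed tops out near memory bandwidth divided by the bytes streamed per token (roughly the size of the weights). A sketch using vendor bandwidth specs and the same ~4.8 bits/weight assumption as above:

```python
def decode_tok_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough ceiling on single-stream generation speed: every generated
    token streams all of the weights through the memory bus once."""
    return bandwidth_gb_s / weights_gb

weights = 70 * 4.8 / 8  # ~42 GB for a 70B model at Q4
print(f"{decode_tok_per_s(1008, weights):.0f} tok/s")  # RTX 4090 (1008 GB/s), if it fit
print(f"{decode_tok_per_s(800, weights):.0f} tok/s")   # M2 Ultra (800 GB/s)
```

In practice the 4090 cannot hold 42 GB at all, which is the capacity side of the tradeoff.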
Related terms
- Quantization: compresses AI model weights from 16-bit floats (FP16) to lower bit-widths — Q8, Q5, Q4, Q3 — letting larger models fit on smaller hardware at modest quality cost.
- MoE (Mixture of Experts): an AI architecture where the model has many specialized sub-networks ('experts') and only activates a few per token, letting the model be huge in total parameters but cheap to run.
- Inference (AI): the process of running a trained AI model to generate predictions or outputs — distinct from training (which builds the model) or fine-tuning (which adapts it).