VRAM
Definition
VRAM (Video RAM) is the memory on your GPU. It determines which AI models you can run locally — the model weights, KV cache, and activations all need to fit. It is the single most relevant hardware spec for local AI.
What it means
Approximate VRAM needs at Q4 quantization:

- 7B model: ~6 GB
- 13B: ~10 GB
- 32B: ~22 GB
- 70B: ~42 GB

Add 1-5 GB for the KV cache, depending on context length. Among consumer hardware in 2026: the RTX 4090 has 24 GB, the RTX 5090 has 32 GB, and Apple Silicon offers 16-192 GB of unified memory, though at lower bandwidth. Models too big for one GPU can be split: tensor parallelism spreads the model across multiple GPUs in one machine (fast), while pipeline parallelism spans multiple machines (slower).
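These figures follow from a simple rule of thumb: weight memory ≈ parameter count × bits per weight ÷ 8, plus a KV cache that grows linearly with context length. Here is a minimal sketch of that arithmetic in Python; the 4.8 bits/weight average and the 70B-class shape (80 layers, 8 KV heads, head dim 128) are illustrative assumptions, not figures from this page:

```python
def weight_gb(params_b: float, bits_per_weight: float = 4.8) -> float:
    """Weight memory in GB: parameters x (bits per weight / 8 bits per byte).
    4.8 bits/weight is an assumed effective average for Q4-class quants
    (scales and a few higher-precision layers raise it above 4.0)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 tensors (K and V) x layers x kv_heads x head_dim
    x context tokens x bytes per element (2 for FP16)."""
    return (2 * layers * kv_heads * head_dim
            * context_tokens * bytes_per_elem / 1e9)

# Assumed shape for a 70B-class model with grouped-query attention,
# decoded at an 8k-token context:
total = weight_gb(70) + kv_cache_gb(layers=80, kv_heads=8,
                                    head_dim=128, context_tokens=8192)
print(f"~{total:.0f} GB")  # ~45 GB: ~42 GB of weights plus ~2.7 GB of KV cache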
Why it matters
If you're buying hardware for local AI, VRAM is the single most impactful number. An RTX 4090 (24 GB) versus an RTX 4080 (16 GB) is the difference between running 32B models and topping out at 13B. A Mac Studio with 192 GB of unified memory hosts 70B+ models that no single consumer Nvidia GPU can fit.
Frequently asked questions
Can I split a model across GPUs?
Yes. Within one machine, tensor parallelism splits the model across GPUs (vLLM, TGI); across machines, pipeline parallelism chains hosts together (llama.cpp RPC, exo, or Hyperspace pods). See the sketch below.
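With vLLM, tensor parallelism is a constructor argument. A minimal sketch, assuming two GPUs in one machine; the model ID is a placeholder for one whose quantized weights fit in the combined VRAM:

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards each weight matrix across two GPUs
# in this machine; the model ID below is a placeholder.
llm = LLM(model="your-org/your-model", tensor_parallel_size=2)

out = llm.generate(["What is VRAM?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```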
Apple Silicon vs Nvidia?
A Mac Studio with 192 GB of unified memory hosts huge models that no single Nvidia card fits. Nvidia GPUs have much higher memory bandwidth, so they generate tokens faster, but each card caps out at far less VRAM. Different tradeoffs.
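The bandwidth tradeoff can be put in numbers with a standard back-of-envelope rule: single-stream decoding is memory-bound, so generation speed tops out near memory bandwidth divided by the bytes streamed per token (roughly the size of the weights). A sketch using vendor bandwidth specs and the same ~4.8 bits/weight assumption as above:

```python
def decode_tok_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough ceiling on single-stream generation speed: every generated
    token streams all of the weights through the memory bus once."""
    return bandwidth_gb_s / weights_gb

weights = 70 * 4.8 / 8  # ~42 GB for a 70B model at Q4
print(f"{decode_tok_per_s(1008, weights):.0f} tok/s")  # RTX 4090 (1008 GB/s), if it fit
print(f"{decode_tok_per_s(800, weights):.0f} tok/s")   # M2 Ultra (800 GB/s)
```

In practice the 4090 cannot hold 42 GB at all, which is the capacity side of the tradeoff.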
Related terms
- Quantization: compresses AI model weights from 16-bit floats (FP16) to lower bit-widths — Q8, Q5, Q4, Q3 — letting larger models fit on smaller hardware at modest quality cost.
- MoE (Mixture of Experts): an AI architecture where the model has many specialized sub-networks ('experts') and only activates a few per token, letting the model be huge in total parameters but cheap to run.
- Inference (AI): the process of running a trained AI model to generate predictions or outputs — distinct from training (which builds the model) or fine-tuning (which adapts it).