Quantization
Definition
Quantization compresses AI model weights from 16-bit floats (FP16) to lower bit-widths — Q8, Q5, Q4, Q3 — letting larger models fit on smaller hardware at modest quality cost.
What it means
FP16 is the default training precision (16 bits per weight). Quantization formats reduce this: Q8_0 (8 bits, ~99.5% quality), Q5_K_M (5 bits, ~98%), Q4_K_M (4 bits, ~96%), Q3_K_M (3 bits, ~93%), IQ2_XS (2 bits, ~88%). The most popular format is Q4_K_M, the sweet spot between size and quality. GGUF (used by llama.cpp) and AWQ (used by vLLM) are the most common quantization formats.
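To make the idea concrete, here is a minimal sketch of block-wise 8-bit quantization in the spirit of Q8_0: each block of 32 weights stores one scale plus 32 int8 values. This is an illustration of the principle only, not the actual llama.cpp/GGUF code; the block size and storage layout are assumptions for the example.

```python
import numpy as np

BLOCK = 32  # assumed block size for this sketch

def quantize_q8(weights: np.ndarray):
    """Quantize FP32/FP16 weights to int8, one scale per 32-weight block."""
    blocks = weights.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0  # per-block scale
    scales[scales == 0] = 1.0                                    # avoid divide-by-zero
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate weights: each int8 value times its block scale."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q8(w)
w_hat = dequantize_q8(q, s)
print("max abs error:", np.abs(w - w_hat).max())  # small relative to weight range
```

Each weight now costs 8 bits plus a small share of the per-block scale, which is why real formats advertise fractional bits per weight (e.g. roughly 8.5 for Q8_0).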
Why it matters
Quantization is the key reason consumer hardware can run frontier-class models. Llama 70B at FP16 is 140 GB, too large for any consumer machine. At Q4_K_M it is about 42 GB, which fits on a Mac Studio Ultra or can be pooled across 4 laptops. The ~4% quality loss is rarely noticeable in practice for most workloads.
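The arithmetic behind those numbers is simple: parameters times bits per weight, divided by 8. The sketch below works through it for a 70B-parameter model; the bits-per-weight figures are approximations (they include per-block scale overhead) and cover weights only, so KV cache and activations add more on top.

```python
# Back-of-the-envelope weight memory for a 70B-parameter model.
PARAMS = 70e9

bits_per_weight = {   # approximate, including quantization scale overhead
    "FP16":   16.0,
    "Q8_0":    8.5,
    "Q5_K_M":  5.7,
    "Q4_K_M":  4.85,
    "Q3_K_M":  3.9,
}

for name, bpw in bits_per_weight.items():
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{name:8s} ~{gb:5.0f} GB")

# FP16 comes out around 140 GB, Q4_K_M around 42 GB: the difference between
# "impossible on consumer hardware" and "fits on a single high-memory desktop".
```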
Frequently asked questions
Best quantization?
Q4_K_M for almost everyone — it offers the best size/quality tradeoff. Step up to Q5_K_M if you have memory headroom. Avoid anything below Q3 unless you are desperate for size.
Does it slow down inference?
Often slightly faster than FP16, since less memory bandwidth is needed, but the difference is marginal.