Quantization
Definition
Quantization compresses AI model weights from 16-bit floats (FP16) to lower bit-widths — Q8, Q5, Q4, Q3 — letting larger models fit on smaller hardware at modest quality cost.
What it means
FP16 is the default training precision (16 bits per weight). Quantization formats reduce this: Q8_0 (8 bits, ~99.5% quality), Q5_K_M (5 bits, ~98%), Q4_K_M (4 bits, ~96%), Q3_K_M (3 bits, ~93%), IQ2_XS (2 bits, ~88%). The most popular format is Q4_K_M, the sweet spot between size and quality. GGUF (used by llama.cpp) and AWQ (used by vLLM) are the most common quantization formats.
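To make the idea concrete, here is a minimal sketch of block-wise 8-bit quantization in the spirit of Q8_0: each block of 32 weights stores one scale plus 32 int8 values. This is an illustration of the principle only, not the actual llama.cpp/GGUF code; the block size and storage layout are assumptions for the example.

```python
import numpy as np

BLOCK = 32  # assumed block size for this sketch

def quantize_q8(weights: np.ndarray):
    """Quantize FP32/FP16 weights to int8, one scale per 32-weight block."""
    blocks = weights.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0  # per-block scale
    scales[scales == 0] = 1.0                                    # avoid divide-by-zero
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_q8(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate weights: each int8 value times its block scale."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q8(w)
w_hat = dequantize_q8(q, s)
print("max abs error:", np.abs(w - w_hat).max())  # small relative to weight range
```

Each weight now costs 8 bits plus a small share of the per-block scale, which is why real formats advertise fractional bits per weight (e.g. roughly 8.5 for Q8_0).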
Why it matters
Quantization is the key reason consumer hardware can run frontier-class models. Llama 70B at FP16 is 140 GB, too large for any consumer machine. At Q4_K_M it is about 42 GB, which fits on a Mac Studio Ultra or can be pooled across 4 laptops. The ~4% quality loss is rarely noticeable in practice for most workloads.
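The arithmetic behind those numbers is simple: parameters times bits per weight, divided by 8. The sketch below works through it for a 70B-parameter model; the bits-per-weight figures are approximations (they include per-block scale overhead) and cover weights only, so KV cache and activations add more on top.

```python
# Back-of-the-envelope weight memory for a 70B-parameter model.
PARAMS = 70e9

bits_per_weight = {   # approximate, including quantization scale overhead
    "FP16":   16.0,
    "Q8_0":    8.5,
    "Q5_K_M":  5.7,
    "Q4_K_M":  4.85,
    "Q3_K_M":  3.9,
}

for name, bpw in bits_per_weight.items():
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{name:8s} ~{gb:5.0f} GB")

# FP16 comes out around 140 GB, Q4_K_M around 42 GB: the difference between
# "impossible on consumer hardware" and "fits on a single high-memory desktop".
```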
Frequently asked questions
Best quantization?
Q4_K_M for almost everyone — it offers the best size/quality tradeoff. Step up to Q5_K_M if you have memory headroom. Avoid anything below Q3 unless you are desperate for size.
Does it slow down inference?
Often slightly faster than FP16, since less memory bandwidth is needed, but the difference is marginal.