Free Tool Arena

Quantization

Quantization compresses AI model weights from 16-bit floats (FP16) to lower bit-widths — Q8, Q5, Q4, Q3 — letting larger models fit on smaller hardware at modest quality cost.

Updated May 2026 · 4 min read

What it means

FP16, the usual training precision, stores each weight in 16 bits. Quantization formats reduce this: Q8_0 (8-bit, ~99.5% of FP16 quality), Q5_K_M (5-bit, ~98%), Q4_K_M (4-bit, ~96%), Q3_K_M (3-bit, ~93%), IQ2_XS (2-bit, ~88%). Q4_K_M is the most popular format — the sweet spot between size and quality. GGUF (the format used by llama.cpp) and AWQ (supported by vLLM) are the main quantization ecosystems.
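As a back-of-envelope rule, file size is roughly parameters × bits per weight ÷ 8. A minimal sketch — the effective bits-per-weight figures below are approximate assumptions, since GGUF quants carry per-block scaling factors that push them slightly above their nominal bit-width:

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough weight file size: parameters x bits / 8, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common GGUF quants
# (assumed values; exact figures vary by model architecture).
BPW = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
    "IQ2_XS": 2.3,
}

for fmt, bpw in BPW.items():
    print(f"{fmt:8s} 70B ~ {estimate_size_gb(70e9, bpw):6.1f} GB")
```

For a 70B model this reproduces the familiar numbers: about 140 GB at FP16 and about 42 GB at Q4_K_M.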


Why it matters

Quantization is the main reason consumer hardware can run frontier-class models. Llama 70B at FP16 needs about 140 GB — beyond any consumer machine. At Q4_K_M it drops to roughly 42 GB, which fits on a Mac Studio Ultra or can be pooled across four laptops. The ~4% quality loss is rarely noticeable in practice for most workloads.

Frequently asked questions

Best quantization?

Q4_K_M for almost everyone — it has the best size/quality tradeoff. Step up to Q5_K_M if you have memory headroom, and avoid going below Q3 unless you are desperate for space.

Does it slow down inference?

Often slightly faster than FP16, in fact — quantized weights need less memory bandwidth, which is the usual bottleneck for inference. Either way the difference is marginal.
