
Knowledge distillation


Updated May 2026 · 4 min read

Definition

Knowledge distillation trains a small 'student' model to imitate a larger 'teacher' model's outputs. It is used to ship cheap, fast versions of frontier models, such as DeepSeek-R1-Distill-Qwen, Phi-4, and Gemini Flash.

What it means

Original paper: Hinton et al. 2015. The student is trained on the teacher's output distribution (soft targets) rather than hard labels, which preserves more information about the teacher's 'opinion' on edge cases. Modern variants distill from multiple teachers, distill specific capabilities (reasoning, math, code), or distill one architecture into another. Frontier models often ship with their distilled cousins: DeepSeek R1 → DeepSeek-R1-Distill-Qwen-32B; Gemini Pro → Gemini Flash.
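
As a concrete illustration, here is a minimal sketch of the classic soft-target loss from Hinton et al., assuming a PyTorch setup. The temperature T, mixing weight alpha, and function name are illustrative choices, not part of any specific library:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of the soft-target KD term and ordinary cross-entropy.
    T and alpha are assumed hyperparameters for illustration."""
    # Soften both distributions with temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between student and teacher; T**2 rescales gradients
    # so the soft term's magnitude is roughly independent of T.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
    # Standard cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

Higher temperatures flatten the teacher's distribution, exposing more of its relative preferences among wrong answers, which is exactly the edge-case information hard labels throw away.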


Why it matters

Distillation is how frontier-quality intelligence gets cheap. A distilled 32B model can deliver 80-90% of a 671B teacher's quality at 5-10% of the inference cost; per-token compute scales roughly with parameter count, and 32/671 ≈ 5%, consistent with that range. For self-hosting on consumer hardware, distilled models are usually the best practical option.


Frequently asked questions

Best distilled models in 2026?

DeepSeek-R1-Distill-Qwen-32B (reasoning) and Phi-4 (general-purpose 14B) are open-weight and run well on a single H100, or on 24-32GB of VRAM with quantization. Gemini Flash (multimodal) is also a distilled model, but it is API-only, not self-hostable.

Distillation vs LoRA?

Different goals. LoRA adapts an existing model to your task or style by training small low-rank adapter matrices on top of frozen weights; distillation compresses a model's behavior into an entirely smaller architecture.
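
To make the contrast concrete, a minimal LoRA-style layer might look like the sketch below (assuming PyTorch; the class name, rank r, and init scale are illustrative). The base weights stay frozen and only the tiny adapter trains, whereas distillation, as in the loss sketch above, trains a brand-new smaller model:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (rank r is illustrative)."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original model is never modified
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: output starts equal to the base layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output = frozen base projection + low-rank correction (x A^T) B^T
        return self.base(x) + (x @ self.A.T) @ self.B.T
```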
