
Knowledge distillation


Updated May 2026 · 4 min read

Definition

Knowledge distillation trains a small 'student' model to imitate a larger 'teacher' model's outputs. It is used to ship cheap, fast versions of frontier models, such as DeepSeek-R1-Distill-Qwen, Phi-4, and Gemini Flash.

What it means

Original paper: Hinton et al. 2015. The student is trained on the teacher's output distribution (soft targets) rather than hard labels, which preserves more information about the teacher's 'opinion' on edge cases. Modern variants distill from multiple teachers, distill specific capabilities (reasoning, math, code), or distill one architecture into another. Frontier models often ship with their distilled cousins: DeepSeek R1 → DeepSeek-R1-Distill-Qwen-32B; Gemini Pro → Gemini Flash.
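
As a concrete illustration, here is a minimal sketch of the classic soft-target loss from Hinton et al., assuming a PyTorch setup. The temperature T, mixing weight alpha, and function name are illustrative choices, not part of any specific library:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of the soft-target KD term and ordinary cross-entropy.
    T and alpha are assumed hyperparameters for illustration."""
    # Soften both distributions with temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence between student and teacher; T**2 rescales gradients
    # so the soft term's magnitude is roughly independent of T.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T ** 2)
    # Standard cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

Higher temperatures flatten the teacher's distribution, exposing more of its relative preferences among wrong answers, which is exactly the edge-case information hard labels throw away.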


Why it matters

Distillation is how frontier-quality intelligence gets cheap. A distilled 32B model can deliver 80-90% of a 671B teacher's quality at 5-10% of the inference cost; per-token compute scales roughly with parameter count, and 32/671 ≈ 5%, consistent with that range. For self-hosting on consumer hardware, distilled models are usually the best practical option.


Frequently asked questions

Best distilled models in 2026?

DeepSeek-R1-Distill-Qwen-32B (reasoning) and Phi-4 (general-purpose 14B) are open-weight and run well on a single H100, or on 24-32GB of VRAM with quantization. Gemini Flash (multimodal) is also a distilled model, but it is API-only, not self-hostable.

Distillation vs LoRA?

Different goals. LoRA adapts an existing model to your task or style by training small low-rank adapter matrices on top of frozen weights; distillation compresses a model's behavior into an entirely smaller architecture.
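
To make the contrast concrete, a minimal LoRA-style layer might look like the sketch below (assuming PyTorch; the class name, rank r, and init scale are illustrative). The base weights stay frozen and only the tiny adapter trains, whereas distillation, as in the loss sketch above, trains a brand-new smaller model:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (rank r is illustrative)."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original model is never modified
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: output starts equal to the base layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output = frozen base projection + low-rank correction (x A^T) B^T
        return self.base(x) + (x @ self.A.T) @ self.B.T
```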
