Perplexity (AI metric)
Definition
In AI/ML, perplexity is a measure of how 'surprised' a language model is by a piece of text. It is computed as 2 raised to the cross-entropy (in bits per token); equivalently, e raised to the cross-entropy in nats. Lower is better: the model assigns higher probability to the actual text.
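That definition can be sketched directly: given the probability the model assigned to each actual token, average the bits of surprise and exponentiate. A minimal, self-contained Python sketch (the function name and toy probabilities are illustrative, not from any particular library):

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each
    actual token: 2 ** (mean cross-entropy in bits per token)."""
    # cross-entropy in bits: mean of -log2 p(token)
    h = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** h

# A model that puts probability 0.5 on every actual token has a
# cross-entropy of 1 bit/token, so its perplexity is 2.
print(perplexity([0.5, 0.5, 0.5]))  # 2.0
```

Note that the base cancels out as long as you are consistent: using natural log and `math.e ** h` gives the same number.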
What it means
Perplexity is computed on a held-out evaluation set. A model with perplexity 5 is roughly 'as confused as' one choosing uniformly among 5 possible next tokens at each step. Modern frontier LLMs reach perplexity in the low single digits on standard benchmarks (WikiText-103, The Pile). Lower is better, but absolute perplexity is hard to compare across tokenizers, because different vocabulary sizes change the number.
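The 'uniform choice among k tokens' reading can be sanity-checked in a few lines (a standalone sketch, not from the article): a model that assigns probability 1/k to every actual token has cross-entropy log2(k) bits per token, so its perplexity is exactly k.

```python
import math

# If every actual token gets probability 1/k, the per-token
# cross-entropy is -log2(1/k) = log2(k) bits, and 2 ** log2(k) = k.
for k in (5, 50, 50000):
    h = -math.log2(1 / k)   # bits per token under uniform guessing
    ppl = 2 ** h
    print(k, round(ppl, 6))
```

This is also why vocabulary size matters when comparing models: a larger vocabulary raises the perplexity of the uniform baseline itself.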
Why it matters
Perplexity is a quick, research-grade quality metric for language models. It has fallen out of favor for production decisions (where MMLU, SWE-bench, and custom evals are more relevant) but remains useful for comparing fine-tunes of the same base model, monitoring training progress, and sanity-checking that a model hasn't degraded.
Frequently asked questions
Why does low perplexity matter?
It correlates with many downstream qualities (better generation, better reasoning) but isn't a perfect proxy: a model can have low perplexity and still behave badly in chat.
Is this the same as Perplexity.ai?
No. That's an AI search company that adopted the name; the metric and the company are unrelated.
Related terms
- LLM (Large Language Model): a transformer-based neural network trained on huge text datasets to predict the next token. ChatGPT, Claude, Gemini, DeepSeek are all LLMs.
- Evals (AI evaluation): systematic tests of AI model quality. Graded test sets that measure performance on specific tasks. Critical for picking models, validating fine-tunes, and not shipping regressions.