How to Evaluate an AI Model
Don't pick a model on benchmarks alone. Build a 30-task eval set, score with rubrics, spot-check failures. Tools: OpenAI Evals, Promptfoo, Braintrust.
Public benchmarks (MMLU, SWE-bench, HumanEval) are useful but only get you partway. To know which model is actually best for your use case, you have to evaluate it on your own tasks. Here’s a 2026 protocol that takes a couple of hours and saves months of switching.
The 4-step evaluation protocol
1. Build a 30-task evaluation set (1 hour)
Pick 30 tasks that represent your real work. Cover edge cases, ambiguity, your domain’s specific quirks. Save inputs and your “ideal” outputs (or rubrics if outputs are open-ended). 30 tasks is the sweet spot — enough to be statistically meaningful, few enough to grade by hand.
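A minimal sketch of what an eval set can look like on disk, assuming a JSONL file of your own tasks. The field names (id, input, ideal, rubric) and the filename eval_set.jsonl are illustrative, not a required schema:

```python
# Sketch of an eval-set format: one JSON record per task, with either a
# verifiable "ideal" answer or a rubric for open-ended outputs.
import json

tasks = [
    {
        "id": "invoice-edge-case-01",
        "input": "Extract the due date from this invoice: ...",
        "ideal": "2026-03-14",   # verifiable answer, graded right/wrong
        "rubric": None,
    },
    {
        "id": "summary-ambiguous-02",
        "input": "Summarize this support thread for a manager: ...",
        "ideal": None,
        "rubric": "Mentions the root cause, the fix, and the customer impact.",  # open-ended, graded 1-5
    },
]

with open("eval_set.jsonl", "w") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")
```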
2. Run each candidate model with the same system prompt (30 min)
Use a consistent prompt template. Don't bias one model with longer instructions or more examples than another. For a fair fight, run each candidate with the same temperature and the same context. Save the outputs.
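A sketch of the bake-off loop, assuming the eval_set.jsonl format above. The call_model function is a placeholder, not a real SDK call; wire it to whichever provider clients you use, keeping temperature and max tokens identical across candidates:

```python
# Fair bake-off loop: same system prompt, same settings, same tasks for every model.
import json

SYSTEM_PROMPT = "You are a careful assistant. Follow the output format exactly."
MODELS = ["model-a", "model-b"]  # candidate names are placeholders

def call_model(model: str, system: str, user: str) -> str:
    """Replace with a real API call; keep sampling settings identical per model."""
    raise NotImplementedError

def run_bakeoff(eval_path="eval_set.jsonl", out_path="outputs.jsonl"):
    with open(eval_path) as f:
        tasks = [json.loads(line) for line in f]
    with open(out_path, "w") as out:
        for model in MODELS:
            for task in tasks:
                output = call_model(model, SYSTEM_PROMPT, task["input"])
                out.write(json.dumps({"model": model, "task_id": task["id"], "output": output}) + "\n")
```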
3. Grade with rubrics (1 hour)
Score each output 1-5 on relevant dimensions: correctness, faithfulness, format compliance, conciseness. Aggregate by mean score per model. The numbers will surprise you — the model that “feels” best in casual chat often loses on consistency.
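A sketch of the aggregation step, assuming you (or a grader script) have filled in 1-5 scores per dimension for every (model, task) pair. The dimension names and the hand-entered rows are illustrative:

```python
# Aggregate rubric scores: mean per dimension and overall mean per model.
from statistics import mean
from collections import defaultdict

DIMENSIONS = ["correctness", "faithfulness", "format", "conciseness"]

# Illustrative scores; in practice load them from your grading sheet.
scores = [
    {"model": "model-a", "task_id": "invoice-edge-case-01",
     "correctness": 5, "faithfulness": 5, "format": 4, "conciseness": 3},
    {"model": "model-b", "task_id": "invoice-edge-case-01",
     "correctness": 3, "faithfulness": 4, "format": 5, "conciseness": 5},
]

by_model = defaultdict(list)
for row in scores:
    by_model[row["model"]].append(row)

for model, rows in by_model.items():
    per_dim = {d: mean(r[d] for r in rows) for d in DIMENSIONS}
    overall = mean(per_dim.values())
    print(model, per_dim, f"overall={overall:.2f}")
```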
4. Spot-check the failures (30 min)
Look at the worst 5 outputs from each model. Patterns tell you more than averages. If model A fails on edge cases but nails the common case, and model B is mediocre across the board, A wins for production.
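A small sketch for pulling the failures, assuming the score rows from the previous step (with the model output carried along from the bake-off). It just sorts by one scoring dimension and keeps the bottom five per model:

```python
# Spot-check helper: the 5 lowest-scoring outputs per model, for side-by-side reading.
from collections import defaultdict

def worst_outputs(scores, n=5, key="correctness"):
    by_model = defaultdict(list)
    for row in scores:
        by_model[row["model"]].append(row)
    return {
        model: sorted(rows, key=lambda r: r[key])[:n]
        for model, rows in by_model.items()
    }

# Example: list the task ids worth reading closely.
# for model, rows in worst_outputs(scores).items():
#     print(model, [r["task_id"] for r in rows])
```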
What to actually measure
- Correctness: binary right/wrong on tasks with verifiable answers.
- Faithfulness: does the output stay grounded in the provided source? (Most important for RAG.)
- Instruction adherence: did it actually do what you asked?
- Format compliance: did it output JSON when you asked for JSON?
- Latency: p50 and p95 for time to first token and for full completion.
- Cost per task: input tokens × input price + output tokens × output price (see the sketch after this list).
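A sketch of the cost and latency arithmetic. The per-million-token prices are placeholders; substitute the current rates for each model you test:

```python
# Per-task cost and latency percentile summaries.
from statistics import quantiles

PRICE_PER_M = {"model-a": {"input": 3.00, "output": 15.00}}  # USD per 1M tokens, illustrative

def cost_per_task(model, input_tokens, output_tokens):
    p = PRICE_PER_M[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def p50_p95(latencies_ms):
    cuts = quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return cuts[49], cuts[94]

# cost_per_task("model-a", input_tokens=1200, output_tokens=400) -> ~0.0096 USD
```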
Tools to make this easier
- OpenAI Evals — structured eval framework, model-agnostic.
- Promptfoo — YAML-driven multi-model bake-off. Easiest entry point.
- Anthropic Workbench — for Claude-specific iteration.
- LangSmith — tracing + eval, integrates with LangChain.
- Braintrust / Humanloop — production-grade eval ops.
The trap to avoid
“Vibe-checking” with 5 prompts and picking the model that gave the best answer on those 5. You’re biased. Always evaluate at scale (30+ tasks) and score against a rubric rather than by impression. Most people picking models in 2026 are still vibe-checking, which is why so many production deployments use the wrong model.
Compare models head-to-head with the frontier model tracker, or cost-compare Claude vs DeepSeek.