How to Use Promptfoo
Write promptfooconfig.yaml assertions, run adversarial tests, view results in the web UI, and integrate prompt evaluation into your CI pipeline.
Promptfoo is a CLI that treats prompts like code — YAML tests, assertions, diffs, and CI-friendly output.
Promptfoo is what unit tests look like for LLMs. You declare prompts, test cases, and assertions in a YAML file, run promptfoo eval, and get a side-by-side grid of outputs with pass/fail scoring. It plugs into CI, supports red-teaming, and speaks nearly every model provider natively.
What it is
A Node.js CLI and web viewer. It loads a config, fans out requests across providers and prompt variants, runs deterministic (contains, regex, equals) and model-graded (llm-rubric, similar) assertions, and writes results to a local SQLite database. The viewer renders diffs and lets you share results.
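As a rough sketch of how the two assertion families can sit side by side in a config, the fragment below mixes deterministic checks with model-graded ones. The prompt text, the `question` variable, the expected values, and the 0.8 threshold are all invented for illustration; only the assertion type names come from the description above.

```yaml
# promptfooconfig.yaml (illustrative; prompt, vars, and values are made up)
prompts:
  - "Answer in one short sentence: {{question}}"

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      question: "What is the capital of France?"
    assert:
      # Deterministic: string and pattern checks, no extra model call
      - type: contains
        value: "Paris"
      - type: regex
        value: '^[A-Z].*\.$'
      # Model-graded: a grader LLM judges the output against a rubric
      - type: llm-rubric
        value: "States that Paris is the capital and adds no unrelated facts"
      # Model-graded: embedding similarity to a reference answer
      - type: similar
        value: "The capital of France is Paris."
        threshold: 0.8
```

Putting the cheap deterministic checks first makes failures easier to read, since they do not depend on a grader's judgment.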
Install / set up
# global install
npm install -g promptfoo
promptfoo init
export OPENAI_API_KEY=sk-...
First run
promptfoo init creates a promptfooconfig.yaml with a sample prompt and two test cases. Run promptfoo eval to execute every combination of prompts, providers, and tests, then run promptfoo view to open the results grid in a browser.
$ promptfoo eval
[==================] 8/8 complete
$ promptfoo view
Open http://localhost:15500
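To make the "every combination" behaviour concrete, here is a minimal hand-written config (not necessarily what promptfoo init generates) with two prompts, two providers, and two tests, which multiplies out to 2 × 2 × 2 = 8 evaluations. The prompt wording, the `text` variable, and the model choices are assumptions for the sake of the example.

```yaml
# promptfooconfig.yaml (hypothetical example, not the generated sample)
description: "Summary prompt comparison"

# Two prompt variants
prompts:
  - "Summarize in one sentence: {{text}}"
  - "You are a careful editor. Give a one-sentence summary of: {{text}}"

# Two providers (both read OPENAI_API_KEY from the environment)
providers:
  - openai:gpt-4o-mini
  - openai:gpt-4o

# Two test cases, so 2 prompts x 2 providers x 2 tests = 8 runs
tests:
  - vars:
      text: "The meeting was moved from Tuesday to Thursday at 3 pm."
    assert:
      - type: contains
        value: "Thursday"
  - vars:
      text: "Shipping is free on orders over 50 dollars."
    assert:
      - type: contains
        value: "50"
```

Each cell of the results grid then corresponds to one prompt, provider, and test triple.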
Everyday workflows
- Compare GPT-4o, Claude, and a local Llama on the same test set to pick the cheapest model that still passes.
- Gate pull requests by running promptfoo eval in CI, which exits with a non-zero status when a test fails, so prompt regressions never ship.
- Run promptfoo redteam to generate adversarial inputs (jailbreaks, PII leaks, prompt injection) against your app.
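The pull-request gate from the list above could be wired up roughly as follows. This sketch assumes GitHub Actions, a repository secret named OPENAI_API_KEY, and a config at the repository root; the workflow file name and the watched paths are hypothetical, and it relies on promptfoo eval returning a non-zero exit code when a test fails.

```yaml
# .github/workflows/prompt-eval.yml (hypothetical file name)
name: prompt-eval

on:
  pull_request:
    # Only run when prompts, tests, or the config change (assumed layout)
    paths:
      - "promptfooconfig.yaml"
      - "prompts/**"
      - "tests/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # A failing test should fail this step, and with it the PR check
      - run: npx promptfoo@latest eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Pinning a specific promptfoo version instead of `@latest` would make CI runs more reproducible.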
Gotchas and tips
Model-graded assertions use an LLM to grade outputs, so each graded assertion adds at least one extra model call and the grader itself can be wrong. Pin the grader to a strong model (gpt-4o or claude-3-5-sonnet), keep caching enabled (it is on by default, and passing --no-cache turns it off), and spot-check failures manually for the first few runs.
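One way to pin the grader is to set it once under defaultTest so that every model-graded assertion uses the same model. The sketch below assumes an OpenAI grader; the prompt, the `ticket` variable, and the rubric wording are invented for illustration.

```yaml
# promptfooconfig.yaml (illustrative)
prompts:
  - "Reply politely to this support ticket: {{ticket}}"

# Model under test: a cheaper model
providers:
  - openai:gpt-4o-mini

defaultTest:
  options:
    # Grader for model-graded assertions such as llm-rubric,
    # pinned to a stronger model than the one under test
    provider: openai:gpt-4o

tests:
  - vars:
      ticket: "My order arrived damaged and I want a refund."
    assert:
      - type: llm-rubric
        value: "Apologizes, acknowledges the damage, and explains how to get a refund"
```

Keeping the grader separate from the providers list means a change to the models under test does not silently change how outputs are scored.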
Config files grow fast. Split tests into separate YAMLs and include them with tests: file://tests/*.yaml, and store expensive fixtures in vars files so you’re not pasting 500-line prompts into the main config. Commit the SQLite database to keep a history if you don’t have a shared backend.
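A split layout along those lines might look like the two files below. The directory names (prompts/, tests/, fixtures/), the file names, and the `article` variable are hypothetical; the main config stays short and points at everything else by path.

```yaml
# promptfooconfig.yaml (main config, kept small)
prompts:
  - file://prompts/summarize.txt

providers:
  - openai:gpt-4o-mini

# Load every test file in the tests/ directory
tests: file://tests/*.yaml
```

```yaml
# tests/summaries.yaml (one of several test files; contents are made up)
- description: "Long article keeps the headline fact"
  vars:
    # Large fixture lives in its own file instead of inline in YAML
    article: file://fixtures/long_article.txt
  assert:
    - type: contains
      value: "quarterly revenue"
```

Grouping test files by feature or prompt keeps diffs small when only one area changes.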
Who it’s for
Engineers who treat prompts as production code and want a Jest-style workflow for them. Also security teams running red-team exercises — the built-in attack library is genuinely useful and saves weeks of manual work.