How to Use Promptfoo
Installing promptfoo, writing promptfooconfig.yaml, assertions, red-teaming, CI integration, web UI.
Promptfoo is a CLI that treats prompts like code — YAML tests, assertions, diffs, and CI-friendly output.
Promptfoo is what unit tests look like for LLMs. You declare prompts, test cases, and assertions in a YAML file, run promptfoo eval, and get a side-by-side grid of outputs with pass/fail scoring. It plugs into CI, supports red-teaming, and speaks nearly every model provider natively.
What it is
A Node.js CLI and web viewer. It loads a config, fans out requests across providers and prompt variants, runs deterministic (contains, regex, equals) and model-graded (llm-rubric, similar) assertions, and writes results to a local SQLite database. The viewer renders diffs and lets you share results.
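As a rough illustration of that flow, a minimal promptfooconfig.yaml might look like the sketch below. The prompt text, variable names, and expected values are invented for this example, and the provider IDs assume OpenAI and Anthropic API keys are set in the environment; check the exact IDs against the provider list for your promptfoo version.

```yaml
# promptfooconfig.yaml (illustrative sketch; prompts and values are hypothetical)
description: Translation smoke test

# Two prompt variants; {{language}} and {{text}} are filled from each test's vars.
prompts:
  - "Translate to {{language}}: {{text}}"
  - "You are a translator. Render this in {{language}}: {{text}}"

# Every prompt is run against every provider.
providers:
  - openai:gpt-4o-mini
  - anthropic:messages:claude-3-5-sonnet-20241022

tests:
  - vars:
      language: French
      text: Good morning
    assert:
      # Deterministic check: case-insensitive substring match.
      - type: icontains
        value: bonjour
      # Model-graded check: a grader LLM scores the output against this rubric.
      - type: llm-rubric
        value: Is a natural French greeting with no English left in it
  - vars:
      language: Spanish
      text: Thank you
    assert:
      - type: icontains
        value: gracias
```

With two prompts, two providers, and two test cases, this sketch produces 2 × 2 × 2 = 8 outputs in the results grid.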
Install / set up
# global install
npm install -g promptfoo
promptfoo init
export OPENAI_API_KEY=sk-...
First run
promptfoo init creates a promptfooconfig.yaml with a sample prompt and two test cases. Run promptfoo eval and it executes every combination of prompts, providers, and tests and prints a results table in the terminal. Run promptfoo view afterwards to open the results grid in a browser.
$ promptfoo eval
[==================] 8/8 complete
$ promptfoo view
Open http://localhost:15500
Everyday workflows
- Compare GPT-4o, Claude, and a local Llama on the same test set to pick the cheapest model that still passes.
- Gate pull requests with promptfoo eval in CI so prompt regressions never ship. The command exits with a non-zero status when a test fails, which fails the job.
- Run promptfoo redteam to generate adversarial inputs (jailbreaks, PII leaks, prompt injection) against your app.
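One way to wire up the pull-request gate is a GitHub Actions workflow along these lines. This is a sketch under assumptions: the workflow name, path filters, and secret name are placeholders, and it relies on promptfoo eval returning a non-zero exit code when a test fails, which is worth confirming on the version you install.

```yaml
# .github/workflows/prompt-eval.yml (hypothetical file name)
name: prompt-eval

on:
  pull_request:
    # Assumed layout: only run when the config or prompt files change.
    paths:
      - "promptfooconfig.yaml"
      - "prompts/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # A failing assertion should make this step, and therefore the check, fail.
      - name: Run promptfoo eval
        run: npx promptfoo@latest eval -c promptfooconfig.yaml --no-progress-bar
        env:
          # Assumes the key is stored as a repository secret with this name.
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Pinning a specific promptfoo version instead of @latest makes CI runs reproducible.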
Gotchas and tips
Model-graded assertions use an LLM to grade outputs, so each graded test adds at least one extra model call (roughly doubling its cost), and the grader itself can be wrong. Pin the grader to a strong model (gpt-4o or claude-3-5-sonnet), keep the cache on (it is enabled by default, so do not pass --no-cache), and spot-check failures manually for the first few runs.
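Pinning the grader can be sketched with a defaultTest block like the one below. This is a config fragment, not a full file: it assumes prompts and providers are defined elsewhere in the same config, and the rubric text and variable name are made up for illustration.

```yaml
# Fragment of promptfooconfig.yaml (prompts and providers omitted)

# Applies to every test case unless a test overrides it.
defaultTest:
  options:
    # Model used to grade llm-rubric and other model-graded assertions.
    provider: openai:gpt-4o

tests:
  - vars:
      text: Good morning
    assert:
      - type: llm-rubric
        value: Is a polite greeting in the requested language
```

Setting the grader once in defaultTest keeps it independent of the providers under test, so swapping the model being evaluated does not silently change the judge.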
Config files grow fast. Split tests into separate YAMLs and include them with tests: file://tests/*.yaml, and store large fixtures in vars files so you’re not pasting 500-line prompts into the main config. Commit the SQLite database to keep a history if you don’t have a shared backend.
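A split layout might look roughly like this. The directory names, file names, and the document variable are hypothetical; the file:// references for prompts, tests, and vars follow the pattern described above, but paths resolve relative to the config file, so adjust them to your repository.

```yaml
# --- file: promptfooconfig.yaml ---
prompts:
  - file://prompts/summarize.txt
providers:
  - openai:gpt-4o-mini
# Load every test file matching the glob instead of listing tests inline.
tests: file://tests/*.yaml

---
# --- file: tests/contracts.yaml (hypothetical) ---
# Each test file is a plain list of test cases.
- vars:
    # The fixture is read from disk rather than pasted into the config.
    document: file://fixtures/long_contract.txt
  assert:
    - type: icontains
      value: termination
```

Grouping test files by feature or risk area keeps diffs small when only one set of cases changes.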
Who it’s for
Engineers who treat prompts as production code and want a Jest-style workflow for them. Also security teams running red-team exercises — the built-in attack library is genuinely useful and saves weeks of manual work.