Free Tool Arena


How to Use DSPy

Installing dspy, Signatures, Modules (Predict, ChainOfThought, ReAct), the MIPROv2 optimizer, and metric-driven prompts.

Updated April 2026 · 6 min read

DSPy is Stanford’s framework for programming language models instead of prompting them. You declare the inputs and outputs you want; DSPy’s optimizers compile prompts (and sometimes fine-tuned weights) from labeled examples. It’s a mental shift: stop hand-crafting prompt strings, start writing typed signatures, and let a compiler tune the demonstrations for you.


What DSPy actually is

DSPy has three pieces. Signatures declare the input-output contract of a call (“question, context → answer”). Modules wrap signatures with a strategy (Predict, ChainOfThought, ReAct, ProgramOfThought). Optimizers (teleprompters such as BootstrapFewShot, MIPROv2, and BootstrapFinetune) take a small training set plus a metric and search for the demonstrations or instructions that maximize the metric on your model.

The result: the same Python code can target GPT-4o, Claude, Llama, or a local model, and each gets its own tuned prompt. You never hand-edit a prompt string.

Installing

pip install -U dspy

# Optional: provider extras and Hugging Face datasets for building eval sets:
pip install "dspy[anthropic]" datasets

Python 3.9+. DSPy uses LiteLLM under the hood so any model LiteLLM supports works out of the box.
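Because routing goes through LiteLLM, switching providers is just a different model string. A sketch (the model ids below are illustrative; check your provider’s current names and set the matching API key environment variable):

```python
import dspy

# Any LiteLLM-style "provider/model" id works; ids here are examples only.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # OpenAI, needs OPENAI_API_KEY

# Other targets, same code (uncomment one):
# dspy.configure(lm=dspy.LM("anthropic/claude-3-5-sonnet-latest"))  # Anthropic
# dspy.configure(lm=dspy.LM("ollama_chat/llama3",
#                           api_base="http://localhost:11434"))     # local Ollama
```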

First working example

import dspy

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

class Classify(dspy.Signature):
    """Classify the sentiment of a product review."""
    review: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")
    confidence: float = dspy.OutputField(desc="0.0 to 1.0")

classify = dspy.Predict(Classify)

out = classify(review="Battery lasts two days, screen is gorgeous.")
print(out.sentiment, out.confidence)
# positive 0.92

No prompt written. DSPy builds one from the signature docstring, field names, and field descriptions. Swap dspy.Predict for dspy.ChainOfThought to get a reasoning field for free.

A real workflow — compile with an optimizer

The power of DSPy is compiling a module against a metric. Here’s a RAG program optimized with BootstrapFewShot:

import dspy

class RAG(dspy.Module):
    def __init__(self, retriever):
        super().__init__()  # required when subclassing dspy.Module
        self.retrieve = retriever
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question, k=5)
        return self.generate(context=context, question=question)

# Training data: a handful of (question, gold_answer) pairs.
trainset = [
    dspy.Example(question=q, answer=a).with_inputs("question")
    for q, a in load_my_qa_pairs()
]

def exact_match(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()

optimizer = dspy.BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
compiled_rag = optimizer.compile(RAG(retriever=my_retriever), trainset=trainset)

compiled_rag.save("compiled_rag.json")

The compiler runs your program on the training set, keeps the traces where exact_match passes, and bakes them in as few-shot demonstrations. Switching models later? Re-run compile; the demos that work best for Claude aren’t the demos that work best for GPT-4o.
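Re-targeting is a reconfigure plus a recompile. A sketch continuing the example above (`RAG`, `my_retriever`, `optimizer`, and `trainset` are the names defined there; the model id is illustrative):

```python
# Point DSPy at a different model, then recompile the same program.
dspy.configure(lm=dspy.LM("anthropic/claude-3-5-haiku-latest"))  # id illustrative
claude_rag = optimizer.compile(RAG(retriever=my_retriever), trainset=trainset)

# Keep one artifact per model; reload later with RAG(...).load(path).
claude_rag.save("compiled_rag_claude.json")
```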

Evaluation

evaluator = dspy.Evaluate(devset=devset, num_threads=8, display_progress=True)
score = evaluator(compiled_rag, metric=exact_match)

DSPy treats eval as a first-class citizen. If you’re not evaluating numerically, you’re just hand-waving at prompts.
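The loop itself is simple. Here is a plain-Python picture of what an evaluator computes — toy stand-in objects, not the DSPy API:

```python
from types import SimpleNamespace

def evaluate(program, devset, metric):
    """Average a 0/1 metric over a dev set, as a percentage."""
    scores = [metric(ex, program(ex)) for ex in devset]
    return 100.0 * sum(scores) / len(scores)

# Toy stand-ins for a compiled program and labeled examples:
devset = [SimpleNamespace(question="2+2?", answer="4"),
          SimpleNamespace(question="capital of France?", answer="Paris")]
program = lambda ex: SimpleNamespace(answer="4" if "2+2" in ex.question else "Lyon")
exact = lambda ex, pred: ex.answer.lower() in pred.answer.lower()

print(evaluate(program, devset, exact))  # 50.0
```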

Gotchas

You need labeled data. Optimizers are only as good as your training set and metric. 20–50 good examples beat 500 noisy ones every time.

Compilation calls your model a lot. BootstrapFewShot can make hundreds of calls searching for good demos; MIPROv2 makes more. Use a cheap model for the teacher pass when possible (teacher_settings={ "lm": cheap_lm }).
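For example (a sketch — `cheap_lm`’s model id is illustrative, and `exact_match` is the metric defined earlier):

```python
cheap_lm = dspy.LM("openai/gpt-4o-mini")  # illustrative budget model

optimizer = dspy.BootstrapFewShot(
    metric=exact_match,
    max_bootstrapped_demos=4,
    teacher_settings={"lm": cheap_lm},  # bootstrap demos through the cheap model
)
```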

Saved programs include the model choice. A compiled program optimized for one model may under-perform with another. Treat the JSON artifact as model-specific.

Signatures are strict contracts. If the model returns malformed fields, DSPy raises an error. Use dspy.Suggest or dspy.Assert for self-repair loops.

When NOT to use it

Skip DSPy for one-off extraction scripts, demos, or anything you won’t evaluate numerically — the optimizer machinery is overkill. Skip it if you need tight streaming UX (DSPy’s sweet spot is batch-style reasoning and RAG). Reach for DSPy when you’re shipping a classifier, extractor, or RAG system you’ll iterate on for months and care about measurable quality gains.

Map your module graph with the flowchart maker, inspect compiled program JSON with the JSON formatter, and budget optimizer runs with the token counter before you kick off an overnight compile.
