DSPy: declarative prompting in production

There is a small genre of framework that the smartest researchers love and the median engineer never picks up. PyTorch Lightning was one for a while. JAX still is. And DSPy — Stanford NLP’s “framework for programming, not prompting” — has become the canonical example for LLM application code.

The DSPy pitch is irresistible on paper. Stop hand-writing prompts. Declare what you want as a typed Signature — question -> answer — compose modules (ChainOfThought, ReAct, Retrieve), give the framework a small training set and a metric, and let an optimizer (MIPROv2, BootstrapFewShot, COPRO) discover prompts that beat what a human would write. The compiled program is portable across models. When you swap GPT-4o for Claude or for a local Qwen, you re-run the optimizer and ship.

If this sounds like the future, it sometimes is. Omar Khattab, DSPy’s creator, has spent two and a half years arguing — convincingly — that prompts are the last bastion of pre-deep-learning string engineering, and that we should be doing to prompts what we did to features when scikit-learn arrived. Databricks hired him. The Stanford NLP group ships major releases. DSPy 3.0 dropped at the Data + AI Summit 2025. The repo has crossed 28,000 stars and the package gets ~160,000 PyPI downloads a month.

And yet. If you survey ten production LLM teams in 2026, you’ll find one running DSPy and nine running hand-crafted prompts behind LangGraph or vanilla function calls. That gap between “obviously a better idea” and “what most teams actually do” is the whole subject of this post.

What DSPy is actually doing

The mental model that helps: DSPy treats your LLM program the way PyTorch treats a neural network. You declare the structure (Signatures and Modules); you run a training loop (an optimizer); you get back a compiled program with frozen prompt parameters.

DSPy’s whole shape: Signatures define the I/O contract, Modules wrap a signature with a calling strategy, an Optimizer fits prompts to a metric. The compiled artifact is what ships.

Concretely:

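import dspy
from dspy.evaluate import answer_exact_match  # built-in exact-match metric

# Assumes a language model and a retriever were configured beforehand
# (for example with dspy.configure).

class GenerateAnswer(dspy.Signature):
    """Answer questions given retrieved context."""
    context: list[str] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="short factual answer")

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()  # initialise dspy.Module before assigning sub-modules
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        # Fetch passages, then let the chain-of-thought module answer from them.
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Train: give DSPy examples + a metric.
# trainset is assumed to be a list of dspy.Example objects with question/answer fields.
optimizer = dspy.MIPROv2(metric=answer_exact_match)
compiled_rag = optimizer.compile(RAG(), trainset=trainset)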

What compile() actually does is search the space of (instruction text, few-shot demonstrations) for the combination that maximises your metric on the trainset. MIPROv2 runs a Bayesian-optimisation search over both at once; BootstrapFewShot uses a teacher model (by default the program itself) to generate demonstrations and keeps the ones that pass the metric; COPRO proposes and refines the instruction text of each module by coordinate ascent.
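
To make that search concrete, here is a toy version that runs offline. It is a sketch, not how any DSPy optimizer is implemented: it scores every (instruction, demo subset) pair exhaustively instead of using Bayesian optimisation, and the model, trainset and candidate instructions are all hypothetical stand-ins.

```python
from itertools import combinations

# Hypothetical stand-ins so the sketch runs without any API key.
TRAINSET = [("2+2", "4"), ("3+3", "6"), ("5+1", "6")]
INSTRUCTIONS = ["Answer the question.", "Answer with a single number."]
DEMO_POOL = [("1+1", "2"), ("2+3", "5")]


def fake_lm(instruction, demos, question):
    """Pretend model: answers tersely only when told to and shown a demo."""
    value = str(sum(int(part) for part in question.split("+")))
    if "single number" in instruction and demos:
        return value
    return f"The answer is {value}."


def exact_match(gold, predicted):
    """Metric: 1.0 when the prediction equals the gold answer, else 0.0."""
    return float(gold.strip() == predicted.strip())


def toy_compile(instructions, demo_pool, trainset, max_demos=2):
    """Score every (instruction, demo subset) pair and keep the best one."""
    best_score, best_config = -1.0, None
    for instruction in instructions:
        for k in range(max_demos + 1):
            for demos in combinations(demo_pool, k):
                scores = [
                    exact_match(gold, fake_lm(instruction, demos, question))
                    for question, gold in trainset
                ]
                score = sum(scores) / len(scores)
                if score > best_score:  # ties keep the earlier, smaller config
                    best_score, best_config = score, (instruction, demos)
    return best_score, best_config


score, (instruction, demos) = toy_compile(INSTRUCTIONS, DEMO_POOL, TRAINSET)
print(score, instruction, demos)
```

The real optimizers exist precisely because this grid explodes: with free-form instruction text and a large demo pool you cannot enumerate the space, so you need a smarter search and a budget.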

The thing to internalise: the prompt is not the source code anymore. The Signature is. The actual prompt is a build artifact produced by the optimizer. This is exactly the relationship between weights and architecture in a neural network — you don’t read the weights, you read the architecture and trust the loss function to find the weights.

For research workloads this is a win without trade-offs. Stanford NLP’s own benchmark work — RAG over Wikipedia, multi-hop question answering with ColBERTv2, agentic reasoning chains — gets clean state-of-the-art numbers in DSPy in a couple of hundred lines because the hyperparameter that matters most (the prompt) is being learned, not guessed.

Who’s actually running it in production

Khattab’s group maintains a running list of production deployments and the names are real:

JetBlue ships BlueBot — its customer-service hybrid RAG chatbot — on DSPy + Databricks. The team specifically points to feedback-classification and retrieval-quality metrics that DSPy was able to optimise against where hand-prompting had plateaued.
Databricks itself runs multiple internal DSPy pipelines and treats DSPy as a first-class abstraction in its Lakehouse AI tooling. The team wrote up the integration story when Khattab joined.
VMware uses DSPy for retrieval pipelines (the company has presented this in Khattab’s talks at Databricks events and at the Data + AI Summit).
Shopify, Dropbox, Moody’s, AWS, Sephora all show up in the maintained use-case list — Khattab’s claim is “in production at … and dozens more.”

The pattern across all of these:

The pipeline is complex enough that hand-prompting hit a ceiling. Two-stage retrieval, multi-hop reasoning, multiple chained modules. Single-step Q&A doesn’t motivate DSPy.
There’s an evaluable metric. JetBlue can score retrieval recall and answer quality; that’s the loss function the optimizer needs.
The team has internal ML capacity. Reading what MIPROv2 is doing, debugging when the optimizer overfits, knowing when to use BootstrapFewShot vs MIPROv2 — these are ML-engineer questions, not prompt-engineer questions.

When all three conditions are met, DSPy is meaningfully better than hand-crafted prompts in the same way scikit-learn is meaningfully better than hand-tuned weights. When any one of them is missing, the trade-offs go the other way.
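
The second condition is the one teams most often underestimate, so it is worth seeing how small a usable metric can be. The sketch below follows the (example, pred, trace=None) shape that DSPy metrics take, but it imports nothing from DSPy; the normalisation rules and the SimpleNamespace objects are my own illustrative choices.

```python
import re
import string
from types import SimpleNamespace


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_match(example, pred, trace=None):
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(example.answer) == normalize(pred.answer)


def average_score(metric, pairs):
    """Mean metric value over (example, prediction) pairs."""
    return sum(float(metric(ex, pr)) for ex, pr in pairs) / len(pairs)


# Hypothetical gold/prediction pairs standing in for a labelled eval set.
pairs = [
    (SimpleNamespace(answer="Paris"), SimpleNamespace(answer="paris.")),
    (SimpleNamespace(answer="The Nile"), SimpleNamespace(answer="Amazon")),
]
print(average_score(answer_match, pairs))
```

If you cannot write something like this for your task, the optimizer has nothing to climb, and that is usually the real reason a DSPy pilot stalls.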

Why most teams quietly keep their prompts

Now the part of the post that explains the 1-in-10 number. DSPy’s adoption ceiling isn’t a marketing problem. It’s a structural one, because the thing DSPy abstracts away — the prompt — is exactly the thing most production teams want to read.

A list, distilled from the conversations I’ve had with teams that evaluated DSPy and didn’t adopt:

You can’t read what shipped. The compiled prompt is whatever the optimizer settled on. It probably looks fine. It might also have a sentence that drifts during the next compile run. When an incident happens at 3 AM, you want a prompt file in git, not “let me re-run the optimizer to see what it would generate.”
The optimizer is expensive. MIPROv2 with default parameters can spend hundreds of LLM calls compiling a multi-module program. That’s a real cost in dollars and time, and it’s a cost you pay again whenever you change models. Some teams have an annual budget for this; many don’t.
Onboarding is harder. A new engineer reading a LangChain agent in 2026 understands what’s happening in minutes. A new engineer reading a DSPy program has to learn Signatures, Modules, the optimizer pipeline, and the difference between forward() and what gets compiled. The steepness of that learning curve is the dominant adoption cost.
The “portable across models” pitch isn’t free. When you swap models, you don’t just swap a flag; you re-compile. If you compile against GPT-4o-mini and the optimizer found a prompt that works because of a specific quirk of that model’s instruction-following, you may need to re-run the optimizer, and possibly rebuild the trainset, before the same Signature works on Claude.
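
The cost objection is easy to put rough numbers on. The arithmetic below is a back-of-envelope sketch: every figure in it is a hypothetical placeholder, not a measured call count or a real price, so substitute your own before drawing conclusions.

```python
def compile_cost_usd(llm_calls, tokens_per_call, usd_per_million_tokens):
    """Rough cost of one optimizer run: calls x tokens x price."""
    return llm_calls * tokens_per_call * usd_per_million_tokens / 1_000_000


def total_cost_usd(cost_per_compile, compiles_per_model, models):
    """Every model you target pays for its own compile runs."""
    return cost_per_compile * compiles_per_model * models


# All numbers below are hypothetical placeholders, not measurements or real prices.
per_compile = compile_cost_usd(
    llm_calls=800, tokens_per_call=3_000, usd_per_million_tokens=5.0
)
total = total_cost_usd(per_compile, compiles_per_model=10, models=3)
print(f"one compile: ${per_compile:.2f}, all compiles: ${total:.2f}")
```

The multiplier that hurts is rarely the single run. It is the number of times you recompile while iterating, multiplied again by every model you want to support.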

The teams that have adopted DSPy and stuck with it have all built guardrails around these. JetBlue inspects compiled artifacts and checks them into git as text. Databricks built tooling around the optimizer’s logs. Both teams have an ML engineer in the room. The median product team building a chatbot does not.
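
The “check the artifact into git as text” guardrail is simple enough to sketch. DSPy can save a compiled program to disk itself; the code below deliberately does not call it and instead shows the general idea with the standard library: render the compiled prompt deterministically, then diff a fresh compile against the committed copy. The function names and the artifact layout are hypothetical.

```python
import difflib
import json


def render_artifact(instruction, demos):
    """Serialize a compiled prompt deterministically so git diffs stay stable."""
    payload = {
        "instruction": instruction,
        "demos": [{"question": q, "answer": a} for q, a in demos],
    }
    return json.dumps(payload, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


def artifact_diff(committed_text, new_text):
    """Unified-diff lines between the committed artifact and a fresh compile."""
    return list(
        difflib.unified_diff(
            committed_text.splitlines(),
            new_text.splitlines(),
            fromfile="committed",
            tofile="recompiled",
            lineterm="",
        )
    )


# Hypothetical artifacts: the second compile drifted in one sentence.
demos = [("Capital of France?", "Paris")]
committed = render_artifact("Answer with a short fact.", demos)
recompiled = render_artifact("Answer with a short, factual phrase.", demos)
drift = artifact_diff(committed, recompiled)
print("\n".join(drift))
```

Run in CI, an empty diff means the recompile was a no-op and a non-empty one becomes a reviewable change, which is exactly the 3 AM property the hand-written prompt file had.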

The optimizer family, briefly

A quick orientation to DSPy’s optimizers, because the names sound similar and the trade-offs aren’t obvious until you’ve tried each. The four that come up most (BootstrapFewShot, MIPROv2, COPRO and BootstrapFinetune) are four different bets on where the win lives. Most production users start with BootstrapFewShot for quick iteration and move to MIPROv2 when the metric plateaus.

The progression most teams follow:

Start with BootstrapFewShot because it’s cheap. The teacher model (usually a larger model than the student) generates demonstrations the optimizer keeps.
Move to MIPROv2 when you’ve maxed out what few-shot demos alone can do. Bayesian optimisation over both instructions and demos — more expensive, frequently meaningfully better.
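Reach for COPRO when instruction wording, rather than demonstrations, is the bottleneck. It refines each module’s instruction by coordinate ascent, which suits pipelines where several modules’ prompts influence each other and improving one in isolation breaks the others.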
Use BootstrapFinetune as a closing step when the compiled program works but the per-call cost is too high — distil it into a smaller fine-tuned model.
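
The first step in that progression is worth seeing in miniature, because the idea is simpler than the name. This is a toy sketch of the bootstrapping idea only, not DSPy’s implementation: a teacher answers the training questions, the metric filters its outputs, and the survivors become few-shot demonstrations. The teacher, trainset and prompt layout are hypothetical.

```python
def bootstrap_demos(teacher, trainset, metric, max_demos=2):
    """Run the teacher over the trainset and keep outputs the metric accepts."""
    demos = []
    for question, gold in trainset:
        predicted = teacher(question)
        if metric(gold, predicted):
            demos.append((question, predicted))
        if len(demos) == max_demos:
            break
    return demos


def build_prompt(instruction, demos, question):
    """Assemble the few-shot prompt the student model would receive."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    return f"{instruction}\n\n{shots}\n\nQ: {question}\nA:"


def toy_teacher(question):
    """Hypothetical teacher: right on arithmetic it can parse, wrong otherwise."""
    try:
        return str(sum(int(part) for part in question.split("+")))
    except ValueError:
        return "unknown"


trainset = [("2+2", "4"), ("two plus two", "4"), ("3+4", "7"), ("1+1", "2")]
demos = bootstrap_demos(toy_teacher, trainset, lambda gold, pred: gold == pred)
prompt = build_prompt("Answer with a single number.", demos, "5+5")
print(prompt)
```

Notice that the failed example never reaches the prompt. The metric is doing the curation a prompt engineer would otherwise do by hand, which is why the approach is cheap and why it stops helping once good demonstrations alone are no longer the constraint.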

The Databricks blog walks through the JetBlue version of this progression: they started with hand-crafted prompts, moved to BootstrapFewShot when hand-tuning plateaued, then to MIPROv2 when answer-quality was no longer improving with demos alone. The framing they used internally was “we have an ML team, we should write ML code instead of prompt strings.”

DSPy and the broader prompt-optimization world

DSPy didn’t invent automatic prompt search — but it’s the project that made it a recognisable engineering discipline. The 2026 landscape now has several adjacent tools:

TextGrad treats prompts as differentiable objects and uses LLM-generated “gradients” (textual critiques) to improve them. Conceptually close to DSPy’s optimizer family.
Pydantic Evals is the Pydantic team’s answer to “let me at least evaluate my prompts even if I don’t automate their optimisation.” Adoption has been faster than DSPy in some teams because the abstraction is shallower.
Provider-side optimisers — OpenAI’s Prompt Generator and Anthropic’s prompt improvement tools have started to offer “we’ll improve your prompt for you” features. These don’t match DSPy’s optimiser quality but they’re integrated into the API console, which lowers the bar to try.

The interesting question is whether DSPy stays a framework or becomes a set of techniques that all the other frameworks borrow. DSPy 3.0 moved in the direction of being more interoperable — better integration with LangChain, with MLflow, with Databricks Model Serving — which is the kind of move you make when you’ve decided the techniques matter more than the framework boundaries.

A useful frame: DSPy’s relationship to the rest of the LLM stack now looks a lot like scikit-learn’s relationship to web frameworks in the 2010s. You don’t run a Django app on scikit-learn; you run scikit-learn behind a Django endpoint when the feature requires a trained model. DSPy’s natural home in 2026 is the same: behind a LangGraph node or a Pydantic AI agent, when the prompt is complex enough to need optimisation. The “all-in-on-DSPy” architecture is going away; the “DSPy as the optimiser tier” architecture is the one that’s growing.

The companies running this combo at scale — JetBlue, Databricks, the new MLflow-integrated workflows — describe it the same way. DSPy is where the prompt artefact gets compiled. LangGraph or the application layer is where the compiled artefact runs. The boundary is clear, the ownership is clear, and the integration is intentional.
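
That boundary can be sketched in a few lines. The example below assumes a hypothetical JSON artifact and a plain function shaped like a graph node (state in, partial state update out); it imports neither DSPy nor LangGraph, and the model client is a stub so the code runs offline.

```python
import json
import tempfile
from pathlib import Path


def export_artifact(path, instruction, demos):
    """Offline tier: persist what the optimizer produced as reviewable JSON."""
    payload = {"instruction": instruction, "demos": demos}
    Path(path).write_text(json.dumps(payload, indent=2))


def make_answer_node(path, call_model):
    """Application tier: load the frozen artifact once, return a node function."""
    artifact = json.loads(Path(path).read_text())

    def answer_node(state):
        shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in artifact["demos"])
        prompt = f"{artifact['instruction']}\n{shots}\nQ: {state['question']}\nA:"
        return {"answer": call_model(prompt)}

    return answer_node


def echo_model(prompt):
    """Hypothetical model client: upper-cases the final question in the prompt."""
    return prompt.splitlines()[-2][3:].upper()


with tempfile.TemporaryDirectory() as tmp:
    artifact_path = Path(tmp) / "rag_prompt.json"
    export_artifact(artifact_path, "Answer briefly.", [["Capital of France?", "Paris"]])
    node = make_answer_node(artifact_path, echo_model)
    result = node({"question": "ping"})
print(result)
```

The design choice that matters is that the node never imports the optimizer. The application only depends on a file someone reviewed, so a recompile is a deploy of data rather than a change to serving code.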

A small caution on benchmarks vs production

One pattern worth flagging because it shows up in nearly every DSPy debate. The framework’s research benchmarks — RAG on Wikipedia, multi-hop QA, code generation — are the strongest signal for what an optimiser can do when the metric is clean and the dataset is large. The production benchmarks are harder, because production metrics are noisy and label budgets are small.

Two specific failure modes show up in the wild:

The optimiser overfits the trainset. With 30 examples and an aggressive MIPROv2 run, the optimiser will discover a prompt that nails those 30 examples and underperforms on the 31st. The fix is the same as for any ML model: a bigger held-out set, stricter cross-validation, shorter optimiser runs.
The metric is a proxy that drifts. “Answer F1 against a labelled set” feels like the right metric until you discover that the labels were generated by an earlier version of the model, and the optimiser has learned to mimic that earlier version’s style. Real human-graded metrics are the cure but they’re expensive.

The teams that ship DSPy successfully treat the optimiser run like a training run for a small neural network. Reproducible seed, versioned trainset, held-out test set with confidence intervals, and the compiled artefact reviewed before promotion. The teams that fail treat it like a “magic improve-my-prompt” button.
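
Two items on that checklist, the reproducible split and the confidence interval, fit in a short standard-library sketch. Nothing here is DSPy API; the function names, the 25% hold-out default and the percentile bootstrap are my assumptions about a reasonable minimal setup.

```python
import random


def split_examples(examples, holdout_fraction=0.25, seed=0):
    """Shuffle with a fixed seed and split into (train, holdout)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[:-n_holdout], shuffled[-n_holdout:]


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean per-example score."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical held-out results: 8 correct answers out of 10.
train, holdout = split_examples(range(20), holdout_fraction=0.25, seed=7)
scores = [1.0] * 8 + [0.0] * 2
lower, upper = bootstrap_ci(scores)
print(f"holdout accuracy 0.80, 95% CI roughly [{lower:.2f}, {upper:.2f}]")
```

With ten held-out examples the interval is wide enough to swallow most “improvements” between two compiled prompts, which is the practical argument for a bigger held-out set before trusting any optimiser run.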

When DSPy is the right tool, concretely

A working rule:

Pick DSPy when the program is at least three modules deep, you have an evaluable metric, the eval set is at least ~50 labelled examples, you have ML engineering on the team, and the upside of an optimised pipeline (better retrieval recall, fewer multi-hop errors, lower latency through better prompts) justifies the build complexity.
Skip DSPy when the prompt is the product (writing assistants, brand-voice tools, anything where a copywriter is the implicit editor), when the team has no ML engineer, when the model is locked to one provider and you don’t expect that to change, or when the pipeline is a single LLM call.
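
The working rule is mechanical enough to write down as a checklist. This is just the rule above restated as a sketch; the field names are mine, the thresholds are the ones given in the text, and the judgement call about whether the upside justifies the build complexity is left out because it does not reduce to a boolean.

```python
from dataclasses import dataclass


@dataclass
class Project:
    modules: int
    has_metric: bool
    labelled_examples: int
    has_ml_engineer: bool
    prompt_is_product: bool = False
    provider_locked: bool = False


def dspy_fit(project):
    """Return (recommend, blockers) according to the working rule."""
    blockers = []
    if project.modules < 3:
        blockers.append("pipeline is fewer than three modules deep")
    if not project.has_metric:
        blockers.append("no evaluable metric")
    if project.labelled_examples < 50:
        blockers.append("fewer than ~50 labelled examples")
    if not project.has_ml_engineer:
        blockers.append("no ML engineering on the team")
    if project.prompt_is_product:
        blockers.append("the prompt is the product")
    if project.provider_locked:
        blockers.append("model is locked to one provider")
    return (not blockers, blockers)


# Hypothetical projects for illustration.
pilot = Project(modules=4, has_metric=True, labelled_examples=80, has_ml_engineer=True)
chatbot = Project(modules=1, has_metric=True, labelled_examples=80, has_ml_engineer=False)
print(dspy_fit(pilot))
print(dspy_fit(chatbot))
```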

The Stanford research story is genuinely thrilling and the production case studies (JetBlue, Databricks, VMware) are real. The reason DSPy isn’t the default is the same reason scikit-learn isn’t the default inside every web app — most teams aren’t building ML systems, they’re building features that use an LLM, and the engineering cost-curve goes the other way for them.

What to take away

A compressed version, for the team lead deciding whether to pilot DSPy next quarter:

DSPy is the most academically rigorous prompt framework we have. Signatures, Modules, MIPROv2 — these are the right primitives, and they will outlive the current crop of LLM frameworks.
The production wins are real but narrow. Complex multi-stage pipelines with measurable metrics and ML engineering muscle behind them. JetBlue’s BlueBot, VMware’s retrieval, Databricks’ internal tooling. The framework earns its keep when the problem is hard.
The reason most teams skip it isn’t bad PR — it’s the cost of abstraction. When the prompt is the product, you want to read it. DSPy’s whole pitch is that you shouldn’t have to. That’s a real trade-off, and most teams haven’t taken it.

Khattab’s argument is still the right one in the long run: prompts should be a compiled artifact, not the source. We just haven’t finished building the tooling that makes the trade-off zero-cost. When we do — and DSPy 3.0 is closer to that than DSPy 2.0 was — the median team will find it harder to refuse.

Further reading: the DSPy docs, the original DSPy paper, the JetBlue + Databricks writeup, and the ColBERTv2 paper that underpins DSPy’s retrieval story.