LLMOps — operating LLMs in production
MLOps assumed you trained the model. With LLMs you usually didn't, so the artifact you version is the prompt, evals replace the accuracy number, and the model can change underneath you. This lesson walks through the new loop.
What you'll learn
- What LLMOps keeps from MLOps (versioning, CI, monitoring) and the three assumptions it breaks
- Why the artifact you version is now a prompt + model id + tools + retrieval, not a weights file
- Why non-deterministic output calls for an eval suite (a golden set plus an LLM-as-judge) instead of a single accuracy number
- The four moving parts: prompt versioning, offline evals, online observability, and a gateway
Before you start
On a Friday afternoon someone improved the support bot. They added one sentence to the prompt — “be warm and friendly” — shipped it, and went home. By Monday the downstream JSON parser was failing on about 8% of requests. No code changed. No model changed. The thing that changed was a string in a prompt file with no version, no test, and no eval gate: the friendlier model had started chatting before its JSON, and the parser choked on the prose. Nobody could even say which version of the prompt was live.
LLMOps (LLM Operations — the practice of keeping LLM-powered systems correct, cheap, and fast in production) is what closes that gap. It’s MLOps, re-derived for a world where you usually didn’t train the model and the output is open-ended text.
What carries over, and what breaks
The MLOps loop you already know — data → train → eval → deploy → monitor → retrain — still rhymes. But three of its load-bearing assumptions break the moment the model is an LLM:
- You usually didn’t train the model. The weights belong to OpenAI, Anthropic, or Google, or they’re open-weights you downloaded. Your “training” is prompt design, retrieval, tool wiring, and maybe a light fine-tune. So the artifact you version isn’t a .joblib of weights; it’s the prompt (plus the model id, the retrieval config, and the tool definitions).
- The output is open-ended and non-deterministic. “Accuracy = 0.91” doesn’t exist for “write a helpful, grounded answer.” The same input can give two different outputs. Eval stops being one number and becomes a graded rubric over a fixed set of examples.
- The model can change without you. A provider deprecates a snapshot and silently routes you to a newer one; behaviour shifts; the prompt that worked last week now doesn’t. Pinning a dated model version (gpt-...-2026-04) buys you time, not immunity: pins get retired.
In classic ML you version one file. In an LLM app the “model” is a system of parts — and the prompt is the one that changes most.
The prompt is the artifact — version it
The single biggest LLMOps habit: treat prompts like code. They live in the repo, change through pull requests, carry a version (a hash or a number), and are tied to the eval run that approved them. The Friday story happens because a prompt got edited in a vendor playground and pasted into production — invisible to version control, untested, unattributable.
When something regresses, the first question is always “what changed?” If the
answer “the prompt went from v6 to v7” is a git log away, you can roll back
in seconds. If the prompt lives in a textbox in someone’s browser, you can’t.
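One way to give a prompt "a hash or a number" is to hash everything that defines the release together, so that a change to any part produces a new version. The sketch below is illustrative only: `PromptRelease`, its field names, and the placeholder model id are assumptions, not part of any library or of this lesson's own tooling.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class PromptRelease:
    """Hypothetical record of everything that defines the 'model' for the app."""
    prompt: str                                    # the system prompt text
    model_id: str                                  # pinned model identifier
    retrieval: dict = field(default_factory=dict)  # e.g. {"top_k": 4}
    tools: tuple = ()                              # names of exposed tool definitions

    def version(self) -> str:
        """Short content hash: changing any field yields a different version."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder model id, used only for illustration.
v6 = PromptRelease(prompt="Answer in JSON only.", model_id="example-model-pinned")
v7 = PromptRelease(
    prompt="Answer in JSON only. Be warm and friendly.",
    model_id="example-model-pinned",
)
print(v6.version() != v7.version())  # True: a one-sentence edit is a new version
```

Logging this version string with every call is what makes "what changed?" answerable after the fact.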
Eval: a golden set plus a judge
You cannot ship a prompt change on vibes — “it looks better in my three test chats” is how the Friday regression shipped. The replacement for a single accuracy number is a small eval suite:
- A golden set — a fixed list of representative inputs, each with what a good answer must satisfy (a reference answer, or a checklist/rubric).
- Deterministic checks — cheap, exact, and the first gate: does the output parse as JSON? does it contain the required policy line? is it under the token budget? These catch the structural breaks (like the Friday one).
- A semantic check — for “is this answer actually good?”, use an LLM-as-judge: a second model scores the answer against the rubric (“grounded in the provided context? 1–5”). It’s noisy, so you average over the set and watch the trend, not a single score. (Full treatment in RAG evaluations.)
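The "average over the set and watch the trend" step can be sketched as follows. This is a minimal illustration under stated assumptions: `judge_gate`, the tolerance value, and `stub_judge` are hypothetical, and the stub only stands in for a real call to a second model that scores each answer 1 to 5 against the rubric.

```python
from statistics import mean
from typing import Callable, List, Tuple


def judge_gate(
    answers: List[str],
    judge: Callable[[str], int],
    baseline_mean: float,
    tolerance: float = 0.2,
) -> Tuple[float, bool]:
    """Score every answer, average, and compare with the last approved run."""
    scores = [judge(answer) for answer in answers]
    average = mean(scores)
    # Pass only if the average has not dropped more than `tolerance` below baseline.
    return average, average >= baseline_mean - tolerance


def stub_judge(answer: str) -> int:
    """Stand-in for an LLM judge (assumption): rewards answers that cite the docs."""
    return 5 if "according to the docs" in answer.lower() else 2


candidate_answers = [
    "According to the docs, resets expire after one hour.",
    "According to the docs, refunds take five days.",
    "Probably fine, just try again later.",
]
average, ok = judge_gate(candidate_answers, stub_judge, baseline_mean=4.0)
print(average, ok)
```

Comparing against a baseline with a tolerance, rather than reacting to any single score, is one simple way to absorb the judge's noise.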
You gate the deploy on the suite, exactly like a test suite gates a code merge. The structural gate alone is enough to catch the Friday regression.
The eval gate: cheap structural checks first, then a semantic judge, scored over a fixed golden set — gate the deploy on the trend.
Run over a three-question golden set, the original prompt (v1) passes 3/3. The friendlier v2 drops to 2/3: the long password question trips the prose-before-JSON branch and the parser rejects it. A 30-line gate in CI would have turned the Friday incident into a failed check on the pull request.
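A structural gate of the kind described above might look like the sketch below. It is not the lesson's original code: `fake_model` is a hypothetical stand-in that simulates the regression (the v2 prompt chatting before its JSON on long questions), and the golden-set questions are invented for illustration.

```python
import json

# Invented golden set; a real one would come from representative traffic.
GOLDEN_SET = [
    "Where is my order?",
    "Cancel my subscription.",
    "I forgot my password and the reset email never arrives, what should I do now?",
]


def fake_model(prompt_version: str, question: str) -> str:
    """Stand-in for the real model call (simulated behaviour, not a real API)."""
    payload = json.dumps({"intent": "support", "reply": "We are on it."})
    # Simulated regression: the friendlier v2 prompt adds prose before the JSON
    # when the question is long.
    if prompt_version == "v2" and len(question) > 40:
        return "Happy to help! " + payload
    return payload


def passes_structure(output: str) -> bool:
    """Deterministic check: output must be a JSON object with a 'reply' key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "reply" in parsed


def run_gate(prompt_version: str) -> int:
    """Return how many golden-set questions produce structurally valid output."""
    return sum(passes_structure(fake_model(prompt_version, q)) for q in GOLDEN_SET)


for version in ("v1", "v2"):
    print(f"{version}: {run_gate(version)}/{len(GOLDEN_SET)}")
# prints "v1: 3/3" then "v2: 2/3"
```

In CI, the same script would exit with a non-zero status whenever the pass count falls below the size of the golden set, which is what blocks the merge.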
Observe what you can’t reproduce
Offline evals catch what you can foresee. Production catches the rest — and because LLM output is non-deterministic and the provider’s model can shift, observability is not optional. Log, for every call:
- the prompt version and model id that served it,
- tokens in / out and the cost of the call,
- latency (and whether it streamed),
- a sampled trace of the actual input and output.
Two dashboards earn their keep immediately: cost per request (LLM bills are per-token, so a prompt that doubles in length roughly doubles the input-token cost of every call; see Cost & latency engineering) and latency p50/p95. A scheduled re-run of the golden set against the live prompt and model is your drift detector: if groundedness quietly drops while your prompt version has not changed, the provider probably changed the model under you.
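The per-call log record and the two dashboard numbers can be sketched like this. The field names, the `CallRecord` class, and the prices are all assumptions made for illustration; real prices depend on the provider and model.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CallRecord:
    """Hypothetical shape of one logged LLM call."""
    prompt_version: str
    model_id: str
    tokens_in: int
    tokens_out: int
    latency_ms: float
    streamed: bool
    trace: Optional[str] = None  # sampled input/output, kept for a fraction of calls


# Made-up prices in USD per 1,000 tokens, for illustration only.
PRICE_IN_PER_1K = 0.001
PRICE_OUT_PER_1K = 0.002


def cost_usd(record: CallRecord) -> float:
    """Per-token billing: input and output tokens are priced separately."""
    return (
        record.tokens_in / 1000 * PRICE_IN_PER_1K
        + record.tokens_out / 1000 * PRICE_OUT_PER_1K
    )


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile over a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


records = [
    CallRecord("v7", "example-model-pinned", 1000, 500, float(ms), streamed=False)
    for ms in range(100, 1100, 100)  # ten calls with latencies 100..1000 ms
]
latencies = [r.latency_ms for r in records]
print(cost_usd(records[0]), percentile(latencies, 50), percentile(latencies, 95))
```

Because every record carries the prompt version and model id, both dashboards can be split by version, which is how a cost or latency jump gets traced back to a specific prompt change.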
The gateway — one choke point for every call
Don’t let 40 services each call the model API their own way. Route every LLM call through a thin gateway (a proxy in front of the providers). One place to enforce the things you’ve learned elsewhere in this curriculum:
- Caching — return the stored answer for a repeated or semantically identical prompt (Caching: exact, semantic & prompt).
- Rate limits + budget caps — stop a runaway loop from becoming a five-figure bill (Rate limiting & denial-of-wallet).
- Retries + fallback model — when the primary provider 503s, fail over instead of failing (Circuit breakers & resilience).
- Logging — every call already passes through here, so this is where the observability above gets captured, for free.
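The four responsibilities above fit in a small class. The following is a toy, in-process sketch rather than a production proxy: `Gateway`, `BudgetExceeded`, and the two provider functions are hypothetical, the cache is exact-match only, and the budget is counted in calls instead of dollars to keep the example short.

```python
from typing import Callable, Dict, List, Tuple

Provider = Callable[[str], str]


class BudgetExceeded(RuntimeError):
    """Raised when the call budget is spent."""


class Gateway:
    """Toy gateway: exact cache, call budget, fallback provider, call log."""

    def __init__(self, primary: Provider, fallback: Provider, max_calls: int) -> None:
        self.primary = primary
        self.fallback = fallback
        self.max_calls = max_calls
        self.calls_made = 0
        self.cache: Dict[str, str] = {}
        self.log: List[Tuple[str, str]] = []

    def complete(self, prompt: str) -> str:
        # Caching: a repeated prompt never reaches a provider.
        if prompt in self.cache:
            self.log.append(("cache", prompt))
            return self.cache[prompt]
        # Budget cap: refuse instead of letting a runaway loop keep spending.
        if self.calls_made >= self.max_calls:
            raise BudgetExceeded(f"budget of {self.max_calls} calls spent")
        self.calls_made += 1
        # Fallback: if the primary is unreachable, try the secondary provider.
        try:
            answer = self.primary(prompt)
            served_by = "primary"
        except ConnectionError:
            answer = self.fallback(prompt)
            served_by = "fallback"
        self.cache[prompt] = answer
        self.log.append((served_by, prompt))  # Logging: one place sees every call
        return answer


def down_primary(prompt: str) -> str:
    """Simulated outage of the primary provider."""
    raise ConnectionError("503 from primary")


def steady_fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"


gateway = Gateway(down_primary, steady_fallback, max_calls=1)
first = gateway.complete("reset password")   # served by the fallback
second = gateway.complete("reset password")  # served from the cache
print(gateway.log)
```

A real gateway would typically run as a separate service and add retries with backoff before failing over; the point here is only that all four concerns live behind a single `complete` call.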
So what is LLMOps, in one line?
It’s MLOps with the artifact moved from weights to prompts, the eval moved from one number to a graded set, and a new failure mode — the model changing without you — that makes observability and pinned versions load-bearing rather than nice-to-have.
Quick check
Next
You’ve been operating an LLM app — but on whose computers? Prompts, gateways, vector stores, and GPUs all run somewhere. The next lesson is the ground they stand on: the cloud — AWS, Azure, and GCP — and how to read the menu without drowning in three hundred service names.
Practice this in an interview
LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.
The ML lifecycle spans eight phases: problem framing, data collection and validation, feature engineering, training and experimentation, offline evaluation, deployment, production monitoring, and retirement or retraining. Each phase has distinct owners, artefacts, and failure modes that an MLOps practice must systematise.
Hallucinations occur because an LLM is trained to produce plausible next tokens, not verified facts — it has no internal truth-checking mechanism, only statistical patterns. Common causes include rare or conflicting training data, overconfident decoding, and prompts that lead the model to extrapolate beyond what it learned. Mitigation strategies include retrieval-augmented generation, grounding responses to retrieved sources, lowering temperature, and calibrated refusal training.
Tool calling extends the LLM's output space to include structured function invocations. The model emits a JSON object naming a tool and its arguments; the runtime executes the tool and feeds the result back as a new message. An agent is a loop that repeats this cycle — observe, think, act — until the task is complete or a stopping condition is met.