How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?

LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.

Walk me through the full ML lifecycle from problem definition to model retirement.

The ML lifecycle spans eight phases: problem framing, data collection and validation, feature engineering, training and experimentation, offline evaluation, deployment, production monitoring, and retirement or retraining. Each phase has distinct owners, artefacts, and failure modes that an MLOps practice must systematise.

What is MLSecOps, and what are the main threats across the ML lifecycle?

MLSecOps extends security practices across the whole ML lifecycle rather than just the deployed app, covering data, training, the model artifact, and serving. Key threats include data and model poisoning, adversarial evasion inputs, model theft or extraction, membership-inference and privacy leakage, and supply-chain risks like malicious model files and dependencies. Defenses span provenance and validation, robustness testing, access control and signing of artifacts, input monitoring, and scanning, integrated into the MLOps pipeline.

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing cheaper models for easy requests, since model choices drive every downstream cost. Then optimize the runtime with batching, caching, and techniques like prompt caching for LLMs, and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

LLMOps: operating LLMs in production

The last lesson’s whole playbook rested on one assumption: that you can roll back to a versioned, deterministic artifact you own. We ended by pulling that assumption out from under you: what if the thing in production is an LLM, with no weights file to pin, whose failures are hallucinations and prompt injections rather than recall drops, and which a vendor can change underneath you overnight? We asked how versioning, evaluation, monitoring, rollback, and cost all have to change. This lesson is the answer, and it opens with the smallest possible version of the problem.

On a Friday afternoon someone improved the support bot. They added one sentence to the prompt — “be warm and friendly” — shipped it, and went home. By Monday the downstream JSON parser was failing on about 8% of requests. No code changed. No model changed. The thing that changed was a string in a prompt file with no version, no test, and no eval gate: the friendlier model had started chatting before its JSON, and the parser choked on the prose. Nobody could even say which version of the prompt was live.

LLMOps (LLM Operations — the practice of keeping LLM-powered systems correct, cheap, and fast in production) is what closes that gap. It’s MLOps, re-derived for a world where you usually didn’t train the model and the output is open-ended text.

What carries over, and what breaks

The MLOps loop you already know — data → train → eval → deploy → monitor → retrain — still rhymes. But three of its load-bearing assumptions break the moment the model is an LLM:

You usually didn’t train the model. The weights belong to OpenAI, Anthropic, or Google, or they’re open-weights you downloaded. Your “training” is prompt design, retrieval, tool wiring, and maybe a light fine-tune. So the artifact you version isn’t a .joblib of weights — it’s the prompt (plus the model id, the retrieval config, and the tool definitions).
The output is open-ended and non-deterministic. “Accuracy = 0.91” doesn’t exist for “write a helpful, grounded answer.” The same input can give two different outputs. Eval stops being one number and becomes a graded rubric over a fixed set of examples.
The model can change without you. A provider deprecates a snapshot and silently routes you to a newer one; behaviour shifts; the prompt that worked last week now doesn’t. Pinning a dated model version (gpt-...-2026-04) buys you time, not immunity — pins get retired.

In classic ML you version one file. In an LLM app the “model” is a system of parts — and the prompt is the one that changes most.

The prompt is the artifact — version it

The single biggest LLMOps habit: treat prompts like code. They live in the repo, change through pull requests, carry a version (a hash or a number), and are tied to the eval run that approved them. The Friday story happens because a prompt got edited in a vendor playground and pasted into production — invisible to version control, untested, unattributable.

When something regresses, the first question is always “what changed?” If the answer “the prompt went from v6 to v7” is a git log away, you can roll back in seconds. If the prompt lives in a textbox in someone’s browser, you can’t.
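One way to make "the prompt carries a version" concrete is to hash everything that defines the deployed behaviour together. The sketch below is an illustration, not a prescribed tool: `PromptBundle`, its field names, and the model id string are all hypothetical, and a real setup might use a git commit hash or a prompt registry instead.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field


@dataclass(frozen=True)
class PromptBundle:
    """Hypothetical container: the parts that together act as 'the model'."""
    system_prompt: str
    model_id: str                       # placeholder id, not a real snapshot name
    retrieval_config: dict = field(default_factory=dict)
    tools: tuple = ()

    def version(self) -> str:
        # Canonical JSON (sorted keys) so identical content always hashes the same.
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


v6 = PromptBundle(
    system_prompt="Answer in JSON only.",
    model_id="example-model-pinned",
)
v7 = PromptBundle(
    system_prompt="Answer in JSON only. Be warm and friendly.",
    model_id="example-model-pinned",
)

# One added sentence gives a different version id, so logs and eval runs
# can say exactly which bundle they refer to.
print("v6:", v6.version())
print("v7:", v7.version())
```

Hashing the whole bundle rather than the prompt text alone means a change to the model id or retrieval settings also shows up as a new version.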

Eval: a golden set plus a judge

You cannot ship a prompt change on vibes — “it looks better in my three test chats” is how the Friday regression shipped. The replacement for a single accuracy number is a small eval suite:

A golden set — a fixed list of representative inputs, each with what a good answer must satisfy (a reference answer, or a checklist/rubric).
Deterministic checks — cheap, exact, and the first gate: does the output parse as JSON? does it contain the required policy line? is it under the token budget? These catch the structural breaks (like the Friday one).
A semantic check — for “is this answer actually good?”, use an LLM-as-judge: a second model scores the answer against the rubric (“grounded in the provided context? 1–5”). It’s noisy, so you average over the set and watch the trend, not a single score. (Full treatment in RAG evaluations.)
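The "average over the set and watch the trend" step can be sketched as follows. This is a minimal illustration under loud assumptions: `stub_judge` is a stand-in that scores by word overlap so the example runs offline, where a real judge would be a call to a second model with the rubric in its prompt; the case fields and the `tolerance` value are also hypothetical.

```python
from statistics import mean


def _words(text):
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]


def stub_judge(answer: str, context: str) -> int:
    """Stand-in for an LLM judge: 1-5 by share of answer words found in the context."""
    words = _words(answer)
    if not words:
        return 1
    ctx = set(_words(context))
    overlap = sum(w in ctx for w in words) / len(words)
    return 1 + round(overlap * 4)


def judge_suite(cases, answer_fn, judge=stub_judge):
    """Mean judge score over the whole golden set, not any single case."""
    return mean(judge(answer_fn(c["q"], c["context"]), c["context"]) for c in cases)


def regressed(candidate, baseline, tolerance=0.3):
    """Flag a drop larger than the noise you are willing to tolerate."""
    return candidate < baseline - tolerance


CASES = [
    {"q": "refund window?", "context": "Refunds are accepted within 30 days."},
    {"q": "shipping time?", "context": "Orders ship within 2 business days."},
]


def grounded_answer(q, context):
    return context                      # repeats the provided context verbatim


def ungrounded_answer(q, context):
    return "Probably whenever you like"  # ignores the context entirely


baseline = judge_suite(CASES, grounded_answer)
candidate = judge_suite(CASES, ungrounded_answer)
print(baseline, candidate, regressed(candidate, baseline))
```

Comparing the candidate mean against a stored baseline, with a tolerance, is one simple way to keep a noisy judge from blocking deploys on single-case wobble.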

You gate the deploy on the suite, exactly like a test suite gates a code merge. Here’s the structural gate alone catching the Friday regression — run it:

The eval gate: cheap structural checks first, then a semantic judge, scored over a fixed golden set — gate the deploy on the trend.

import json

# A tiny golden set: each input + the keys a valid answer must contain.
GOLDEN = [
    {"q": "refund window?", "must_have": ["answer", "policy"]},
    {"q": "is item in stock?", "must_have": ["answer", "policy"]},
    {"q": "how do I reset my password?", "must_have": ["answer", "policy"]},
]

# v1 always returns clean JSON.
def prompt_v1(q):
    return json.dumps({"answer": f"Help with: {q}", "policy": "30-day"})

# v2 is the "be warm and friendly" tweak. On longer asks the model now
# chats before the JSON — exactly the Friday regression.
def prompt_v2(q):
    body = json.dumps({"answer": f"Sure! Help with: {q}", "policy": "30-day"})
    return body if len(q) < 20 else "Happy to help! " + body

# The structural gate: does it parse to a JSON object with the required keys?
def passes(answer_text, must_have):
    try:
        obj = json.loads(answer_text)
    except json.JSONDecodeError:
        return 0          # downstream parser would break here
    if not isinstance(obj, dict):
        return 0          # valid JSON (e.g. null or a list), but not the expected object
    return 1 if all(k in obj for k in must_have) else 0

def evaluate(prompt_fn, name):
    score = sum(passes(prompt_fn(c["q"]), c["must_have"]) for c in GOLDEN)
    print(f"{name}: {score}/{len(GOLDEN)} pass the JSON-validity gate")

evaluate(prompt_v1, "prompt v1          ")
evaluate(prompt_v2, "prompt v2 (friendly)")
print()
print("Same code, same model. The eval gate is what catches v2 before it ships.")

prompt v1          : 3/3 pass the JSON-validity gate
prompt v2 (friendly): 2/3 pass the JSON-validity gate

Same code, same model. The eval gate is what catches v2 before it ships.

There it is. v1 passes 3/3. The friendlier v2 drops to 2/3 — the first two questions are under 20 characters so they slip through clean, but “how do I reset my password?” is 27 characters, trips the prose-before-JSON branch, and the parser rejects Happy to help! {...}. No model changed, no code changed; a single sentence in a prompt silently broke 1-in-3 answers. And notice what caught it: not a human reviewer’s intuition, but a 30-line deterministic gate that, run in CI, would have turned the whole Friday incident into a red check on the pull request before anyone went home.
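To turn a gate like this into a red check, the CI job needs a non-zero exit code when the pass rate falls below a threshold. A minimal sketch, assuming the candidate prompt's outputs have already been collected as strings; `json_gate` and `min_pass_rate` are hypothetical names, not part of any CI system's API.

```python
import json


def json_gate(outputs, required_keys, min_pass_rate=1.0):
    """Return (pass_rate, exit_code); exit code 1 is what fails the CI job."""
    passed = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue                    # prose before the JSON lands here
        if isinstance(obj, dict) and all(k in obj for k in required_keys):
            passed += 1
    rate = passed / len(outputs) if outputs else 0.0
    return rate, (0 if rate >= min_pass_rate else 1)


clean = '{"answer": "Help with: refund window?", "policy": "30-day"}'
chatty = 'Happy to help! {"answer": "Sure!", "policy": "30-day"}'

rate, code = json_gate([clean, chatty], ["answer", "policy"])
print(f"pass rate {rate:.2f}, exit code {code}")
```

In a pipeline script the last step would pass `code` to `sys.exit`, so a single failing case blocks the merge when `min_pass_rate` is 1.0.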

Observe what you can’t reproduce

Offline evals catch what you can foresee. Production catches the rest — and because LLM output is non-deterministic and the provider’s model can shift, observability is not optional. Log, for every call:

the prompt version and model id that served it,
tokens in / out and the cost of the call,
latency (and whether it streamed),
a sampled trace of the actual input and output.
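The four fields above map naturally onto one structured log record per call. The sketch below is an assumption-heavy illustration: `CallRecord`, its field names, and the prices in `PRICE_PER_1K` are hypothetical (real per-token prices depend on the provider and model).

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical prices in USD per 1,000 tokens, for illustration only.
PRICE_PER_1K = {"input": 0.002, "output": 0.006}


@dataclass
class CallRecord:
    prompt_version: str
    model_id: str
    tokens_in: int
    tokens_out: int
    latency_ms: float
    streamed: bool
    sampled_input: Optional[str] = None    # set only for the sampled fraction of calls
    sampled_output: Optional[str] = None

    @property
    def cost_usd(self) -> float:
        return (self.tokens_in * PRICE_PER_1K["input"]
                + self.tokens_out * PRICE_PER_1K["output"]) / 1000


def to_log_line(rec: CallRecord) -> str:
    """One JSON object per line, easy to ship to any log pipeline."""
    row = asdict(rec)
    row["cost_usd"] = round(rec.cost_usd, 6)
    return json.dumps(row, sort_keys=True)


rec = CallRecord(
    prompt_version="v7",
    model_id="example-model-pinned",
    tokens_in=1000,
    tokens_out=500,
    latency_ms=840.0,
    streamed=True,
)
print(to_log_line(rec))
```

Keeping the prompt version and model id on every record is what later lets you answer "what changed?" from the logs alone.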

Two dashboards earn their keep immediately: cost per request (LLM bills are per-token, so a prompt that doubles in length roughly doubles the input-token share of the bill; see Cost & latency engineering) and latency p50/p95. A scheduled re-run of the golden set against the live prompt and model is your drift detector: if groundedness quietly drops while your prompt version is unchanged, the provider probably changed the model under you.
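Both signals reduce to small computations over logged numbers. A minimal sketch, assuming latencies and golden-set scores have already been collected; the function names and the `max_drop` threshold are hypothetical and would need tuning against your own noise level.

```python
from statistics import mean, quantiles


def latency_percentiles(latencies_ms):
    """p50 and p95 from logged latencies (needs at least two samples)."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


def drift_alert(baseline_scores, recent_scores, max_drop=0.3):
    """True when the recent golden-set mean falls more than max_drop below baseline."""
    if not baseline_scores or not recent_scores:
        raise ValueError("need at least one score on each side")
    return mean(baseline_scores) - mean(recent_scores) > max_drop


# Scores from scheduled golden-set runs (judge scale 1-5), same prompt version throughout.
approved_run = [4.4, 4.5, 4.3]
last_week = [4.3, 4.4, 4.2]
this_week = [3.8, 3.9, 3.7]

print("last week drifted:", drift_alert(approved_run, last_week))
print("this week drifted:", drift_alert(approved_run, this_week))
print(latency_percentiles([float(ms) for ms in range(1, 101)]))
```

Because the prompt version is held fixed across the scheduled runs, a drop like the one in `this_week` points at something outside your repo, such as the provider's model.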

The gateway — one choke point for every call

Don’t let 40 services each call the model API their own way. Route every LLM call through a thin gateway (a proxy in front of the providers). One place to enforce the things you’ve learned elsewhere in this curriculum:

Caching — return the stored answer for a repeated or semantically identical prompt (Caching: exact, semantic & prompt).
Rate limits + budget caps — stop a runaway loop from becoming a five-figure bill (Rate limiting & denial-of-wallet).
Retries + fallback model — when the primary provider 503s, fail over instead of failing (Circuit breakers & resilience).
Logging — every call already passes through here, so this is where the observability above gets captured, for free.
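The four responsibilities above fit in a surprisingly small class. The sketch below is a toy, not a production gateway: `Gateway`, `BudgetExceeded`, and the provider callables are hypothetical, providers are assumed to return a `(text, cost_usd)` pair and raise `ConnectionError` on failure, the cache is exact-match only, and retry backoff is omitted.

```python
class BudgetExceeded(Exception):
    """Raised when the spend cap has been reached."""


class Gateway:
    def __init__(self, providers, budget_usd, max_retries=2):
        self.providers = providers      # ordered list of (name, callable): primary first
        self.budget_usd = budget_usd
        self.max_retries = max_retries
        self.spent_usd = 0.0
        self.cache = {}                 # exact-match cache keyed by (version, prompt)
        self.log = []

    def _record(self, prompt_version, served_by, cost_usd):
        self.log.append({"prompt_version": prompt_version,
                         "served_by": served_by,
                         "cost_usd": cost_usd})

    def complete(self, prompt_version, prompt):
        key = (prompt_version, prompt)
        if key in self.cache:           # caching: repeated prompt costs nothing
            self._record(prompt_version, "cache", 0.0)
            return self.cache[key]
        if self.spent_usd >= self.budget_usd:   # budget cap: stop runaway loops
            raise BudgetExceeded(f"spent {self.spent_usd:.2f} USD")
        last_error = None
        for name, call in self.providers:       # fallback: try providers in order
            for _ in range(self.max_retries):   # retries (no backoff in this sketch)
                try:
                    text, cost = call(prompt)
                except ConnectionError as exc:
                    last_error = exc
                    continue
                self.spent_usd += cost
                self.cache[key] = text
                self._record(prompt_version, name, cost)
                return text
        raise RuntimeError("all providers failed") from last_error


def flaky_primary(prompt):
    raise ConnectionError("503 from primary")   # simulated outage


def steady_fallback(prompt):
    return f"echo: {prompt}", 0.01              # simulated answer and cost


gw = Gateway([("primary", flaky_primary), ("fallback", steady_fallback)], budget_usd=0.015)
print(gw.complete("v7", "refund window?"))      # served by the fallback
print(gw.complete("v7", "refund window?"))      # served from the cache
print(gw.log)
```

Because every call goes through `complete`, the log picks up the prompt version, the provider that actually served the request, and the cost without any service having to remember to record them.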

In one breath

LLMOps is MLOps re-derived for a world where you usually didn’t train the model: the artifact you version moves from a weights file to the prompt (plus model id, retrieval config, and tools, shipped together), the eval moves from one accuracy number to a graded rubric over a golden set (cheap deterministic checks first, then an LLM-as-judge, watched as a trend), and a brand-new failure mode — the provider changing the model underneath you — makes online observability and a single gateway choke point (for caching, rate limits, retries, and logging) load-bearing rather than nice-to-have.

Practice

Before the quiz, reason about the failure mode classic MLOps never had. Your team ships nothing for a month, yet one Tuesday the answers quietly get worse. Walk through what happened and which single LLMOps practice would surface it — and why pinning a dated model version (gpt-...-2026-04) is “time, not immunity.” Then the cheap-gate insight from the demo: the friendly prompt broke only the long question. What does that tell you about why “it looked better in my three test chats” is exactly how the Friday regression shipped?

Quick check

Q1. In a typical LLM app (calling a hosted model), what is the primary artifact you version?

Q2. Why can't you gate an LLM deploy on a single accuracy number the way you would a classifier?

Q3. Your provider deprecated the model snapshot you were pinned to and routed you to a newer one. Outputs subtly changed, though your team shipped nothing. Which LLMOps practice catches this first?

A question to carry forward

That closes the Serving & Monitoring chapter — and with it, everything about getting a model, classical or LLM, to behave in production. But step back and notice the thing every lesson in it quietly took for granted. The prompt gateway, the vector store, the eval runner, the GPU the model serves from, the bucket the golden set lives in — all of it runs somewhere. We have spent two chapters talking about software that has to execute on actual computers, and we have never once asked whose computers, or what they cost, or how you rent them.

So the question to carry forward, into the final chapter, is the ground the whole stack stands on: where does all of this actually run? Almost certainly not in your office — it runs in the cloud, on rented machines billed by the second, across three providers with three hundred service names each. The next lesson is the map that keeps you from drowning in that menu: the cloud — AWS, Azure, and GCP — and it opens the Platform & Infrastructure chapter.
