MLOps Hard Asked at OpenAIAsked at AnthropicAsked at CohereAsked at DatabricksAsked at Microsoft

How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?

For MLOps Engineer AI / LLM Engineer ML Engineer

The short answer

LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.

How to think about it

LLMOps does not replace MLOps — it layers on top of it. Every classical concern (data versioning, deployment gates, latency SLAs, cost tracking) still applies. But LLMs introduce failure modes that classical pipelines were never designed to detect.

What carries over from classical MLOps

Experiment tracking (prompt versions are hyperparameters; log them in W&B or MLflow).
Deployment patterns: canary, blue/green, rollback.
Latency and throughput monitoring; token throughput replaces inference-time prediction count.
Cost monitoring: token cost per request is analogous to compute cost per batch job.
CI/CD pipelines for model artefacts, now extended to prompt templates and RAG indexes.

What is genuinely new in LLMOps

Prompt engineering as configuration. Prompts are not code, not model weights — they are a third artefact class. They must be versioned, tested in a prompt registry, and deployed with the same discipline as a model binary. A prompt change that is not tracked is an invisible model change.

Non-scalar evaluation. A regression model has MSE; an LLM-generated answer has no obvious scalar ground truth. Evaluation uses: LLM-as-a-judge (route outputs to a stronger model for automated quality scoring), RAGAS or similar RAG evaluation frameworks, human preference labelling, and task-specific pass/fail checks (does the output contain a required entity? is the JSON schema valid?).

Hallucination and faithfulness monitoring. Classical models predict within a fixed output space. LLMs generate free text that can be confidently wrong. Monitoring requires factual grounding checks — comparing generated claims against retrieved context or a knowledge base.

RAG pipeline observability. Retrieval-augmented generation splits the failure surface into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them faithfully?). Both need independent metrics: retrieval recall at k, context precision, and answer faithfulness.

Prompt injection and safety monitoring. Adversarial inputs can hijack LLM behaviour in ways that have no analogue in a tabular model. Production LLM systems require input and output guardrails (topic classifiers, PII detectors, toxicity filters) as part of the serving pipeline.

Fine-tuning vs. prompting decisions. Classical MLOps always trains a model on task data. LLMOps must first determine whether fine-tuning is necessary or whether a well-engineered prompt with RAG achieves the target quality at lower cost and maintenance burden.

Operational tooling differences

Concern	Classical MLOps	LLMOps addition
Model artefact	Weights file in registry	Foundation model API endpoint or adapter weights
Hyperparameter	Learning rate, depth	Prompt template, temperature, top-p, system message
Evaluation metric	AUC, RMSE, F1	LLM judge score, faithfulness, BLEU/ROUGE (weak)
Drift detection	PSI on input features	Semantic drift on query embeddings, output topic shift
Cost unit	GPU-hours	Input/output tokens, embedding calls

How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?

What carries over from classical MLOps

What is genuinely new in LLMOps

Operational tooling differences

Keep practising

Explore further