How does LLMOps differ from classical MLOps, and what new operational challenges do LLMs introduce?
LLMOps extends classical MLOps to handle foundation model scale, prompt-based configuration, non-deterministic outputs, and evaluation without a scalar ground truth. Key new concerns include prompt versioning, output quality evaluation via LLM judges or human review, hallucination monitoring, cost management, and RAG pipeline observability.
How to think about it
LLMOps does not replace MLOps — it layers on top of it. Every classical concern (data versioning, deployment gates, latency SLAs, cost tracking) still applies. But LLMs introduce failure modes that classical pipelines were never designed to detect.
What carries over from classical MLOps
- Experiment tracking (prompt versions are hyperparameters; log them in W&B or MLflow).
- Deployment patterns: canary, blue/green, rollback.
- Latency and throughput monitoring; token throughput replaces inference-time prediction count.
- Cost monitoring: token cost per request is analogous to compute cost per batch job.
- CI/CD pipelines for model artefacts, now extended to prompt templates and RAG indexes.
What is genuinely new in LLMOps
Prompt engineering as configuration. Prompts are not code, not model weights — they are a third artefact class. They must be versioned, tested in a prompt registry, and deployed with the same discipline as a model binary. A prompt change that is not tracked is an invisible model change.
Non-scalar evaluation. A regression model has MSE; an LLM-generated answer has no obvious scalar ground truth. Evaluation uses: LLM-as-a-judge (route outputs to a stronger model for automated quality scoring), RAGAS or similar RAG evaluation frameworks, human preference labelling, and task-specific pass/fail checks (does the output contain a required entity? is the JSON schema valid?).
Hallucination and faithfulness monitoring. Classical models predict within a fixed output space. LLMs generate free text that can be confidently wrong. Monitoring requires factual grounding checks — comparing generated claims against retrieved context or a knowledge base.
RAG pipeline observability. Retrieval-augmented generation splits the failure surface into retrieval quality (did we fetch the right chunks?) and generation quality (did the model use them faithfully?). Both need independent metrics: retrieval recall at k, context precision, and answer faithfulness.
Prompt injection and safety monitoring. Adversarial inputs can hijack LLM behaviour in ways that have no analogue in a tabular model. Production LLM systems require input and output guardrails (topic classifiers, PII detectors, toxicity filters) as part of the serving pipeline.
Fine-tuning vs. prompting decisions. Classical MLOps always trains a model on task data. LLMOps must first determine whether fine-tuning is necessary or whether a well-engineered prompt with RAG achieves the target quality at lower cost and maintenance burden.
Operational tooling differences
| Concern | Classical MLOps | LLMOps addition |
|---|---|---|
| Model artefact | Weights file in registry | Foundation model API endpoint or adapter weights |
| Hyperparameter | Learning rate, depth | Prompt template, temperature, top-p, system message |
| Evaluation metric | AUC, RMSE, F1 | LLM judge score, faithfulness, BLEU/ROUGE (weak) |
| Drift detection | PSI on input features | Semantic drift on query embeddings, output topic shift |
| Cost unit | GPU-hours | Input/output tokens, embedding calls |