MLOps vs LLMOps — the 60 / 40 seam

The MLOps / LLMOps seam — shared foundations and genuinely new concerns
LLMOps is not a separate discipline that replaces what came before. It is classical MLOps extended by a set of genuinely new concerns layered on top of a foundation that remains largely intact. Microsoft's Azure architecture guidance frames this precisely: 'To operationalize generative AI workload features, you need to extend your MLOps investments with generative AI operations — not replace them.' [1] Understanding where the seam falls matters practically: teams that re-invest in work that is already solved waste engineering capacity; teams that assume the classical stack covers everything miss the genuinely new surface.
A useful working heuristic, not a measured ratio, is that roughly 60 % of the operational discipline is shared and roughly 40 % is structurally new. The 40 % is new in a deep sense: it requires its own registry semantics, its own evaluation primitives, and its own serving runtimes. It is not a configuration change to the classical stack. This article maps the seam, names the three capabilities most commonly misidentified as new, and points to the forward articles that develop each new concern in depth.
What carries over from classical MLOps
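Six disciplines transfer intact: the five described in this section, plus deployment patterns, which are covered under 'What is wrongly assumed new' below. They require wider artefact schemas but no architectural invention.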
Model registry and lifecycle state
LLMs and their adapter layers — LoRA and QLoRA fine-tunes — are model artefacts. They version, promote through lifecycle stages (staging → shadow → canary → production), and require lineage back to training data and evaluation results. The registry surface for an LLM is larger than for a classical model (base weights + fine-tune delta + tokeniser configuration + inference configuration), but the registry pattern is identical. Sculley et al.'s foundational work on hidden technical debt in ML systems identifies the model registry as a prerequisite for managing entanglement and undeclared consumers — concerns that apply equally when the model is a 70 B-parameter transformer. [2]
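As an illustration of the wider registry surface, here is a minimal sketch of what one registry entry might hold. The class name, field names, and URI strings are hypothetical and not taken from any particular registry product; only the stage names come from the lifecycle described above.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class Stage(Enum):
    STAGING = "staging"
    SHADOW = "shadow"
    CANARY = "canary"
    PRODUCTION = "production"


# Promotion order follows the lifecycle stages named in the text.
PROMOTION_ORDER = [Stage.STAGING, Stage.SHADOW, Stage.CANARY, Stage.PRODUCTION]


@dataclass(frozen=True)
class LLMRegistryEntry:
    """One versioned LLM artefact plus the lineage needed to audit it."""
    name: str
    version: int
    base_weights_uri: str            # hypothetical storage location
    adapter_uri: Optional[str]       # LoRA/QLoRA delta, None if not fine-tuned
    tokenizer_config_hash: str
    inference_config: dict           # e.g. temperature, max tokens
    training_data_ref: str           # lineage: dataset version used for the fine-tune
    eval_run_ref: str                # lineage: evaluation results for this version
    stage: Stage = Stage.STAGING


def promote(entry: LLMRegistryEntry) -> LLMRegistryEntry:
    """Return a copy advanced by exactly one stage; stages cannot be skipped."""
    position = PROMOTION_ORDER.index(entry.stage)
    if position == len(PROMOTION_ORDER) - 1:
        raise ValueError(f"{entry.name} v{entry.version} is already in production")
    return replace(entry, stage=PROMOTION_ORDER[position + 1])
```

The entry is immutable, so a promotion produces a new record and leaves the old one available for audit. That is the same pattern a classical registry follows, applied to a wider schema.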
CI/CD pipeline and deployment gates
The core principle — no model artefact reaches production without passing an automated gate — carries unchanged. Breck et al.'s ML Test Score rubric codifies 28 specific tests and monitoring requirements as prerequisites for production readiness; the rubric authors are explicit that the practice of defining pre-production gates does not depend on model type. [3] What changes in LLM systems is the gate content: classical gates check accuracy metrics on a held-out set; LLMOps gates check capability benchmarks, safety evaluations, and latency/cost budgets. The CI primitives — pipeline, pass/fail, artefact promotion — are the same.
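A gate of this kind can be sketched in a few lines. The metric names and threshold fields below are assumptions chosen for illustration; a real pipeline would take them from its own evaluation harness and budgets.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class GateThresholds:
    min_capability_score: float       # benchmark score in [0, 1]
    max_safety_violation_rate: float  # fraction of safety prompts that fail
    max_p95_latency_ms: float
    max_cost_per_request_usd: float


def evaluate_gate(metrics: Dict[str, float], limits: GateThresholds) -> Tuple[bool, List[str]]:
    """Return (passed, reasons); the artefact is promoted only if passed is True."""
    failures = []
    if metrics["capability_score"] < limits.min_capability_score:
        failures.append("capability score below minimum")
    if metrics["safety_violation_rate"] > limits.max_safety_violation_rate:
        failures.append("safety violation rate above maximum")
    if metrics["p95_latency_ms"] > limits.max_p95_latency_ms:
        failures.append("p95 latency over budget")
    if metrics["cost_per_request_usd"] > limits.max_cost_per_request_usd:
        failures.append("cost per request over budget")
    return (not failures, failures)
```

The pass/fail primitive is identical to a classical accuracy gate. Only the contents of `metrics` and `GateThresholds` are specific to LLM systems.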
Observability spine
Latency, error rate, cost per request, and throughput are monitored identically regardless of whether the model is a gradient-boosted tree or a large language model. The infrastructure observability layer — metrics, structured logging, distributed tracing — is unchanged. What LLMOps adds is a semantic layer on top of that spine (discussed below), not a replacement of it.
Experiment tracking
Prompt engineering and fine-tuning are experimental processes. Tracking which prompt template or fine-tuning configuration produced which evaluation result is structurally the same problem as tracking which hyperparameter configuration produced the best AUROC. Experiment tracking tools (e.g. MLflow, alongside alternatives such as Weights & Biases or Neptune) handle this with schema extensions, not architectural changes.
Data versioning and lineage
Fine-tune training data requires the same versioning and lineage treatment as any other training dataset. For retrieval-augmented generation (RAG) systems, the document corpus is a first-class versioned artefact — analogous to a feature store snapshot. The tooling pattern is the same; the artefact is different.
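One simple way to treat a RAG corpus as a versioned artefact is to derive a snapshot identifier from its contents. The sketch below assumes the corpus fits in a dictionary of document ID to text, which is a simplification; the function name and the 16-character truncation are illustrative choices.

```python
import hashlib
from typing import Dict


def corpus_snapshot_id(documents: Dict[str, str]) -> str:
    """Derive a deterministic snapshot ID from document IDs and contents."""
    digest = hashlib.sha256()
    # Sort by ID so the result does not depend on insertion order.
    for doc_id in sorted(documents):
        body_hash = hashlib.sha256(documents[doc_id].encode("utf-8")).digest()
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\x00")  # separator between the ID and the content hash
        digest.update(body_hash)
    return digest.hexdigest()[:16]
```

Logging this identifier with every evaluation run and every production trace gives the same lineage link that a feature store snapshot ID provides in a classical pipeline.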
What is genuinely new
Five concerns are structurally new — not present in classical MLOps and not addressable by extending its existing components. Each has a forward article that develops it in depth.
Prompt and tool versioning
In a classical ML system, the model artefact encodes the learned behaviour. In an LLM system, a substantial fraction of system behaviour lives in the prompt template — which may include a system prompt, few-shot examples, output format instructions, and tool definitions. These are code: they require the same versioning, review, rollback, and alias-resolution semantics as code artefacts. A change to a system prompt that redirects tool-calling behaviour is a deployment; it must be tracked, gated, and reversible. There is no classical MLOps equivalent for this. See the prompt and tool versioning article in this series for the registry pattern.
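The versioning, alias-resolution, and rollback semantics can be sketched with an in-memory registry. This is a hypothetical illustration under simplifying assumptions (no persistence, no review workflow), not the registry pattern from the forward article.

```python
from typing import Dict, List, Tuple


class PromptRegistry:
    """In-memory sketch: immutable versions plus movable aliases."""

    def __init__(self) -> None:
        # name -> list of templates; version N is stored at index N - 1
        self._versions: Dict[str, List[str]] = {}
        # (name, alias) -> history of versions; the last item is current
        self._aliases: Dict[Tuple[str, str], List[int]] = {}

    def register(self, name: str, template: str) -> int:
        """Store a new immutable version and return its version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        return len(versions)

    def set_alias(self, name: str, alias: str, version: int) -> None:
        """Point an alias such as 'production' at an existing version."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self._aliases.setdefault((name, alias), []).append(version)

    def resolve(self, name: str, alias: str) -> str:
        """Return the template the alias currently points at."""
        version = self._aliases[(name, alias)][-1]
        return self._versions[name][version - 1]

    def rollback(self, name: str, alias: str) -> int:
        """Move the alias back to its previous target and return that version."""
        history = self._aliases[(name, alias)]
        if len(history) < 2:
            raise ValueError("no earlier version to roll back to")
        history.pop()
        return history[-1]
```

Application code resolves the alias rather than hard-coding a version, so moving or rolling back the alias is the deployment event that gets tracked and gated.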
Output evaluation as a discipline
Classical ML evaluation uses deterministic metrics against a held-out test set: accuracy, F1, AUROC. For LLM systems, the output is open-ended natural language and single-metric evaluation misses most of what matters. Liang et al.'s Holistic Evaluation of Language Models (HELM) demonstrates this concretely: the benchmark covers 42 scenarios, 16 of which are designated core scenarios and evaluated on all 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency). The results show that a model that leads on one metric can trail on others, a trade-off that a single benchmark score does not capture. [4]
The current standard practice is LLM-as-judge: using a separate language model to score the outputs of the model under test. Zheng et al. showed that strong LLM judges such as GPT-4 reach over 80 % agreement with human preferences on open-ended questions, comparable to the agreement between human raters. They also identified inherent limitations, including position bias and verbosity bias, that require controlled experimental design to manage. [5] The evaluation is therefore itself probabilistic, which creates new requirements for repeatability and confidence-interval tracking. See the eval-as-test-suite article in this series for the CI integration pattern.
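Two of these requirements lend themselves to a short sketch: controlling for position bias by judging each pair in both orders, and reporting a confidence interval rather than a bare pass rate. The `judge` callable is a hypothetical stand-in for a real judge-model call, and the Wilson score interval is one reasonable choice of interval, not something the cited paper prescribes.

```python
import math
from typing import Callable, Tuple

# Hypothetical judge signature: (question, first_answer, second_answer)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge in both orders; count a win only if both orders agree."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "tie"  # inconsistent or tied verdicts are treated as a tie


def wilson_interval(passes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95 %)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width
```

A judge that always prefers whichever answer it sees first yields "tie" under `debiased_verdict`, which keeps that bias out of the win count. For 80 passes out of 100, `wilson_interval` gives roughly 0.71 to 0.87, a reminder that a gate threshold of 0.8 on 100 samples is far less certain than the headline number suggests.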
RAG pipeline observability
A retrieval-augmented generation system can fail in three distinct ways that standard application performance monitoring cannot attribute. Retrieval failure: the retriever returned irrelevant chunks. Context integration failure: the generator ignored or misused the chunks it received. Corpus freshness failure: the source documents were stale. Each failure mode requires a different fix — retriever tuning, prompt revision, or indexing schedule change respectively. The observability surface that distinguishes these modes does not exist in classical ML serving and is not an extension of the standard APM stack. See the RAG observability article in this series.
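The attribution logic can be sketched as a small decision function over per-request trace fields. The field names, the scoring scale, and the default thresholds are all assumptions for illustration; how the relevance and groundedness scores are produced is out of scope here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RagTrace:
    retrieval_relevance: float         # 0..1, how relevant the retrieved chunks were
    groundedness: float                # 0..1, how well the answer is supported by the chunks
    oldest_chunk_indexed_at: datetime  # index time of the stalest chunk used
    answered_at: datetime


def attribute_failure(
    trace: RagTrace,
    min_relevance: float = 0.5,
    min_groundedness: float = 0.5,
    max_chunk_age: timedelta = timedelta(days=7),
) -> Optional[str]:
    """Return the suggested fix for the first failing stage, or None if all pass."""
    if trace.retrieval_relevance < min_relevance:
        return "retriever tuning"            # retrieval failure
    if trace.answered_at - trace.oldest_chunk_indexed_at > max_chunk_age:
        return "indexing schedule change"    # corpus freshness failure
    if trace.groundedness < min_groundedness:
        return "prompt revision"             # context integration failure
    return None
```

Retrieval is checked first because a groundedness score means little when the chunks themselves were irrelevant. The ordering is a design choice of this sketch and could reasonably differ.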
KV-cache management and LLM serving runtime
LLM inference is autoregressive: the model generates one token at a time, attending over the full context on every step. The key-value (KV) cache, the materialised attention state for all prior tokens, is the dominant dynamic consumer of GPU memory during serving: it grows with every generated token and every concurrent request. Kwon et al.'s PagedAttention paper (SOSP 2023) quantifies the problem: existing LLM serving systems wasted roughly 60–80 % of the memory reserved for the KV cache through fragmentation and over-reservation, and vLLM, which manages the cache in pages, achieved a 2–4× throughput improvement over prior systems at comparable latency. [6] Purpose-built LLM serving runtimes (e.g. vLLM, Text Generation Inference, SGLang) exist specifically to address this concern; classical model serving frameworks have no equivalent mechanism. See the inference workloads article for the cluster-level implications of KV-cache sizing and serving runtime selection.
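A back-of-the-envelope calculation shows why the cache matters. The sketch below assumes standard multi-head attention with one key and one value vector of the full hidden size per layer per token; models that use grouped-query or multi-query attention need less. The example dimensions (40 layers, hidden size 5120, FP16) correspond to a 13B-class model and are used here purely for illustration.

```python
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """KV-cache footprint of one token: key + value vector in every layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


def kv_cache_bytes(
    num_layers: int,
    hidden_size: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,
) -> int:
    """Total KV-cache footprint for batch_size sequences of seq_len tokens each."""
    per_token = kv_cache_bytes_per_token(num_layers, hidden_size, bytes_per_value)
    return per_token * seq_len * batch_size


# Illustrative 13B-class dimensions in FP16 (2 bytes per value).
per_token = kv_cache_bytes_per_token(num_layers=40, hidden_size=5120)
per_request = kv_cache_bytes(num_layers=40, hidden_size=5120, seq_len=2048)
print(per_token)            # 819200 bytes, about 800 KiB per token
print(per_request / 2**30)  # 1.5625 GiB for one 2048-token sequence
```

At roughly 1.6 GiB per full-length request, a handful of concurrent requests consumes tens of gigabytes, which is why reserving the maximum length up front for every request wastes so much memory.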
A fifth concern — prompt injection security — sits at the boundary of security operations and LLMOps. OWASP ranks prompt injection as the number-one risk in its LLM Top 10 in both the 2023/24 and 2025 editions: LLMs process instructions and data in the same channel, which means a crafted input can override developer-defined instructions without triggering any classical input-validation guard. [7, 8] This concern has no direct equivalent in classical ML serving and requires its own mitigations — input sanitisation, output filtering, indirect injection detection.
What is wrongly assumed new
Three capabilities are routinely re-invented for LLM systems when the classical MLOps discipline already provides the answer. Re-investing in them is wasted capacity.
Deployment patterns
LLM deployment is containerised endpoint deployment. Blue-green, canary, and rollback patterns apply without modification. The observation that 'LLMs are different to deploy' almost always conflates model size — a logistical concern around image layers and GPU node scheduling — with the deployment pattern, which is conventional. Model size is an engineering constraint to manage; it does not change the pattern.
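To underline how conventional the pattern is, here is a minimal sketch of deterministic canary routing. Nothing in it is specific to LLMs; the function name and the use of a request ID as the routing key are illustrative assumptions.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Send a stable share of traffic to the canary, keyed on the request ID."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the ID into one of 100 buckets so the same ID always routes the same way.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"
```

Setting `canary_percent` to 0 is the rollback. Whether the canary backend is a new adapter, a new prompt alias, or a new serving runtime version makes no difference to the routing logic.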
The model registry
A common argument holds that because LLMs are large and expensive to retrain, the registry pattern does not apply — the base model is fixed, so there is nothing to version. This reasoning is backwards. Because LLMs are large and expensive to retrain, the registry is more important: adapter layers, tokeniser configurations, inference parameters, and evaluation results all require version control, and the cost of not knowing which version is in production is higher, not lower.
The need for pre-production evaluation gates
The argument that LLMs are too open-ended to evaluate rigorously is a rationalisation for skipping evaluation, not a structural property of LLMs. The ML Test Score rubric is explicit: systematic pre-production testing is a prerequisite for production readiness regardless of model type. [3] The specific metrics change when the output is natural language — accuracy gives way to benchmark scores, safety evaluations, and LLM-judge pass rates — but the practice of defining and enforcing a gate does not change.
The recommended posture for a platform team
Treat LLMOps as an extension of the MLOps capability surface. Audit the existing MLOps stack against the six shared disciplines above; do not rebuild what already works. Then invest deliberately in the five genuinely new areas. The first four are platform components:
- A prompt and tool registry with version control, alias resolution, and rollback — the versioning article in this series develops this pattern.
- An eval-suite harness with LLM-judge support, deterministic pass/fail gates, and CI integration — the eval-as-test-suite article in this series develops this.
- A RAG observability surface that scores retrieval and generation separately — the RAG observability article in this series develops this.
- An LLM serving runtime layer — e.g. vLLM, Text Generation Inference, SGLang — that sits inside the existing serving framework rather than parallel to it, with KV-cache and throughput observability fed back to the same metrics backend already in use.
Security posture for prompt injection sits alongside these four: deploy an AI gateway as the single ingress in front of all LLM calls, with prompt-injection filtering, output sanitisation, rate limits, and cost attribution integrated as first-class features. This is the agent runtime article's territory.
Shared vs new — side-by-side
The diagram below places each capability in its correct category. 'Shared' means the classical MLOps discipline carries over with schema or metric changes only. 'New' means a structural new concern requiring its own platform component.
flowchart LR
subgraph SHARED ["Shared — classical MLOps carries over"]
S1[Model registry + lifecycle state]
S2[CI/CD pipeline + deployment gates]
S3[Infrastructure observability spine]
S4[Experiment tracking]
S5[Data versioning + lineage]
S6[Blue-green / canary deployment]
end
subgraph NEW ["New — structurally new LLMOps concerns"]
N1[Prompt + tool versioning]
N2[Output evaluation — LLM-as-judge]
N3[RAG pipeline observability]
N4[KV-cache + LLM serving runtime]
N5[Prompt injection defence]
end
SHARED -->|"extend with wider\nartefact schemas"| NEW
References
[1] Microsoft Azure Architecture Center. Generative AI Operations for Organizations with MLOps Investments (GenAIOps for MLOps). Microsoft, 2024.
[2] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J-F., Dennison, D. Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015.
[3] Breck, E., Cai, S., Nielsen, E., Salib, M., Sculley, D. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. IEEE Big Data 2017, Google Research.
[4] Liang, P., Bommasani, R., Lee, T., et al. Holistic Evaluation of Language Models (HELM). Stanford CRFM, arXiv:2211.09110, 2022.
[5] Zheng, L., Chiang, W-L., Sheng, Y., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023, arXiv:2306.05685.
[6] Kwon, W., Li, Z., Zhuang, S., et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023, arXiv:2309.06180.
[7] OWASP Gen AI Security Project. OWASP Top 10 for LLM Applications 2023/24 — LLM01: Prompt Injection. OWASP, 2023.
[8] OWASP Gen AI Security Project. OWASP Top 10 for LLM Applications 2025 — LLM01:2025 Prompt Injection. OWASP, 2025.
About the Author

asleekgeek
Senior Developer, Architect, DevOps
Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist