The ML lifecycle, end to end, in production

The ML lifecycle is a closed loop — systems that skip the monitoring-to-retraining arc degrade silently in production.
A model that goes from training to serving without ever looping back to retraining is not an ML system in production — it is a one-shot batch job. The distinction matters because data distributions shift, user behaviour evolves, and the world the model was trained on drifts away from the world the model is asked to predict in. The discipline of closing that loop, reliably and repeatably, is what separates an MLOps practice from a notebook-to-API pipeline.
This article walks the eight canonical stages of the ML lifecycle — problem framing, data preparation, training, evaluation, registry, serving, monitoring, and retraining — and names, for each stage, the input it consumes, the output it produces, the failure mode that most commonly kills it, and the one decision that defines its quality. It closes with four lifecycle anti-patterns and how to recognise them before they cost you. Readers who have followed earlier articles in this series will have the vocabulary for deployment contexts and team roles already in place; this article builds on both.
The lifecycle as a closed loop
Google's architecture guidance for MLOps formalises three automation properties: continuous integration (CI) of code and data, continuous delivery (CD) of trained models to serving, and continuous training (CT) — the property unique to ML systems that automatically retrains and re-evaluates models when data conditions change [1]. CI and CD are familiar from software delivery; CT is the leg most organisations build last and break first. The monitoring-to-retraining arc — stages seven and eight in the diagram below — requires three independently functioning surfaces: a monitoring layer that detects drift, a labelling or data-refresh mechanism, and a retraining pipeline that still passes evaluation after months of quiescence. Each of those surfaces fails in its own way.
flowchart TD
A[Problem Framing] --> B[Data Preparation]
B --> C[Training]
C --> D[Evaluation]
D --> E[Registry]
E --> F[Serving]
F --> G[Monitoring]
G -->|drift / quality trigger| H[Retraining]
H --> C
E -->|end-of-life decision| I[Retirement]
I -->|remove from serving| F
style I fill:#f5f5f5,stroke:#999,stroke-dasharray:4 4

The failure to close the loop is the most common reason ML systems degrade without triggering an explicit alert. Sculley et al.'s foundational survey of ML technical debt [2] identifies feedback loops, undeclared consumers, and pipeline rot as the structural sources of this degradation, all of them downstream of a monitoring stage that was never given an actionable escalation path.
Stage 1 — Problem framing
Input: a business objective stated in natural language. Output: a measurable ML problem with a defined target metric, a baseline (typically a simple heuristic or the current rule-based system), and an explicit decision on whether ML is warranted at all.
Failure mode: skipping the baseline. A team trains a sophisticated model, benchmarks it against itself, and ships it — never establishing whether a rules-based system or a simple regression would have served equally well at a fraction of the ongoing operational cost. Without a baseline, you cannot know whether the model is adding value or merely adding complexity.
The one decision that defines its quality: does the team have a measurable success criterion that a non-technical stakeholder can verify independently of the engineering team's claims?
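The baseline comparison can be made mechanical rather than rhetorical. The following is a minimal sketch, assuming a binary classification task scored on accuracy; `accuracy`, `majority_baseline`, `beats_baseline`, and the `min_lift` margin are hypothetical names and values chosen for illustration, and the labels are made up.

```python
from collections import Counter
from typing import Callable, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train: Sequence[int]) -> Callable[[int], list[int]]:
    """Return a predictor that always emits the most common training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return lambda n: [most_common] * n


def beats_baseline(model_acc: float, baseline_acc: float, min_lift: float = 0.02) -> bool:
    """True only if the model clears the baseline by an agreed margin (assumed value)."""
    return model_acc - baseline_acc >= min_lift


# Illustrative labels and predictions (invented for this example).
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_test = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
model_pred = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]

baseline_pred = majority_baseline(y_train)(len(y_test))  # always predicts 0 here
baseline_acc = accuracy(y_test, baseline_pred)
model_acc = accuracy(y_test, model_pred)
print(f"baseline={baseline_acc:.2f} model={model_acc:.2f} "
      f"ships={beats_baseline(model_acc, baseline_acc)}")
```

The point of the margin is that a stakeholder can verify one number against another without reading the training code.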
Stage 2 — Data preparation
Input: raw data sources. Output: versioned, validated train/validation/test splits with a documented schema and transformation logic.
Failure mode: training-serving skew. The feature transformation applied at training time is not identical to the transformation applied at inference time. This is among the most insidious failure modes because it produces a model that evaluates well offline and underperforms silently in production: the performance delta is invisible to any offline test that runs through the training-time codepath. Sculley et al. [2] describe this class of problem as a source of ML-specific technical debt arising from data dependencies.
The one decision that defines its quality: is the transformation code that runs at training time the same artefact that runs at inference time, verifiably, or are there two codepaths that are assumed to be equivalent?
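One way to make "the same artefact" verifiable is to persist the fitted transformation next to the model and load it at serving time instead of re-implementing it. The sketch below assumes a single numeric feature with standard scaling; `FeatureTransform` and its methods are hypothetical names, not an existing library API.

```python
import json
import math
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FeatureTransform:
    """Fitted scaling parameters: the single artefact shared by training and serving."""
    mean: float
    std: float

    @classmethod
    def fit(cls, values: list[float]) -> "FeatureTransform":
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        return cls(mean=mean, std=math.sqrt(var) or 1.0)  # guard against zero variance

    def apply(self, value: float) -> float:
        return (value - self.mean) / self.std

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "FeatureTransform":
        return cls(**json.loads(payload))


# Training side: fit once, persist alongside the model artefact.
train_values = [10.0, 12.0, 14.0, 16.0]
fitted = FeatureTransform.fit(train_values)
bundle = fitted.to_json()

# Serving side: load the persisted transform rather than rewriting the logic.
served = FeatureTransform.from_json(bundle)

# Parity check: the same raw input must yield the same feature on both sides.
raw = 13.0
assert served.apply(raw) == fitted.apply(raw)
print(served.apply(raw))
```

A parity assertion like the last one is cheap to run in CI against a handful of recorded production inputs.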
Stage 3 — Training
Input: versioned dataset, experiment configuration. Output: a trained model artefact with tracked metadata — hyperparameters, evaluation metrics, dataset version, and a pointer back to the experiment run.
Failure mode: experiment debt. Hundreds of runs tracked inconsistently — or not tracked at all — make it impossible to reproduce the model that scored best or to understand what changed between versions. The fix is treating the experiment tracker as a first-class system of record from day one, not retrofitting it after the team has accumulated entropy across six months of ad-hoc notebooks.
For distributed training on Kubernetes, the platform substrate — gang scheduling, distributed training operators, GPU quota enforcement — becomes relevant here. The training stage is where workload shape (single-node vs multi-node, GPU-bound vs CPU-bound) most directly constrains platform design. Those infrastructure choices are covered in Part 3 of this series.
The one decision that defines its quality: can any engineer on the team reproduce the best model from a cold start, using only the experiment tracker as the source of truth?
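What "tracked metadata" has to contain for a cold-start reproduction can be sketched as a small record. This is an illustration under stated assumptions, not the schema of any particular experiment tracker: `RunRecord`, `fingerprint`, and the example dataset and revision strings are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict


def fingerprint(obj) -> str:
    """Stable SHA-256 over a JSON-serialisable object (keys sorted)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    dataset_version: str
    code_revision: str
    hyperparameters: dict
    seed: int
    metrics: dict = field(default_factory=dict)

    def run_id(self) -> str:
        """Identify the run by its inputs only, so identical inputs share an id."""
        inputs = asdict(self)
        inputs.pop("metrics")  # results must not influence the identity of the run
        return fingerprint(inputs)[:12]


# Hypothetical runs: a and b have the same inputs, c uses a newer dataset.
a = RunRecord("sales-2024-03", "abc1234", {"lr": 0.01, "depth": 6}, seed=7,
              metrics={"auc": 0.91})
b = RunRecord("sales-2024-03", "abc1234", {"depth": 6, "lr": 0.01}, seed=7)
c = RunRecord("sales-2024-04", "abc1234", {"lr": 0.01, "depth": 6}, seed=7)

print(a.run_id() == b.run_id())  # same inputs, same id
print(a.run_id() == c.run_id())  # different dataset version, different id
```

If two runs with the same id produce different metrics, something outside the record (an unpinned dependency, an unseeded shuffle) is influencing training, which is exactly the gap a reproduction attempt would hit.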
Stage 4 — Evaluation
Input: trained model artefact, held-out test set. Output: a signed-off evaluation report covering aggregate metrics, slice analysis, fairness checks, and adversarial probing — plus a model card documenting the results. Breck et al.'s ML Test Score [3] provides 28 specific tests across four categories — data tests, model tests, ML infrastructure tests, and monitoring tests — as a structured rubric for what a production-ready evaluation suite must cover.
Failure mode: aggregate metric tunnelling. A team optimises a single headline metric (accuracy, AUC, F1) and never examines slices. A model that achieves 92% overall accuracy while reaching only 61% on a minority demographic slice will pass any automated gate that checks only the aggregate and still fail an ethical review. Slice analysis is not optional for systems whose outputs affect people.
The one decision that defines its quality: does the evaluation report include slice analysis broken down by the dimensions that matter for the use case, or only an aggregate score?
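The 92% versus 61% example can be reproduced with a gate that checks every slice as well as the aggregate. The sketch below assumes the minority slice is 10% of the test set (an assumption made only to produce those two numbers); `slice_accuracy`, `evaluation_gate`, and both thresholds are hypothetical.

```python
from collections import defaultdict


def slice_accuracy(rows):
    """rows: iterable of (slice_name, y_true, y_pred). Returns {slice: accuracy}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for name, y_true, y_pred in rows:
        totals[name] += 1
        hits[name] += int(y_true == y_pred)
    return {name: hits[name] / totals[name] for name in totals}


def evaluation_gate(rows, min_overall=0.90, min_slice=0.80):
    """Pass only if the aggregate AND every slice clear their thresholds."""
    rows = list(rows)
    overall = sum(t == p for _, t, p in rows) / len(rows)
    per_slice = slice_accuracy(rows)
    failing = sorted(s for s, acc in per_slice.items() if acc < min_slice)
    return {
        "overall": overall,
        "per_slice": per_slice,
        "failing": failing,
        "passed": overall >= min_overall and not failing,
    }


# Synthetic test set: 900 rows in slice A (859 correct), 100 rows in slice B (61 correct).
rows = ([("A", 1, 1)] * 859 + [("A", 1, 0)] * 41
        + [("B", 1, 1)] * 61 + [("B", 1, 0)] * 39)

report = evaluation_gate(rows)
print(report["overall"], report["failing"], report["passed"])
```

With only the aggregate threshold the model would pass at 0.92; the slice threshold is what turns the 0.61 on slice B into a blocking finding.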
Stage 5 — Registry
Input: evaluated model artefact with attached metadata. Output: a versioned, registered model with a defined lifecycle state (experimental → staging → production → archived) and a promotion gate that must be passed before a model reaches production.
Failure mode: model-of-record drift. Production is running a model that cannot be identified in the registry, whose training run metadata has been lost, and whose training data version is unknown. This is the most dangerous silent failure in the lifecycle — it means you cannot answer the four questions a regulator or incident responder will ask: what model is serving, where did it come from, who approved it, and is the served artefact the artefact that was evaluated?
The registry also serves as the GitOps trigger: when a model transitions to the Production state, an automated handoff writes an updated serving manifest to the GitOps repository. This seam — registry promotion to infrastructure reconciliation — is the most under-documented in the standard lifecycle. Part 4 of this series covers registry patterns, lifecycle states, and the curation-policy-as-code pattern in depth.
The one decision that defines its quality: can you trace a running model in production back to its exact training dataset version, training run hash, and the person who approved its promotion — in under five minutes, from a cold start?
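The lifecycle states and the promotion gate can be read as a small state machine. The following is a sketch under assumptions, not the API of any real model registry: `Stage`, `ModelVersion`, `promote`, the allowed-transition table, and all field values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    EXPERIMENTAL = "experimental"
    STAGING = "staging"
    PRODUCTION = "production"
    ARCHIVED = "archived"


# Allowed transitions (an assumed policy); anything else is rejected.
ALLOWED = {
    Stage.EXPERIMENTAL: {Stage.STAGING, Stage.ARCHIVED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: set(),
}


@dataclass
class ModelVersion:
    name: str
    version: int
    dataset_version: str
    run_id: str
    artefact_sha256: str
    stage: Stage = Stage.EXPERIMENTAL
    approved_by: Optional[str] = None


def promote(model: ModelVersion, target: Stage, approver: Optional[str] = None) -> ModelVersion:
    """Move a model to `target`, enforcing lineage and approval before production."""
    if target not in ALLOWED[model.stage]:
        raise ValueError(f"{model.stage.value} -> {target.value} is not allowed")
    if target is Stage.PRODUCTION:
        missing = [f for f in ("dataset_version", "run_id", "artefact_sha256")
                   if not getattr(model, f)]
        if missing or not approver:
            raise PermissionError(
                f"promotion gate failed: missing={missing}, approver={approver!r}")
        model.approved_by = approver
    model.stage = target
    return model


# Hypothetical model version with placeholder lineage values.
m = ModelVersion("churn", 3, "customers-2024-03", "9f2c1a7b04de", "sha256-placeholder")
promote(m, Stage.STAGING)
promote(m, Stage.PRODUCTION, approver="alice")
print(m.stage.value, m.approved_by)
```

Because the gate refuses promotion without a dataset version, run id, artefact digest, and approver, the four questions listed above are answerable for anything that reaches the production state.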
Stage 6 — Serving
Input: a promoted model artefact from the registry, a serving configuration. Output: a containerised inference endpoint with defined SLAs, a deployment strategy (canary, blue-green, or rolling), and a rollback path.
Failure mode: shadow debt. A model is deployed manually — via a direct kubectl command or a one-off script — and exists outside any GitOps loop. The next release has no safe rollback path because the baseline state was never declared as code. Shadow deployments accumulate silently: engineers move on, the original deployer forgets, and the model is effectively orphaned with no known owner and no documented rollback procedure.
Serving infrastructure choices — runtime families (general-purpose, LLM-specialised, framework-specific, embedded), autoscaling signals, canary advancement — belong to Part 3 and Part 4 of this series. The lifecycle article's job is to name the failure mode and the quality decision; the serving articles own the how.
The one decision that defines its quality: is every production model deployment declared as code in a GitOps repository, with a documented and tested rollback path?
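Shadow deployments can be surfaced by comparing what the GitOps repository declares with what is actually running. The sketch below assumes both sides have already been reduced to a mapping from deployment name to model image digest; how the live mapping is collected is out of scope, and `find_deployment_drift` and the example names are hypothetical.

```python
def find_deployment_drift(declared: dict[str, str], live: dict[str, str]) -> dict[str, list[str]]:
    """Compare digests declared in Git with digests observed in the cluster.

    Both arguments map a deployment name to its model image digest.
    """
    declared_names, live_names = set(declared), set(live)
    return {
        # Running but never declared as code: the shadow-debt case.
        "shadow": sorted(live_names - declared_names),
        # Declared but not running.
        "missing": sorted(declared_names - live_names),
        # Running a different artefact from the one declared.
        "mismatched": sorted(n for n in declared_names & live_names
                             if declared[n] != live[n]),
    }


# Hypothetical state: one clean deployment, one hotfix, one manual deploy.
declared = {"churn": "sha256:aaa", "fraud": "sha256:bbb"}
live = {"churn": "sha256:aaa", "fraud": "sha256:ccc", "pricing": "sha256:ddd"}

print(find_deployment_drift(declared, live))
```

Run on a schedule, a check like this reports a manual deployment shortly after it happens rather than at the next failed rollback.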
Stage 7 — Monitoring
Input: live prediction requests and outcomes, ground-truth labels (where available), and system telemetry. Output: drift alerts, performance degradation signals, and — critically — a retraining trigger. Gama et al.'s comprehensive survey of concept drift adaptation [4] distinguishes three drift types that the monitoring layer must handle separately: covariate shift (the input distribution changes but the underlying relationship holds), concept drift (the relationship between input and target changes), and label drift (the distribution of target labels shifts). Each requires a different detection strategy and a different remediation response.
Failure mode: alert-only monitoring without an actionable response. Alerts fire, no one owns the on-call rotation for ML quality, the alert is silenced, and the model continues degrading. A monitoring layer without a defined owner, an escalation path, and a retraining trigger is logging theatre — it generates the appearance of observability without the operational capability to act on it.
The one decision that defines its quality: does every drift alert have a named owner, a defined escalation path, and a retraining trigger — or do alerts accumulate in a dashboard that no one is on call to read?
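A drift alert becomes actionable when the score is tied to an owner and an action. The sketch below uses the Population Stability Index, a technique swapped in here as one common way to detect a shift in an input distribution (it says nothing about concept drift or label drift). The bin edges, the 0.2 threshold (a frequently quoted rule of thumb, treated here as an assumption), the owner name, and `route_alert` are all hypothetical.

```python
import math
from bisect import bisect_right


def psi(reference, current, edges, eps=1e-6):
    """Population Stability Index between two samples over fixed, sorted bin edges."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def route_alert(feature, score, owners, threshold=0.2):
    """Turn a drift score into an action with a named owner, or None below threshold."""
    if score < threshold:
        return None
    return {"feature": feature, "psi": round(score, 3),
            "owner": owners[feature], "action": "open_retraining_request"}


# Synthetic data: the live sample is the training sample shifted upwards by 30.
reference = list(range(100))
current = [v + 30 for v in reference]
edges = [25, 50, 75]
owners = {"basket_value": "payments-ml-oncall"}  # hypothetical owner mapping

score = psi(reference, current, edges)
print(route_alert("basket_value", score, owners))
```

Looking the owner up with `owners[feature]` is deliberate: a monitored feature with no owner raises a `KeyError` instead of producing an alert that nobody receives.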
Stage 8 — Retraining
Input: a retraining trigger (scheduled, drift-triggered, or manual), refreshed data. Output: a new candidate model that has passed the same evaluation suite as the original and been promoted through the registry.
Failure mode: pipeline rot. The retraining pipeline was written during the original project, never maintained as a production service, and fails silently when triggered months later because a dependency has changed, a data source has moved, or the infrastructure configuration has drifted from the environment the pipeline was written for. The retraining pipeline must be treated as a production service — with tests, versioning, and on-call ownership — not as a script that worked once.
The retraining pipeline should be the same artefact as the training pipeline — not a parallel script. If retraining requires a separate code path, that path will diverge from the original and the divergence will be discovered at the worst possible moment: when a production model needs to be replaced urgently.
The one decision that defines its quality: is the retraining pipeline tested on a schedule independently of whether a retraining trigger has fired — so that pipeline rot is detected before it matters?
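A scheduled smoke test for the retraining pipeline can be sketched as a function that runs the real entry point on a tiny fixture and checks the output contract. Everything here is an assumption for illustration: `smoke_test`, the required output keys, and `toy_pipeline`, which stands in for the actual pipeline entry point. The scheduler that invokes it (cron, CI, or an orchestrator) is left out.

```python
from typing import Callable


def smoke_test(pipeline: Callable[[list[dict]], dict], fixture: list[dict],
               required_keys=("model", "metrics", "dataset_version")) -> tuple[bool, str]:
    """Run the pipeline entry point on a tiny fixture and check its output contract."""
    try:
        result = pipeline(fixture)
    except Exception as exc:  # any failure is a finding, not a crash of the checker
        return False, f"pipeline raised {type(exc).__name__}: {exc}"
    missing = [k for k in required_keys if k not in result]
    if missing:
        return False, f"output missing keys: {missing}"
    return True, "ok"


def toy_pipeline(rows: list[dict]) -> dict:
    """Hypothetical stand-in for the real training/retraining entry point."""
    targets = [row["target"] for row in rows]  # raises KeyError if the schema changed
    mean = sum(targets) / len(targets)
    mae = sum(abs(t - mean) for t in targets) / len(targets)
    return {"model": {"kind": "mean", "value": mean},
            "metrics": {"mae": mae},
            "dataset_version": "fixture-v1"}


healthy = smoke_test(toy_pipeline, [{"target": 1.0}, {"target": 3.0}])
rotted = smoke_test(toy_pipeline, [{"label": 1.0}])  # simulated upstream schema change
print(healthy)
print(rotted)
```

The second call shows the intended behaviour: a renamed column surfaces as a failed check on an ordinary day, not as a failed retraining run during an incident.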
How the lifecycle shifts across deployment contexts
The eight-stage lifecycle is universal. What changes across the deployment-context spectrum — pure-cloud, on-premises, hybrid, air-gapped — is where each stage executes, who operates it, and what constraints apply. In a pure-cloud context, most pipeline infrastructure is managed; in an on-premises or air-gapped context, every runner, registry, and monitoring backend is self-hosted and self-maintained. The lifecycle itself does not change; the operational burden at each stage does.
Two stages are most visibly affected by deployment context. Data preparation splits along data-residency lines in hybrid and regulated environments — some features may only be computed on the on-premises side, creating a pipeline that spans an interconnect boundary. Monitoring is affected in air-gapped environments because telemetry cannot leave the perimeter, so every observability backend — metrics, logs, traces, drift detection — must run inside the perimeter.
Four lifecycle anti-patterns
These four anti-patterns appear consistently in ML systems that fail in production. Recognising them early is cheaper than diagnosing them after a degradation incident.
1. The open loop
The model is deployed and the team moves on. There is no monitoring, no drift detection, and no retraining trigger. The model degrades silently until a business stakeholder notices that something has gone wrong — typically months after the model started failing. This is the most common lifecycle anti-pattern and the easiest to prevent: deploy monitoring at the same time as the model, not afterwards.
2. The frozen pipeline
Monitoring is deployed but the retraining pipeline has not been maintained. Drift alerts fire, the on-call engineer acknowledges them, and then discovers that the retraining pipeline fails for an unrelated reason — a broken dependency, a changed data schema, a rotated credential. The fix is continuous smoke-testing of the retraining pipeline on a schedule, independent of whether a drift signal has been received.
3. The unregistered deployment
A model is deployed outside the registry — directly to a serving endpoint, via a manual script, or by copying an artefact from a shared drive. The registry state and the serving state diverge. The next engineer to investigate a production issue cannot determine which model version is running or trace it back to a training run. This anti-pattern often originates from a well-intentioned hotfix that was never formalised.
4. The dual codepath
The training pipeline and the retraining pipeline are separate scripts that share no code. The transformation logic diverges between them over time. The model trained by the retraining pipeline produces different outputs than the model trained by the original pipeline on the same data — not because the model has been intentionally changed, but because the two codepaths have silently drifted apart. The fix is a single pipeline with a parameter that controls whether the run is an initial training run or a retraining run.
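A single pipeline with a mode parameter might look like the sketch below. The transformation and the toy least-squares fit are placeholders for real steps, and `RunMode`, `run_pipeline`, and the versioning rule are hypothetical; the property being illustrated is that both modes execute the same transformation and training code.

```python
from enum import Enum


class RunMode(Enum):
    INITIAL = "initial"
    RETRAIN = "retrain"


def transform(rows):
    """The one shared feature transformation (placeholder logic)."""
    return [(row["x"] / 100.0, row["y"]) for row in rows]


def fit(pairs):
    """Toy least-squares slope through the origin; stands in for real training."""
    num = sum(x * y for x, y in pairs)
    den = sum(x * x for x, _ in pairs)
    return {"slope": num / den}


def run_pipeline(rows, mode: RunMode, previous_version: int = 0):
    """Same steps for both modes; the mode only affects bookkeeping."""
    if mode is RunMode.RETRAIN and previous_version < 1:
        raise ValueError("retrain mode needs the version being replaced")
    model = fit(transform(rows))
    version = 1 if mode is RunMode.INITIAL else previous_version + 1
    return {"model": model, "version": version, "mode": mode.value}


rows = [{"x": 100, "y": 2.0}, {"x": 200, "y": 4.0}]
first = run_pipeline(rows, RunMode.INITIAL)
second = run_pipeline(rows, RunMode.RETRAIN, previous_version=first["version"])
print(first)
print(second)
```

Because the mode never reaches `transform` or `fit`, the same data yields the same model in both modes, which is the invariant the dual codepath breaks.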
What this series carries forward
The eight stages and their failure modes are the shared vocabulary for the rest of this series. Part 2 continues with the organisational patterns for owning the lifecycle — because the lifecycle's failure modes do not all arise from technical choices. Many arise from unclear ownership at the stage boundaries: who owns the monitoring-to-retraining handoff, who owns the registry-to-serving handoff, and what happens when a stage has no named owner. Part 3 goes deep on training workloads on Kubernetes and Part 4 covers registry patterns and lifecycle state management in depth.
References
[1] Google Cloud. MLOps: Continuous delivery and automation pipelines in machine learning. Google Cloud Architecture Center. https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
[2] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015. https://proceedings.neurips.cc/paper_files/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html
[3] Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. IEEE Big Data 2017. https://research.google/pubs/the-ml-test-score-a-rubric-for-ml-production-readiness-and-technical-debt-reduction/
[4] Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), Article 44. https://doi.org/10.1145/2523813
About the Author

asleekgeek
Senior Developer, Architect, DevOps
Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist