When ML breaks: an incident-response playbook for production models

11 min read · asleekgeek

ML incidents require a different triage instinct than classic infrastructure failures.

An ML serving incident can look entirely healthy by infrastructure metrics. Pods are running, CPU is nominal, no HTTP 5xx are firing — and the model has been returning subtly wrong answers for the past two hours. This is the defining challenge of ML operations: the system can be up while simultaneously being broken in ways that matter to users. The first job of on-call triage is therefore not "is it up?" but "which of three distinct failure surfaces is degrading?"

The discipline this playbook borrows from is service-level incident response: alert on symptoms users feel, route on the request path, and write the runbook before the page fires — a principle developed in detail in the Google SRE Workbook, "Alerting on SLOs". What this series adds is the ML-specific layer that classic SRE does not address: the silent wrongness of a model that has drifted, regressed, or been manipulated.

The three failure surfaces

An ML serving incident almost always resolves to one of three surfaces. The response for each is different enough that misclassifying the incident wastes the first ten minutes of triage, which are the most expensive:

  • Data issue — the inputs changed. An upstream schema drift, a feature pipeline emitting nulls, a stale feature store, or a genuine shift in the real-world distribution the model sees. The model and the server are behaving exactly as built; the world moved.
  • Model issue — the inputs are fine but the predictions are wrong or worse than baseline. A bad deploy of a regressed model version, a corrupted artefact, or concept drift that has crossed the threshold where the current model is no longer fit for purpose.
  • Serving issue — the predictions would be fine, but the request path is failing: latency, saturation, out-of-memory pod restarts, autoscaler thrash, a broken tokenizer dependency, or a downstream timeout. This is the surface closest to classic SRE.

The triage rule: rule out the serving surface first. It has the loudest, most trustworthy signals (5xx, latency percentiles, saturation metrics) and is the cheapest to confirm. If the request path is healthy, distinguish a data problem from a model problem using the input-distribution and recent-deploy signals described below.

The most expensive triage mistake is reaching for the model first. The model is the slowest surface to inspect — re-running evaluation, comparing versions — and the least likely to have spontaneously changed. Confirm the request path and the input distribution before opening a notebook.
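The triage order above can be sketched as a small decision function. This is a minimal illustration, not a definitive implementation: the `TriageSignals` fields, the `classify_surface` name, and the 1% 5xx threshold are all hypothetical and would need to be mapped onto your own dashboards.

```python
from dataclasses import dataclass


@dataclass
class TriageSignals:
    """Hypothetical snapshot of what the on-call engineer can see right now."""
    error_rate_5xx: float        # fraction of requests returning 5xx
    p99_latency_ms: float        # current p99 latency
    p99_slo_ms: float            # latency objective for p99
    input_drift_detected: bool   # input-distribution monitor above threshold
    recent_model_deploy: bool    # a model version shipped shortly before onset


def classify_surface(s: TriageSignals, max_5xx: float = 0.01) -> str:
    """Return the most likely failure surface, cheapest check first."""
    # 1. Serving: loudest and cheapest signals, so rule it out first.
    if s.error_rate_5xx > max_5xx or s.p99_latency_ms > s.p99_slo_ms:
        return "serving"
    # 2. Data: the request path is healthy but the inputs moved.
    if s.input_drift_detected:
        return "data"
    # 3. Model: inputs look normal and a new version recently shipped.
    if s.recent_model_deploy:
        return "model"
    # No signal points anywhere yet: keep gathering evidence.
    return "undetermined"
```

The ordering of the checks is the point: the model surface is only reached once the two cheaper surfaces have been ruled out.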

The single most useful triage artefact is a shared timeline overlay: deploy events, feature-pipeline runs, and the metric that paged, on one axis. Most ML incidents are diagnosed not by inspecting the model but by reading what changed at the moment the symptom started. If nothing in your control plane changed, the change is in the world — the signature of a data issue.
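A timeline overlay can start as nothing more than a merged, filtered event list. The sketch below assumes you can export control-plane events into a common shape; the `Event` type, its `kind` values, and the two-hour lookback are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    at: datetime
    kind: str      # e.g. "deploy", "feature_pipeline", "monitor_change"
    detail: str


def changes_before_onset(events, onset, lookback=timedelta(hours=2)):
    """Return events within `lookback` before symptom onset, newest first."""
    window_start = onset - lookback
    hits = [e for e in events if window_start <= e.at <= onset]
    return sorted(hits, key=lambda e: e.at, reverse=True)
```

An empty result is itself informative: if nothing in the control plane changed inside the window, the data surface becomes the leading suspect.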

Drift taxonomy: what kind of drift are you dealing with?

Not all drift is the same, and diagnosing the type determines the response. The taxonomy established by Gama et al. in their 2014 ACM Computing Surveys paper "A Survey on Concept Drift Adaptation" distinguishes two fundamental classes, virtual drift and real drift. A third category, label drift, is worth tracking alongside them in practice:

  • Virtual drift (covariate shift) — the input distribution P(X) changes, but the conditional relationship P(y|X) is stable. The model is still correct given what it sees; the world is sending different inputs. This does not necessarily require retraining — the model may still be valid.
  • Real drift (concept drift) — the relationship P(y|X) itself has changed. The model is now wrong in a systematic way. Retraining is the path once quality SLO breach is confirmed.
  • Label drift — the marginal distribution of labels P(y) shifts independently. Common in classification tasks where class frequencies change over time (e.g. fraud rates rising seasonally). Often accompanies real drift but is diagnostically distinct.

The diagnostic question during a drift alert is: have inputs moved (virtual drift) or has the input-to-output relationship moved (real drift)? If you have ground-truth labels arriving in near-real-time, you can answer this directly by computing model error on the recent window. If labels are delayed — the common case — you must infer it from prediction-distribution monitoring and business proxies.
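To answer the "have inputs moved?" half of that question, one common option is the Population Stability Index (PSI) over a single numeric feature. PSI is not prescribed by this playbook or by Gama et al.; it is swapped in here as a simple, dependency-free illustration. The function name, the equal-width binning, and the smoothing constant `eps` are assumptions.

```python
import math


def psi(reference, recent, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Smooth empty bins so the logarithm below is always defined.
        return [max(c / len(sample), eps) for c in counts]

    ref_p, rec_p = proportions(reference), proportions(recent)
    return sum((r - e) * math.log(r / e) for e, r in zip(ref_p, rec_p))
```

A frequently quoted rule of thumb treats a PSI above roughly 0.2 as a meaningful shift, but that cut-off is a convention to tune per feature, not a guarantee. PSI only speaks to P(X); whether P(y|X) moved still has to come from labelled error or a business proxy.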

SLOs and error budgets as the triage anchor

The SRE model — Service Level Indicators, Objectives, and error budgets — gives ML incident response its decision boundary. Without an SLO, every drift alert and every latency percentile is a judgment call. With one, the question becomes binary: is the budget burning above the threshold that warrants a page?

The multi-window, multi-burn-rate alerting pattern from the Google SRE Workbook pairs a fast-burn alert for acute outages with a slow-burn alert for gradual erosion. The illustrative constants the Workbook derives for a 99.9% objective over 30 days:

Severity      | Long window | Short window | Burn rate | Budget consumed
Page (fast)   | 1 hour      | 5 min        | 14.4x     | ~2%
Page (medium) | 6 hours     | 30 min       | 6x        | ~5%
Ticket (slow) | 3 days      | 6 hours      | 1x        | ~10%

These constants are illustrative; re-derive them from your own objective and window. The key insight they encode is that a short window alone generates flapping alerts, while a long window alone is slow to reset and keeps firing well after the burn has stopped. Both windows must exceed the burn-rate threshold together to page.
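The "budget consumed" column follows from simple arithmetic: burn rate multiplied by the long window, divided by the SLO period. The sketch below reproduces the three rows for a 30-day period and shows the two-window paging condition; the function names are hypothetical.

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO period, as in the table above


def budget_consumed(burn_rate: float, long_window_hours: float) -> float:
    """Fraction of the period's error budget spent if burn_rate holds for the long window."""
    return burn_rate * long_window_hours / SLO_PERIOD_HOURS


def should_alert(long_burn: float, short_burn: float, threshold: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    return long_burn > threshold and short_burn > threshold


# Reproduce the table: (burn rate, long window in hours)
for rate, hours in [(14.4, 1), (6, 6), (1, 72)]:
    print(f"burn {rate}x over {hours}h spends {budget_consumed(rate, hours):.0%} of budget")
```

Changing `SLO_PERIOD_HOURS` or the target budget fractions is how the constants would be re-derived for a different objective.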

For ML quality SLOs (prediction error rate, classification accuracy, embedding similarity score) the same alerting pattern applies. The SLI is the fraction of requests meeting the quality threshold; the error budget is the complement of the objective, that is, 1 minus the SLO target. Symptom-based alerting on quality, not cause-based alerting on "the model version changed", is what makes triage actionable.
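As a minimal sketch of a quality SLI, assume each request in a window carries a numeric quality score (for example, a similarity score) and a fixed threshold defines "good". The function names and the score representation are hypothetical.

```python
def quality_sli(scores, threshold):
    """Fraction of requests whose quality score meets the threshold."""
    if not scores:
        raise ValueError("no requests in window")
    return sum(s >= threshold for s in scores) / len(scores)


def burn_rate(sli: float, objective: float) -> float:
    """Observed bad-request rate divided by the budgeted bad-request rate."""
    error_budget = 1.0 - objective  # complement of the objective
    return (1.0 - sli) / error_budget
```

With a 99% objective, a window in which 98% of requests meet the threshold burns budget at about twice the sustainable rate. That burn rate is the value to compare against the alerting thresholds.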

Severity and ownership

ML incidents need an explicit severity rubric because the silent-wrongness failure mode does not trip infrastructure alarms. A useful starting default:

SEV1: Wrong predictions affecting users at scale, or full serving outage.
      Page on-call immediately; consider rollback or fallback before root-cause.
SEV2: Quality degraded on a slice, or latency SLO burning fast.
      Page on-call; mitigate within the hour.
SEV3: Drift detected, no user-visible impact yet.
      Ticket; investigate in business hours.
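Encoding the rubric makes it testable and keeps paging policy out of individual judgement at 3 a.m. The sketch below is one possible reading of the default above; the parameter names are hypothetical and the boolean inputs would come from your own alert metadata.

```python
def classify_severity(user_visible: bool, at_scale: bool, full_outage: bool,
                      latency_fast_burn: bool, drift_detected: bool) -> str:
    """Map incident facts onto the starting severity rubric."""
    # SEV1: wrong predictions at scale, or the service is down.
    if full_outage or (user_visible and at_scale):
        return "SEV1"
    # SEV2: degradation limited to a slice, or the latency budget burning fast.
    if user_visible or latency_fast_burn:
        return "SEV2"
    # SEV3: drift with no user-visible impact yet.
    if drift_detected:
        return "SEV3"
    return "none"
```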

Ownership is shared by design. The platform team owns the request path and observability infrastructure. The ML engineer and model owner own quality thresholds and the decision to roll back or retrain. The most common organisational failure here is alert-only monitoring with no one on the hook for ML quality — drift alerts that fire into a vacuum.

Runbook stubs

The three stubs below are deliberately skeletal. A runbook only becomes useful once an on-call engineer has localised it to their actual model servers, dashboards, and escalation paths. Each follows the same shape: confirm the alert is real, localise the failure, decide the failure class, then mitigate. Mitigation always favours a fast, reversible action — roll back, raise a replica floor, shed load — over live root-cause debugging, because reducing user impact buys the time to investigate calmly.

Runbook: drift alert

Signal: A monitoring job reports input-feature drift or prediction-distribution drift above threshold.

Likely cause: Data issue (most often) or concept drift crossing quality SLO.

First three diagnostic queries:

  1. Did the detector itself change? Check the monitoring change log for a re-baselined reference window, a new threshold deployment, or a monitoring version bump. A notable fraction of drift pages are the detector drifting, not the data.
  2. Which features, which slice, since when? Broad drift across every feature suggests an upstream schema change or pipeline break. Narrow drift on one feature in one segment suggests a single data source. Pin the start time against the feature-pipeline and deploy timelines.
  3. Is the prediction quality SLO still intact? Query the quality SLI over the same window. If quality is within SLO despite input drift, this is virtual drift (covariate shift) — the model is still valid. If quality is breached, this is real concept drift and the severity escalates.

Rollback criterion: If the drift traces to a feature-pipeline break, roll back or fix the pipeline. If concept drift is confirmed and quality SLO is breached, escalate to a retraining decision. Drift alone, with quality inside SLO, is SEV3 — do not retrain reflexively.
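The three diagnostic answers and the rollback criterion combine into a short decision path. This is a sketch under the assumption that each answer has been reduced to a boolean; the function name and the returned strings are hypothetical.

```python
def drift_alert_decision(detector_changed: bool, pipeline_break: bool,
                         quality_within_slo: bool) -> str:
    """Walk the drift runbook in order and return the next action."""
    # Query 1: the detector itself moved, so the page may be false.
    if detector_changed:
        return "false page suspected: verify the detector re-baseline or threshold change"
    # Query 2: drift traced to an upstream break.
    if pipeline_break:
        return "mitigate: roll back or fix the feature pipeline"
    # Query 3: inputs moved but quality holds, so this is virtual drift.
    if quality_within_slo:
        return "SEV3: ticket and investigate, do not retrain reflexively"
    # Quality SLO breached with no pipeline cause: real concept drift.
    return "escalate: open a retraining decision"
```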

Runbook: p99 latency spike

Signal: The inference p99 latency SLO error budget is burning above the fast-burn threshold.

Likely cause: Serving issue: saturation, autoscaler thrash, cold-start tail, large-input tail, or a recent serving-layer change.

First three diagnostic queries:

  1. Read the latency histogram at p50 and p99, not the mean. A p99 spike with a flat p50 is a tail problem — a subset of requests (large inputs, cold replicas, slow downstream calls) — not a uniform slowdown. The mean hides exactly the requests that fired the page.
  2. Separate queue time from compute time. If your observability pipeline emits spans that follow the OpenTelemetry GenAI semantic conventions, use them to split time-in-queue from time-on-accelerator. Rising queue time points at saturation or autoscaler lag. Rising compute time points at larger inputs or a model/runtime change.
  3. Check saturation and recent scaling events. Inspect accelerator utilisation, batch queue depth, and replica count. Autoscaler thrash — scale up, cold start, scale down, repeat — is a classic tail-latency generator for model servers with slow warm-up.

Rollback criterion: Fast levers: raise the replica floor to absorb the tail, cap maximum batch size or input length, shed or queue non-critical traffic. If a recent serving-layer deploy correlates with the spike, roll it back — a serving regression is faster to revert than to debug live.
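The first two diagnostic queries can be sketched against raw latency samples. The example below uses a nearest-rank percentile and a hypothetical 1.5x regression factor; the function names, the factor, and the assumption that per-request queue and compute times are available are all illustrative.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def diagnose_tail(baseline_ms, current_ms, factor=1.5):
    """Compare p50 and p99 against baseline to tell a tail problem from a uniform slowdown."""
    p50_up = percentile(current_ms, 50) > factor * percentile(baseline_ms, 50)
    p99_up = percentile(current_ms, 99) > factor * percentile(baseline_ms, 99)
    if p99_up and not p50_up:
        return "tail problem"
    if p99_up and p50_up:
        return "uniform slowdown"
    return "no p99 regression"


def dominant_component(queue_ms, compute_ms, base_queue_ms, base_compute_ms):
    """Report whether queue time or compute time grew more versus baseline."""
    if queue_ms - base_queue_ms > compute_ms - base_compute_ms:
        return "queue: suspect saturation or autoscaler lag"
    return "compute: suspect larger inputs or a model/runtime change"
```

Note that neither function looks at the mean, for the reason given in the first diagnostic query.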

Runbook: prompt-injection attempt

Signal: A guardrail or input/output filter flags suspected prompt injection. Prompt injection is ranked LLM01:2025 — the top entry in the OWASP Top 10 for LLM Applications 2025. It applies to any LLM-backed system where user-supplied or retrieved content can influence model behaviour.

First three diagnostic queries:

  1. Classify the injection type. Direct injection arrives in the user's own prompt. Indirect injection arrives via content the model retrieves or ingests — a document, a web page, a tool result — and is the harder case because the payload is not visible in the conversation the operator sees. The OWASP taxonomy distinguishes these explicitly.
  2. Determine the blast radius. What could the injected turn reach: which tools, which data scopes, which downstream actions? If the session has tool or data access beyond a read scope, treat reachable secrets and side-effecting tools as potentially exercised.
  3. Preserve evidence before anything is purged. Capture the offending request, the retrieved context that carried the payload, and the model's output. For an indirect attack, snapshot the upstream source the payload rode in on — that source is the entry point to close.

Rollback criterion: Block or rate-limit the source, then close the gap: enforce least-privilege on tool and data access, segregate untrusted retrieved content from instruction context, and add the observed pattern to input/output validation. Prompt injection is mitigated by defence-in-depth, not by a single filter.
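The blast-radius question in step 2 is easier to answer quickly if each session's tool grants are recorded in a structured form. The sketch below assumes such a record exists; the `Tool` type, its fields, and the scope strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    side_effecting: bool     # can it write, send, pay, or delete?
    data_scopes: frozenset   # data the tool can read, e.g. {"kb:read"}


def blast_radius(session_tools):
    """Summarise what an injected turn could have reached in this session."""
    exposed_scopes = sorted(set().union(*(t.data_scopes for t in session_tools)))
    risky_tools = sorted(t.name for t in session_tools if t.side_effecting)
    return {
        "exposed_scopes": exposed_scopes,
        "side_effecting_tools": risky_tools,
        # Any side-effecting tool in reach is treated as potentially exercised.
        "treat_as_exercised": bool(risky_tools),
    }
```

The same summary doubles as a least-privilege review: any tool or scope listed that the session did not need is a gap to close.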

Common false pages

A disproportionate share of ML on-call pages are not incidents at all. Recognising them fast keeps the rotation sustainable and the signal-to-noise ratio high:

  • The monitor moved, not the system. A re-baselined drift reference window, a redeployed detector, or a threshold change presents as a step in the metric. Check the monitoring change log before the data.
  • Label lag masquerading as quality drop. When ground-truth labels arrive late, a quality metric computed over a partially-labelled recent window looks like degradation. Confirm the labelling completeness of the evaluation window before declaring a model issue.
  • Cold-start latency after a scale event. A p99 spike that resolves on its own within minutes of a scale-up or rollout is warm-up behaviour, not a regression — though it is a signal to raise the replica floor or pre-warm replicas.
  • Seasonality read as drift. Genuine, expected periodic shifts — weekday/weekend traffic patterns, end-of-month spikes — trip naive drift detectors. The fix is a seasonality-aware reference window, not a retrain.
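The label-lag false page in particular can be gated mechanically: refuse to trust a quality metric until enough of the evaluation window is labelled. A minimal sketch follows, with hypothetical function names and a 90% completeness floor chosen only for illustration.

```python
def labelling_completeness(n_predictions: int, n_labelled: int) -> float:
    """Share of predictions in the evaluation window that have ground truth."""
    if n_predictions == 0:
        return 0.0
    return n_labelled / n_predictions


def quality_drop_is_trustworthy(n_predictions: int, n_labelled: int,
                                min_completeness: float = 0.9) -> bool:
    """Only treat a quality drop as real once the window is sufficiently labelled."""
    return labelling_completeness(n_predictions, n_labelled) >= min_completeness
```

The right floor depends on how labels arrive. If late labels are systematically different from early ones, for example disputed transactions, even a high completeness figure can mislead.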

After the incident: the blameless post-mortem

Every SEV1 and SEV2 earns a blameless post-mortem. The blameless norm — documented in the Google SRE Book, Chapter 15 — holds that incidents are the product of systems, not individual failures, and that the goal is to strengthen the system, not assign blame. Published templates such as the PagerDuty Post-Mortem Template provide a concrete starting structure.

The ML-specific addition to a standard post-mortem template is a surface-attribution line — data issue, model issue, or serving issue — because tracking which surface fails most often tells you where to invest. Three serving incidents in a row is a capacity or deployment-process problem. Three data incidents in a row is a data-quality or pipeline-validation problem. Surface attribution is how patterns become visible over quarters rather than individual incidents.
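Surface attribution only pays off if it is tallied. Assuming each post-mortem record carries a `surface` field (a hypothetical schema), a quarterly roll-up is a few lines:

```python
from collections import Counter


def surface_counts(postmortems):
    """Count incidents per failure surface across a set of post-mortems."""
    return Counter(pm["surface"] for pm in postmortems)


def top_surface(postmortems):
    """Return the most frequent surface, or None when there are no records."""
    counts = surface_counts(postmortems)
    return counts.most_common(1)[0][0] if counts else None
```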

A post-mortem that ends without a concrete change to monitoring, the runbook, or a guardrail has not closed the loop. Five action categories tend to do the most to prevent recurrence: (1) add or tighten a detection signal, (2) fix the runbook so the next on-call engineer goes faster, (3) reduce the blast radius of the incident class, (4) improve promotion gates so the failure mode cannot reach production, and (5) update the severity rubric if the rubric missed the incident.

ML post-mortem template additions

ml-postmortem-additions.md
## ML-specific post-mortem additions

### Surface attribution
- [ ] Data issue  [ ] Model issue  [ ] Serving issue

### Drift/quality context
- Was a quality SLO breached? (Y/N, which SLO, for how long)
- Was this real drift (concept drift) or virtual drift (covariate shift)?
- Was label lag a contributing factor?

### Model and data state at incident time
- Model version in production at incident start:
- Last deploy timestamp:
- Feature pipeline last successful run:
- Reference dataset / training cut-off date:

### Detection gap
- How long between incident start and page?
- Which signal detected it first?
- What signal, if any, could have detected it earlier?

### Action items (surface-specific)
- [ ] Update drift detector reference window / threshold
- [ ] Add/improve quality SLI
- [ ] Tighten model promotion gate
- [ ] Fix/add runbook step
- [ ] Reduce tool/data blast radius for LLM incidents

Where this fits in the series

This article is the operational counterpart to the ML lifecycle stages article (series-order/05), which covers steady-state monitoring. The SLIs and error budgets this playbook acts on are defined and instrumented in the observability for ML article. Cost and FinOps context for sizing the replica floor and headroom decisions sits in the FinOps for ML article (series-order/07).

References

  1. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A Survey on Concept Drift Adaptation. ACM Computing Surveys, 46(4), article 44. DOI: 10.1145/2523813. Establishes the virtual drift / real drift taxonomy.
  2. Beyer, B. et al. (Eds.). Google SRE Book — Service Level Objectives (Ch. 4). Google. sre.google/sre-book/service-level-objectives/. The SLI/SLO/error-budget model.
  3. Murphy, N. et al. Google SRE Workbook — Alerting on SLOs (Ch. 5). Google. sre.google/workbook/alerting-on-slos/. Multi-window, multi-burn-rate alerting pattern and worked constants.
  4. Beyer, B. et al. Google SRE Book — Monitoring Distributed Systems (Ch. 6). Google. sre.google/sre-book/monitoring-distributed-systems/. Symptom-based vs cause-based alerting; four golden signals.
  5. Beyer, B. et al. Google SRE Book — Postmortem Culture: Learning from Failure (Ch. 15). Google. sre.google/sre-book/postmortem-culture/. Blameless postmortem norm and methodology.
  6. OWASP Gen AI Security Project. LLM01:2025 Prompt Injection — OWASP Top 10 for LLM Applications 2025. genai.owasp.org/llmrisk/llm01-prompt-injection/. Direct vs indirect injection classification; top-ranked LLM risk.
  7. Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015. proceedings.neurips.cc/paper/2015/…. ML-specific risk factors including input-distribution shift and hidden feedback loops.
  8. PagerDuty. Post-Mortem Template — PagerDuty Incident Response Documentation. response.pagerduty.com/after/post_mortem_template/. Published blameless post-mortem structure.

Tags

#incident-response #sre #series:ai-platform-mlops #series-order/08

About the Author

asleekgeek

Senior Developer, Architect, DevOps

Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist
