AI Platform Engineering & MLOps Series · Part XXIII of 34

AI Platform maturity — five levels and the single move that unlocks each

Five levels from ad-hoc to self-improving, the evidence pattern at each, and the single highest-leverage move that advances you to the next.


Ad-hoc · Repeatable · Managed · Governed · Self-improving

Maturity models have a long lineage in software engineering. Watts Humphrey’s 1988 paper in IEEE Software introduced a five-level staged framework for characterising software process capability [1], and the Carnegie Mellon SEI formalised it as the Capability Maturity Model (CMM) in 1993, naming the levels Initial, Repeatable, Defined, Managed, and Optimizing [2]. The same staged structure, with adapted vocabulary and sometimes fewer levels, now underpins the major platform-maturity frameworks in the MLOps and platform-engineering space: Microsoft’s MLOps Maturity Model [3], Google Cloud’s MLOps pipeline-automation levels [4], and the CNCF Platform Engineering Maturity Model [5].

This article uses a five-level vocabulary (Ad-hoc, Repeatable, Managed, Governed, Self-improving) that maps closely to the Microsoft and CMM lineages while being framed specifically for an AI/ML platform context. The model is additive: each level inherits the capabilities of the one below it. Progression is measured not by tool installation but by evidence: what does the organisation’s normal operating behaviour look like, reproducibly, day after day?

Where most organisations are today

Practitioner surveys consistently place most mid-size organisations at an equivalent of Level 2–3: shared tooling is present, adoption is inconsistent, and CI/CD for models exists in intent but is not uniformly enforced. The DORA 2024 State of DevOps Report found that organisations with a dedicated internal developer platform reported markedly higher deployment frequencies than those without one [6], suggesting that investing in platform infrastructure as a product — the defining move from Level 1 to Level 2 — has measurable engineering-velocity consequences. An empirical multi-case study across 14 organisations published in 2025 validated a maturity framework for MLOps adoption and confirmed that teams commonly stall at mid-levels due to the combination of tooling complexity and organisational inertia [7].

The five levels at a glance

The summaries below describe each level in four parts: the definition (what the level means in operational terms), the evidence pattern (what you can observe in a working environment at this level), the unlock move (the single highest-leverage action that advances the organisation to the next level), and the anti-pattern (the most common way organisations misjudge or mishandle the level).

Level 1 — Ad-hoc

Definition: No shared infrastructure for AI/ML. Every model deployment is a bespoke artefact. Knowledge is siloed in individuals.

Evidence pattern: Models trained in notebooks on individual machines; no model versioning beyond file timestamps; deployments are bespoke scripts with no health checking; GPU access is informal (a single shared VM allocated by convention); no one can enumerate how many models are in production.

Unlock move: Designate the platform as an internal product with a named owner and a written charter. Provide one shared Kubernetes cluster with a GPU node pool and one shared experiment tracker. The technical investment is modest; the organisational decision — that platform is a product, not an IT side-task — is the real move.

Anti-pattern: Teams believe they are at Level 2 because they have a shared Jupyter hub. A shared notebook server is shared compute, not a platform. If you cannot reproduce a trained model’s artefact from a logged experiment run, you are at Level 1.

Level 2 — Repeatable

Definition: One shared cluster and GPU pool; at least one golden-path template that a team other than the platform team has used successfully without hand-holding; a shared experiment tracker in active use.

Evidence pattern: A central AI inventory lists at least 70% of production models; CI runs model evaluation before merge; one design-partner team has a model in production via the golden path; cost for GPU infrastructure can be attributed to at least a department, if not a team.

Unlock move: Get paved-road adoption to 40% of new workloads without mandating it. If adoption requires a mandate, the road is not good enough — fix the developer experience, not the policy. This is the litmus test for Level 3 readiness.

Anti-pattern: The model registry exists but has no stage gates, so “production” and “experiment” artefacts coexist in the same store. A registry without promotion gates is a file server with a nicer UI.
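To make the idea of a promotion gate concrete, here is a minimal sketch of a registry entry that can only advance a stage when evaluation evidence is logged and meets a threshold. Everything in it is hypothetical and for illustration only: the stage names, the `ModelVersion` and `promote` names, and the `accuracy` metric are assumptions, not the API of any particular registry product.

```python
from dataclasses import dataclass, field

# Assumed stage names; the article only distinguishes "experiment" from "production".
STAGES = ("experiment", "staging", "production")


class PromotionError(Exception):
    """Raised when a model version fails a stage gate."""


@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "experiment"
    eval_metrics: dict[str, float] = field(default_factory=dict)


def promote(model: ModelVersion, thresholds: dict[str, float]) -> ModelVersion:
    """Advance one stage only if every required metric is logged and meets its floor."""
    position = STAGES.index(model.stage)
    if position == len(STAGES) - 1:
        raise PromotionError(f"{model.name} v{model.version} is already in production")

    # Gate 1: the evidence must exist at all.
    missing = sorted(m for m in thresholds if m not in model.eval_metrics)
    if missing:
        raise PromotionError(f"missing evaluation evidence: {missing}")

    # Gate 2: the evidence must meet the agreed floor.
    failing = sorted(m for m, floor in thresholds.items() if model.eval_metrics[m] < floor)
    if failing:
        raise PromotionError(f"metrics below threshold: {failing}")

    model.stage = STAGES[position + 1]
    return model


if __name__ == "__main__":
    gate = {"accuracy": 0.90}  # hypothetical threshold
    candidate = ModelVersion("churn-classifier", 3, eval_metrics={"accuracy": 0.93})
    print(promote(candidate, gate).stage)
```

The design point is that the stage field can change only through `promote`, so an artefact without evidence stays in `experiment` no matter who uploaded it.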

Level 3 — Managed

Definition: 40–70% of new AI workloads start on a golden path; CI/CD for models is the norm; policy-as-code admission gates block production deploys that lack evaluation evidence; cost attribution is accurate enough to produce a showback dashboard.

Evidence pattern: Developer NPS is tracked; a second golden path has been added based on user research; policy gates run automatically at deploy time rather than via manual review; GPU-hours and token spend are attributed per team on a live dashboard.

Unlock move: Fully automate governance infrastructure (lineage capture, risk-tier labelling, and evidence-pack generation) for all production workloads. The target: any production AI system’s complete audit pack is generatable in under a day without manual assembly.

Anti-pattern: The platform team is drowning in support tickets because the golden paths do not cover the long tail of use cases. A platform measured by ticket volume, not paved-road adoption, is managing its customers rather than serving them.
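Paved-road adoption, the measure this level is judged by, is simple to compute once new workloads are recorded somewhere. The sketch below assumes a hypothetical `Workload` record with an `on_golden_path` flag; how that flag gets set (template metadata, a label on the deployment, a registry field) is left open, and the band boundaries are taken from the 40–70% range in the Level 3 definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    on_golden_path: bool  # True if the workload started from a paved-road template


def adoption_share(new_workloads: list[Workload]) -> float:
    """Fraction of new workloads that started on a golden path (0.0 if there are none)."""
    if not new_workloads:
        return 0.0
    return sum(w.on_golden_path for w in new_workloads) / len(new_workloads)


def adoption_band(share: float) -> str:
    """Place an adoption share relative to the 40-70% band in the Level 3 definition."""
    if share < 0.40:
        return "below the Level 3 band"
    if share <= 0.70:
        return "within the Level 3 band"
    return "above the Level 3 band"


if __name__ == "__main__":
    # Hypothetical quarter: 10 new workloads, 5 started from a template.
    quarter = [Workload(f"svc-{i}", on_golden_path=(i % 2 == 0)) for i in range(10)]
    share = adoption_share(quarter)
    print(f"{share:.0%} {adoption_band(share)}")
```

Tracking this number next to ticket volume shows whether support load comes from gaps in the golden paths or from teams that never got onto them.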

Level 4 — Governed

Definition: 80%+ of production AI workloads on the platform; cost attribution accurate enough for fair chargeback; every production workload has a risk-tier label, full lineage, and an up-to-date model card; the governance board can pull an evidence pack for any system in under an hour.

Evidence pattern: Supply-chain attestation (SBOMs, signed model artefacts) is standard; risk-tier labels are applied automatically at registration time and map to controls from frameworks such as NIST AI RMF and the EU AI Act; automated evidence-pack generation is a platform service, not a manual task.

Unlock move: Establish an inner-source contribution model so the platform team is no longer the sole source of new capabilities. Product teams PR golden paths and plugins; the platform team reviews and maintains standards. Until this is in place, Level 5 is structurally unreachable — one team cannot scale to serve every advanced use case.

Anti-pattern: Governance infrastructure is rigorous but adoption is plateauing because the paved road has become a mandatory highway with no escape valves. Governance that optimises for audit readiness at the expense of developer velocity will trigger shadow IT — teams route around the platform rather than through it.
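As an illustration of evidence-pack generation as a platform service, the sketch below checks that a production workload carries the three items the Level 4 definition requires (risk-tier label, lineage, model card) and only then serialises a pack. The record shape, field names, and URIs are hypothetical; a real pack would also need to verify that the model card is current, which this sketch does not attempt.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Items every production workload must carry, per the Level 4 definition.
REQUIRED_FIELDS = ("risk_tier", "lineage_uri", "model_card_uri")


@dataclass
class ProductionWorkload:
    name: str
    risk_tier: Optional[str] = None       # e.g. "high"; tier names are an assumption
    lineage_uri: Optional[str] = None     # pointer to captured lineage
    model_card_uri: Optional[str] = None  # pointer to the model card


def missing_evidence(workload: ProductionWorkload) -> list[str]:
    """Return the required fields that are empty or unset."""
    return [f for f in REQUIRED_FIELDS if not getattr(workload, f)]


def evidence_pack(workload: ProductionWorkload) -> str:
    """Serialise the audit evidence as JSON, refusing to emit an incomplete pack."""
    gaps = missing_evidence(workload)
    if gaps:
        raise ValueError(f"{workload.name}: cannot build evidence pack, missing {gaps}")
    return json.dumps(asdict(workload), indent=2, sort_keys=True)


if __name__ == "__main__":
    scorer = ProductionWorkload(
        name="credit-scorer",
        risk_tier="high",
        lineage_uri="lineage://credit-scorer/42",       # hypothetical URI scheme
        model_card_uri="cards://credit-scorer/42",      # hypothetical URI scheme
    )
    print(evidence_pack(scorer))
```

Running `missing_evidence` across the whole inventory gives the governance board a gap list before an auditor asks for one.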

Level 5 — Self-improving

Definition: The platform team’s primary output is standards and tooling for inner-source contributors, not hands-on feature delivery; capacity planning is predictive; model-quality monitoring triggers automated retraining for lower-risk tiers without human intervention.

Evidence pattern: Product teams contribute golden paths and platform plugins via pull requests, publicly credited in the platform changelog; capacity planning uses historical utilisation trends to pre-provision for planned campaigns; platform deployment frequency for models and lead time for prompt changes are tracked as first-class KPIs against DORA benchmarks [8].

Unlock move: There is no Level 6. The move at Level 5 is to avoid the level’s specific failure mode: optimising one capability area (e.g. automated retraining) to Level 5 while remaining at Level 2 in another (e.g. governance). Sustained Level 5 requires balanced development across all seven capability areas of the platform surface: data, training, registry, serving, observability, governance, and developer experience.

Anti-pattern: Lopsidedness. The characteristic failure at this level is not stagnation but imbalance: teams that have genuinely reached Level 5 on one axis often discover they are at Level 2 on another. A continuous-delivery pipeline for models means little if the governance infrastructure cannot keep pace with deployment frequency.
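Lopsidedness is easy to surface if each of the seven capability areas (data, training, registry, serving, observability, governance, developer experience) is scored separately. The sketch below is one possible way to do that, not a scoring method the article prescribes: it assumes the overall level is capped by the weakest area and reports the spread between strongest and weakest. The function name and the example scores are hypothetical.

```python
CAPABILITY_AREAS = (
    "data", "training", "registry", "serving",
    "observability", "governance", "developer_experience",
)


def assess_balance(levels: dict[str, int]) -> dict:
    """Summarise per-area maturity levels (1-5) into an overall level and a spread."""
    if set(levels) != set(CAPABILITY_AREAS):
        raise ValueError("expected exactly one level per capability area")
    if any(not 1 <= lvl <= 5 for lvl in levels.values()):
        raise ValueError("levels must be between 1 and 5")

    lowest = min(levels.values())
    highest = max(levels.values())
    return {
        # Assumption: the platform is only as mature as its weakest area.
        "overall_level": lowest,
        # A large spread is the lopsidedness signal.
        "spread": highest - lowest,
        "weakest_areas": sorted(a for a, lvl in levels.items() if lvl == lowest),
    }


if __name__ == "__main__":
    scores = {area: 4 for area in CAPABILITY_AREAS}
    scores["training"] = 5     # automated retraining is far ahead
    scores["governance"] = 2   # governance has not kept pace
    print(assess_balance(scores))
```

The `weakest_areas` list is where the next unlock move should be aimed; raising the strongest area further does not change the overall level under this assumption.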


DORA metrics as a cross-check

Forsgren, Humble, and Kim’s Accelerate established four software delivery metrics — deployment frequency, lead time for changes, time to restore service, and change failure rate — as the empirically validated indicators of high organisational performance [8]. These metrics translate directly to the AI platform context:

Deployment frequency for models: how often is a new or updated model version promoted to production? At Level 2 this is monthly or ad-hoc; at Level 4–5 it is weekly or on-demand.
Lead time for changes: how long from a merged model or prompt change to a production deployment? At Level 2 this is days to weeks (manual approval chains); at Level 4 it is hours (automated evaluation gates).
Time to restore: when a model degrades in production, how long until a rollback or fix is deployed? At Level 3 the platform provides canary rollout and automated rollback; at Level 2 rollback is a manual operation.
Change failure rate: what fraction of model deployments require a hotfix or rollback? A high rate at Level 3 signals that evaluation gates are present but not sufficient — the gate criteria need tightening, not the deployment pipeline.
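The four metrics above can be computed from a plain log of model deployments. The following is a minimal sketch under stated assumptions: the `ModelDeployment` record and its field names are hypothetical, lead time is measured from merge to production deploy, and time to restore is approximated as the gap between a failed deployment and its recorded restoration (a real pipeline would more likely measure from incident detection).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class ModelDeployment:
    merged_at: datetime                     # model or prompt change merged
    deployed_at: datetime                   # change reached production
    failed: bool = False                    # needed a hotfix or rollback
    restored_at: Optional[datetime] = None  # service restored, if it failed


def dora_metrics(deployments: list[ModelDeployment], window_days: int) -> dict:
    """Compute the four delivery metrics over an observation window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_hours = [
        (d.deployed_at - d.merged_at).total_seconds() / 3600 for d in deployments
    ]
    failures = [d for d in deployments if d.failed]
    restore_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployments_per_week": len(deployments) / (window_days / 7),
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        # None when no failed deployment has a recorded restoration.
        "median_time_to_restore_hours": median(restore_hours) if restore_hours else None,
    }


if __name__ == "__main__":
    start = datetime(2025, 1, 1)
    log = [
        ModelDeployment(start, start + timedelta(hours=2)),
        ModelDeployment(
            start + timedelta(days=1),
            start + timedelta(days=1, hours=4),
            failed=True,
            restored_at=start + timedelta(days=1, hours=5),
        ),
    ]
    print(dora_metrics(log, window_days=14))
```

Medians are used rather than means so that one slow approval chain or one long outage does not dominate the figure.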

The DORA 2024 report found that organisations using an internal developer platform reported higher overall software delivery performance than those without one [6], corroborating the model’s framing: the Level 1-to-2 transition — treating the platform as a product — has a measurable, reproducible effect on delivery throughput.

How the frameworks compare

Three published frameworks are useful anchors when positioning your organisation on this scale.

Microsoft MLOps Maturity Model [3] defines five levels (0 through 4) assessed across three dimensions: people and culture, processes and structures, and objects and technology. Level 0 is no MLOps (equivalent to ad-hoc); Level 4 is Full MLOps Automated Operations (equivalent to self-improving). The Microsoft model is operationally detailed and worth reading as a checklist for Levels 0–2.

Google Cloud MLOps levels [4] use three levels (0–2) focused tightly on pipeline automation: Level 0 is manual scripts; Level 1 is automated training pipelines; Level 2 adds CI/CD pipeline automation for the pipelines themselves. The Google model is narrower in scope than the five-level model here (it focuses on training and deployment pipelines, not governance or developer experience), but it is precise on the CI/CD mechanics of Levels 2–3.

CNCF Platform Engineering Maturity Model [5] defines four levels (Provisional, Operational, Scalable, Optimizing) assessed across five dimensions: investment, adoption, interfaces, operations, and measurement. It is broader than an ML-specific model — it applies to any internal platform — but its Scalable and Optimizing levels map closely to Levels 4 and 5 here, particularly on the inner-source and self-service interface dimensions.

This article	Microsoft MLOps [3]	Google Cloud [4]	CNCF Platform Eng [5]
Level 1 — Ad-hoc	Level 0 — No MLOps	Level 0 — Manual	Provisional
Level 2 — Repeatable	Level 1 — DevOps, No MLOps	Level 1 — Automated training	Operational
Level 3 — Managed	Level 2 — Automated Training	Level 2 — CI/CD for pipelines	Scalable
Level 4 — Governed	Level 3 — Automated Model Deployment	—	Scalable / Optimizing
Level 5 — Self-improving	Level 4 — Full MLOps Automated Ops	—	Optimizing

The capability surface at each level

The seven capability areas of an AI platform — data, training, registry, serving, observability, governance, and developer experience — each progress through the five levels. The capability surface article in this series maps the minimum-viable and mature states for each area. The key architectural principle that runs through all five levels is this: the platform team owns the shared infrastructure layer; the consuming squad owns the models, feature logic, evaluation criteria, and business SLOs. That boundary should be explicit and documented at every level — blurring it is the root cause of the most common Level 3 failure mode (platform team as bottleneck).

Context shapes the timeline, not the destination

The five levels apply regardless of deployment context — pure cloud, hybrid, on-premises, or regulated air-gapped — but the speed and constraints of each transition differ significantly.

Pure-cloud deployments have managed services for most Level 2–3 infrastructure, so the bottleneck is practice and culture, not infrastructure availability. Level 3 is reachable faster here than in any other context.

On-premises deployments require a platform engineering team of at least 4–6 people to operate the full self-hosted stack reliably through Level 3. Below that headcount, the operational overhead of self-managed infrastructure tends to consume the capacity needed for paved-road development, keeping the team at Level 2 regardless of intent.

Regulated or air-gapped deployments often face Level 4 as a regulatory floor rather than an organisational aspiration. Frameworks such as the EU AI Act and NIST AI RMF impose audit-trail, evidence-pack, and lineage requirements that are structurally Level 4 capabilities — meaning regulated organisations cannot reach production for high-risk AI systems without first solving the governance infrastructure problem.

Using this model in practice

A maturity model is a diagnostic instrument, not a scorecard. Three practical uses:

1. Scope the next quarter’s platform roadmap. Identify which capabilities are at Level N and which are at Level N-1. Build the unlock move for the lowest-level capability first, because lopsidedness is the dominant failure mode at Level 4 and above.
2. Frame the business case. DORA’s findings [6][8] give you evidence that improvements in deployment frequency and lead time translate to engineering throughput gains. Map the unlock move to a DORA metric improvement to make the argument in terms the CFO can evaluate.
3. Align hiring to the right level. A team at Level 1 needs a platform lead who can establish product discipline and build the first golden path; the skill set is as much programme management as infrastructure engineering. A team at Level 4 building toward Level 5 needs engineers who can design contribution models and operate at the standards-and-tooling level, not individual feature builders.

Diagnostic instrument, not scorecard: A maturity level is only as useful as the conversation it prompts. The goal is not to achieve a number; it is to identify the single unlock move that has the highest expected value for your organisation’s next quarter.


References

[1] Humphrey, W.S. “Characterizing the Software Process: A Maturity Framework.” IEEE Software 5(2), 1988, pp. 73–79. DOI: 10.1109/52.2014. dl.acm.org/doi/10.1109/52.2014
[2] Paulk, M.C. et al. “The Capability Maturity Model for Software.” Carnegie Mellon SEI, 1993. sunnyday.mit.edu/16.355/cmm.pdf
[3] Microsoft. “MLOps Maturity Model.” Azure Architecture Center, Microsoft Learn. learn.microsoft.com
[4] Google Cloud. “MLOps: Continuous delivery and automation pipelines in machine learning.” Cloud Architecture Center. cloud.google.com
[5] CNCF TAG App Delivery. “Platform Engineering Maturity Model.” November 2023. tag-app-delivery.cncf.io
[6] DORA. “Accelerate State of DevOps Report 2024.” Google Cloud, 2024. dora.dev/research/2024/dora-report/
[7] Bergström et al. “An empirical guide to MLOps adoption: Framework, maturity model and taxonomy.” Information and Software Technology 183, 2025. DOI: 10.1016/j.infsof.2025.107725. sciencedirect.com
[8] Forsgren, N., Humble, J., Kim, G. Accelerate: The Science of Lean Software and DevOps. IT Revolution Press, 2018. ISBN 978-1942788331. itrevolution.com/product/accelerate/
