AI Platform Engineering & MLOps Series · Part 6 of 34

Four organisational patterns for shipping ML

And when each one breaks. Centralised platform team, embedded MLOps, federated ownership, centre-of-excellence — the failure mode that ends each pattern, and how to read the signals before it happens.

How you organise the people who ship ML systems determines what kind of ML system you build. That is not a management platitude — it is Conway's Law, stated in 1968 and confirmed repeatedly in ML contexts since: any organisation that designs a system will produce a design that mirrors its own communication structure. A centralised MLOps team produces a centralised pipeline. A loosely coupled federation of product teams produces independently-owned pipelines that share almost nothing. Getting the org pattern right before scaling up model production is one of the highest-leverage decisions a platform lead can make, because the wrong pattern does not just slow you down — it creates technical debt that is sociological, not just architectural.

Conway's original formulation, “organisations which design systems are constrained to produce designs which are copies of the communication structures of these organisations”, was published in Datamation in 1968 [1]. The ML-systems literature has since documented the specific manifestation: Sculley et al.'s 2015 NeurIPS paper on hidden technical debt in ML systems [6] identifies configuration glue code, undeclared consumers, and hidden feedback loops as failure modes. These emerge most readily when the team that trains a model and the team that serves it are not the same and have no well-defined handoff contract.

This article covers four patterns that recur consistently in engineering case studies and the academic literature. For each pattern, the analysis covers the staffing shape, the consumer experience, the most common failure mode, and the maturity level at which it fits. A comparison table follows. The article closes with the match between org pattern and deployment context, because pairing an org pattern with the wrong infrastructure strategy reliably produces a ticket queue rather than a platform.

The four patterns at a glance

Before the details: the four patterns are not a maturity ladder where each one supersedes the last. They are different bets about where ML knowledge and operational accountability should sit. The right choice depends on team size, model volume, regulatory posture, and how much infrastructure complexity the platform team is prepared to absorb on behalf of consumers.

1. Centralised

3–8 MLOps engineers own the full lifecycle for all model-producing teams

Best fit: < 5 ML teams or regulated mandate

2. Embedded

1–2 MLOps engineers sit inside each product team, no central ownership

Best fit: Large autonomous teams + shared rails

3. Federated

Platform team owns rails; product teams own model code and configurations

Best fit: ≥ 10 ML engineers or 3+ product teams

4. Platform-as-MLOps

Platform team owns entire lifecycle; product teams consume endpoints

Best fit: Regulated mandate or initiation phase

Pattern 1 — Centralised MLOps team

Staffing shape

A dedicated team — typically three to eight engineers — owns the full ML lifecycle for all model-producing teams in the organisation. They run the infrastructure, set the standards, operate the pipelines, and often act as embedded consultants on individual projects. Model-producing product teams hand work off to this team; they do not own their own ML infrastructure.

Consumer experience

Consistent. One set of standards, one platform, one team to call when something breaks. For a product team that does not want to think about infrastructure, this is the lowest-friction starting point. The cost of that simplicity is throughput: the central team's capacity is the pipeline bottleneck.

Failure mode

Queue formation. As the number of model-producing teams grows, the central team becomes a single point of failure and a sprint-priority battleground. Product teams learn dependency rather than self-sufficiency. When the two most senior engineers on the central team leave, institutional memory for every production model leaves with them.

Maturity fit

Best at low ML-team count (fewer than five model-producing teams) or where a single accountable operator is a regulatory requirement. Organisations in regulated industries — financial services, healthcare, defence — sometimes mandate a single accountable owner for all model production decisions; the centralised pattern is structurally the best fit for that constraint. It is also the natural starting point before an organisation has enough ML team surface to justify the overhead of Pattern 3.

Practical Implication: If you choose Pattern 1 at initiation, name the transition signals now — queue depth exceeding two sprints, or more than five model-producing teams — and plan the migration to Pattern 3 before the queue becomes a crisis. The centralised team is meant to be a scaffold, not a permanent structure.
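
The transition signals named above can be written down as an explicit check, so the migration trigger is agreed before the queue forms. The following is a minimal sketch, not a prescribed implementation: the two thresholds come from the text, while `CentralTeamSnapshot`, its field names, and `transition_signals` are hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Thresholds quoted in the text above; tune them to your own organisation.
MAX_QUEUE_DEPTH_SPRINTS = 2
MAX_MODEL_PRODUCING_TEAMS = 5


@dataclass(frozen=True)
class CentralTeamSnapshot:
    """Hypothetical point-in-time reading of the central team's load."""

    queue_depth_sprints: float  # backlog size, expressed in sprints of work
    model_producing_teams: int  # teams currently handing models over


def transition_signals(snapshot: CentralTeamSnapshot) -> list[str]:
    """Return the Pattern 1 to Pattern 3 transition signals that have fired."""
    fired = []
    if snapshot.queue_depth_sprints > MAX_QUEUE_DEPTH_SPRINTS:
        fired.append("queue depth exceeds two sprints")
    if snapshot.model_producing_teams > MAX_MODEL_PRODUCING_TEAMS:
        fired.append("more than five model-producing teams")
    return fired


print(transition_signals(CentralTeamSnapshot(queue_depth_sprints=2.5, model_producing_teams=4)))
```

An empty list means the scaffold is still holding; any entry is a prompt to start the migration plan rather than a reason to hire into the central team.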

Pattern 2 — Embedded MLOps engineer

Staffing shape

One or two MLOps engineers sit inside each ML-producing product team, accountable to that team's delivery metrics rather than a central function. A shared platform team may still exist, but the embedded engineers are the primary lifecycle operators. In some organisations, dotted-line reporting to a platform or CoE lead maintains cross-team standards without centralising control.

Consumer experience

Fast. No central bottleneck. The embedded engineer knows the product's models personally, reduces handoff friction to near zero, and can move at the product team's sprint cadence. The team owns its own infrastructure pager — which is an on-call recruiting and retention cost for small teams.

Failure mode

Shadow infrastructure. When five product teams each have embedded engineers and no shared platform, each team builds its own monitoring stack, its own model registry configuration, its own CI gate — and they disagree with each other in subtle and expensive ways. Algorithmia's 2021 enterprise ML survey of 403 organisations found that 64% take a month or longer to deploy a model to production; a significant portion of that friction traces back to teams independently reinventing deployment plumbing rather than sharing it [3].

The structural risk is that embedded engineers, accountable to product delivery, systematically under-invest in platform-quality practices — testing, observability, retraining loops — in favour of shipping features. The pattern works best in combination with Pattern 3, where a shared platform team provides the golden paths and the embedded engineers stay on them.

Maturity fit

Organisations with large, autonomous product teams and high model volume. The pattern is particularly effective when a platform team provides golden paths that embedded engineers stay on. Standalone (without shared rails), it is a transitional state that accumulates shadow infrastructure debt as the organisation scales.

Pattern 3 — Federated (platform team owns rails, product teams own paths)

Staffing shape

A platform team of two to six engineers owns and operates the shared infrastructure — the compute substrate, the registry, the serving platform, the CI/CD templates. Product teams own their own model code, pipelines, and deployment configurations. The relationship is explicitly producer-consumer, with a published API surface between them. In Team Topologies terms [2], this is the platform-team topology operating in X-as-a-Service interaction mode — the platform provides capability through a well-defined interface, minimising collaboration overhead on both sides.
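
One way to picture the published API surface is as a typed interface that product-team code depends on, with the platform team free to change everything behind it. The sketch below is an assumption-laden illustration: `ModelPlatform`, `InMemoryPlatform`, `ship`, the method names, the endpoint path format, and the example artifact URI are all hypothetical and do not describe any real product.

```python
from __future__ import annotations

from typing import Protocol


class ModelPlatform(Protocol):
    """Hypothetical interface the platform team publishes to product teams."""

    def register(self, name: str, version: str, artifact_uri: str) -> str:
        """Store a model version in the shared registry; return its identifier."""
        ...

    def deploy(self, model_id: str) -> str:
        """Serve a registered model; return the endpoint path."""
        ...


class InMemoryPlatform:
    """Toy stand-in for the platform team's real implementation."""

    def __init__(self) -> None:
        self._registry: dict[str, str] = {}

    def register(self, name: str, version: str, artifact_uri: str) -> str:
        model_id = f"{name}:{version}"
        self._registry[model_id] = artifact_uri
        return model_id

    def deploy(self, model_id: str) -> str:
        if model_id not in self._registry:
            raise KeyError(f"unknown model: {model_id}")
        name, _, version = model_id.partition(":")
        return f"/models/{name}/versions/{version}"


def ship(platform: ModelPlatform, name: str, version: str, artifact_uri: str) -> str:
    """Product-team code: it depends only on the published interface."""
    return platform.deploy(platform.register(name, version, artifact_uri))


# The artifact URI is an invented example value.
print(ship(InMemoryPlatform(), "churn", "3", "s3://example-bucket/churn-3"))
```

Because `ship` is typed against the interface rather than the implementation, the platform team can swap the backing registry or serving layer without touching product-team code, which is the point of the X-as-a-Service interaction mode.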

Consumer experience

High velocity within the rails, with the cognitive load of understanding those rails. Teams are self-service for the common cases. The platform team is not in the critical path for routine model deployments — only for infrastructure changes. On-call is split: the platform team owns the shared infrastructure pager; product teams own the per-model pager.

Failure mode

Platform built for the average use case. Edge cases force teams off the paved path, and once a team is off-path, the platform team's leverage disappears. The second failure mode is the one the CNCF Platforms Working Group whitepaper [5] identifies as the critical differentiator: the platform team that treats the platform as infrastructure (best-effort, with breaking changes shipped without deprecation cycles) rather than as a product with user research, documented APIs, and a feedback loop. DORA's 2023 State of DevOps research found that teams with a strong user focus achieve 40% higher organisational performance than teams without it [4]. The federated pattern either succeeds by internalising this principle or fails by ignoring it.

Maturity fit

Most organisations above approximately ten ML engineers, or with three or more model-producing product teams. This is the pattern described in Skelton and Pais as the platform-team topology, and is the most commonly recommended pattern in the literature for ML teams at scale. It is the natural evolution from Pattern 1 once the central-team queue becomes the primary delivery constraint.

Practical Implication: The federated pattern fails not at the infrastructure level but at the product discipline level. The leading indicator of failure is when product teams start building their own monitoring pipelines or custom CI configurations rather than extending the platform's golden paths — a signal that the platform team has stopped doing user research with its consumers.

Pattern 4 — Platform team as MLOps (lifecycle as a platform capability)

Staffing shape

The platform team does not just own the infrastructure — it owns the entire lifecycle. Product teams hand off model code and data; the platform team runs training, evaluation, registry promotion, serving, and monitoring as managed services. Product teams consume versioned inference endpoints. Platform team headcount is six to twelve engineers for every ten consuming teams.

Consumer experience

Very low cognitive load for product teams. They own a model API, not a pipeline. They do not need ML infrastructure expertise. Onboarding a new product team is fast because the platform team handles the operational surface entirely. This centre-of-excellence model suits two cases: early-stage orgs that lack distributed ML ownership, and regulated environments where a named accountable operator is mandatory.

Failure mode

The platform team becomes the ML department, and product teams atrophy their ability to experiment. Innovation velocity collapses as all model changes require platform-team scheduling. The platform team owns every production incident, including model quality degradations that require business-domain knowledge the platform team does not have. In practice, Pattern 4 creates an escalation path where the business-domain team and the platform team jointly diagnose failures that neither can fully understand alone.

Maturity fit

Regulated environments where a single accountable operator is mandated, and very small organisations that lack ML expertise in product teams. Financial services and healthcare contexts sometimes structurally require this pattern: the risk function demands a named team accountable for every model in production. In non-regulated contexts, Pattern 4 is rarely the right long-term choice — it is a starting point for organisations that cannot yet staff distributed ML ownership.

The four-pattern comparison table

The table below compares the four patterns across the dimensions that matter most to a platform lead making the initial structure decision.

Dimension	Centralised	Embedded	Federated	Platform-as-MLOps
Platform headcount / 10 teams	4–8	1–2 (advisory)	2–6	6–12
Infrastructure on-call	Platform team	Each product team	Split: shared → platform; per-model → team	Platform team
Model quality on-call	Platform + product jointly	Product team	Product team	Platform team (with escalation)
Primary failure mode	Queue formation	Shadow infrastructure	Platform for average case	Platform becomes ML dept
Best maturity fit	< 5 ML teams or regulated	Scaling + shared rails	Scaling to operating (≥10 eng.)	Initiation or regulated
Consumer experience	Consistent; slow	Fast; inconsistent	Fast within rails	Low cognitive load; low autonomy
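
The table's fit criteria can be condensed into a small shortlisting function. This is a rough sketch under stated assumptions rather than a decision procedure: the article's thresholds overlap (fewer than five teams favours Centralised, three or more teams favours Federated), so the tie-break below, which keeps Centralised only while both team count and engineer count are under their thresholds, is an assumption. `Pattern`, `FAILURE_MODE`, and `candidate_patterns` are hypothetical names.

```python
from __future__ import annotations

from enum import Enum


class Pattern(Enum):
    CENTRALISED = "Centralised"
    EMBEDDED = "Embedded"
    FEDERATED = "Federated"
    PLATFORM_AS_MLOPS = "Platform-as-MLOps"


# "Primary failure mode" row of the comparison table.
FAILURE_MODE = {
    Pattern.CENTRALISED: "Queue formation",
    Pattern.EMBEDDED: "Shadow infrastructure",
    Pattern.FEDERATED: "Platform for average case",
    Pattern.PLATFORM_AS_MLOPS: "Platform becomes ML dept",
}


def candidate_patterns(
    ml_teams: int, ml_engineers: int, single_operator_mandated: bool
) -> list[Pattern]:
    """Shortlist primary patterns.

    Embedded is never returned: the article treats it as a layer on top of
    shared rails, not as a standalone choice.
    """
    if single_operator_mandated:
        # A regulatory mandate overrides the size-based recommendation.
        return [Pattern.CENTRALISED, Pattern.PLATFORM_AS_MLOPS]
    if ml_teams < 5 and ml_engineers < 10:
        return [Pattern.CENTRALISED]
    return [Pattern.FEDERATED]


for pattern in candidate_patterns(ml_teams=5, ml_engineers=8, single_operator_mandated=False):
    print(f"{pattern.value}: watch for '{FAILURE_MODE[pattern]}'")
```

Returning the failure mode alongside the recommendation keeps the two together: the pattern choice and the signal that it is about to break are a single decision.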

The on-call question is the diagnostic

None of these four patterns gives a clear default answer to the question: who answers the pager at 2 a.m. for a model quality degradation in production? Naming that answer explicitly — before the incident happens — is more important than which pattern an organisation chooses. A model quality degradation is not a pure infrastructure problem: it requires business-domain knowledge (what does a 5% drop in precision mean for this product?), model knowledge (which feature distribution has drifted?), and infrastructure knowledge (is the pipeline serving stale features, or is there a genuine model regression?). The three knowledge types sit in different teams in Patterns 2 and 3, and all in the platform team in Patterns 1 and 4.

Sculley et al. name this as the organisational dimension of ML technical debt [6]: misalignment between the team that trains a model and the team that serves it creates configuration glue code, undeclared consumers, and hidden feedback loops. The on-call answer is a forcing function that reveals this misalignment before it becomes debt.

Diagnostic Question: Before choosing a pattern, write down the name of the engineer who owns the pager for a model quality degradation in production. If you cannot name them, the pattern you choose will not matter — you are already accumulating the technical debt Sculley et al. described.
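
The diagnostic can be run against a model inventory rather than answered from memory. The sketch below assumes a hypothetical inventory record with separate infrastructure and model-quality pager owners; `ProductionModel`, `unowned_models`, and the example entries are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProductionModel:
    name: str
    infra_pager_owner: Optional[str]    # who is paged when serving breaks
    quality_pager_owner: Optional[str]  # who is paged when model quality drops


def unowned_models(models: list[ProductionModel]) -> list[str]:
    """Names of models with no named engineer on the model-quality pager."""
    return [m.name for m in models if not (m.quality_pager_owner or "").strip()]


# Invented example inventory.
inventory = [
    ProductionModel("churn-scorer", "platform-oncall", "a.engineer"),
    ProductionModel("fraud-ranker", "platform-oncall", None),
]
print(unowned_models(inventory))
```

A check like this fits naturally as a gate on registry promotion: a model that reaches production without a quality pager owner is the misalignment the diagnostic is meant to surface.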

Matching the org pattern to the deployment context

The four deployment contexts (pure-cloud, on-prem, hybrid, and regulated/air-gapped) change which org pattern is viable. The infrastructure strategy determines the platform's shape; the org pattern determines who runs it. They must be consistent.

In pure-cloud, the managed-service layer reduces the operational burden enough that Pattern 3 (federated) is viable at relatively small team sizes — a two-person platform team can operate a functional shared ML platform on top of managed training, managed registry, and managed serving services. On-prem or hybrid deployments require more platform-team headcount to operate the same capability surface, which pushes earlier-stage organisations toward Pattern 1 (centralised) until they can staff the platform team adequately.

Regulated and air-gapped environments introduce a structural constraint that overrides the maturity-based recommendation: if a single accountable operator is mandated by the regulatory framework, the choice is between Pattern 1 and Pattern 4 regardless of team size. The difference is the degree of product-team autonomy the regulator permits.

ThoughtWorks' Continuous Delivery for Machine Learning framework [7] names the discipline coordination requirement directly: shipping ML in production requires data engineering, data science, testing, infrastructure engineering, and release engineering to coordinate. The org pattern is the mechanism that structures that coordination. A mismatch (a federated infrastructure with a centralised MLOps team, for example) produces a ticket queue rather than a platform, because every product team's infrastructure request lands in the central team's backlog.

The maturity-level thresholds

Each pattern is a good fit at a specific maturity level, and a poor fit above or below it. These threshold observations connect forward to the maturity-model article in this series, which covers the five-level progression in detail.

Pattern 1 (Centralised) fits maturity levels 1 and 2 — the initiation and repeatable stages. At level 1, there is no shared vocabulary, no shared tooling, and the central team's ability to impose consistency is a feature, not a constraint. At level 2, the central team has established standards and the pipeline is operational, but the organisation has not yet scaled model production to the point where the queue is the primary bottleneck.

Pattern 3 (Federated) becomes the right answer at maturity levels 3 and 4 — managed and governed. At level 3, the platform has enough surface area to be operated as a product with defined APIs and deprecation contracts. At level 4, the platform is governed: cost attribution, access control, and audit trails are first-class platform capabilities that the federated model surfaces to product teams through self-service interfaces.

Pattern 2 (Embedded) is a durable hybrid at all maturity levels, but only in combination with a shared platform team providing rails. Standalone, it is a transitional state that accumulates shadow infrastructure debt as the organisation matures.

Pattern 4 (Platform-as-MLOps) fits level 1 in non-regulated contexts — organisations that have not yet distributed ML ownership — and is a permanent fit in heavily regulated contexts where the constraint is structural rather than maturity-based.
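
The thresholds in this section reduce to a lookup from maturity level to fitting patterns. The sketch below encodes only what the four paragraphs above state; `patterns_for_level` and its parameters are hypothetical names. This article does not say which of Patterns 1 and 3 fits level 5, so for that level the function returns only the patterns the text explicitly covers.

```python
from __future__ import annotations


def patterns_for_level(level: int, regulated: bool, shared_rails: bool) -> list[str]:
    """Patterns the thresholds above describe as a fit for a maturity level (1 to 5)."""
    if level not in range(1, 6):
        raise ValueError("maturity level must be between 1 and 5")
    fits = []
    if level in (1, 2):
        fits.append("Centralised")        # initiation and repeatable stages
    if shared_rails:
        fits.append("Embedded")           # durable only on top of shared rails
    if level in (3, 4):
        fits.append("Federated")          # managed and governed stages
    if regulated or level == 1:
        fits.append("Platform-as-MLOps")  # permanent fit when regulated
    return fits


print(patterns_for_level(level=3, regulated=False, shared_rails=True))
```

An empty result, for example level 5 without regulation or shared rails, marks a case this section leaves open rather than a recommendation to adopt no pattern.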

The ML lifecycle article in this series established what work happens at each stage of a production ML system. This article has established who owns that work in four structural arrangements. The next article in Part 2 covers FinOps for AI: once you know who owns the ML lifecycle, the question becomes how to attribute and control the cost of running it — the showback-to-chargeback ladder, the label scheme, and the unit economics of training and inference.

The org pattern also shapes the maturity-model trajectory. An organisation that starts at Pattern 1 and does not plan the transition to Pattern 3 will find the centralised team becomes a permanent bottleneck rather than a temporary scaffold. Planning the transition — naming the signals that trigger it (queue depth, team count, model volume) — is part of the initial org-pattern decision, not an afterthought.
