Four organisational patterns for shipping ML — and when each one breaks

[Figure: The four ML org patterns and their interaction shapes]
How you organise the people who ship ML systems determines what kind of ML system you build. That is not a management platitude — it is Conway's Law, stated in 1968 and confirmed repeatedly in ML contexts since: any organisation that designs a system will produce a design that mirrors its own communication structure. A centralised MLOps team produces a centralised pipeline. A loosely coupled federation of product teams produces independently-owned pipelines that share almost nothing. Getting the org pattern right before scaling up model production is one of the highest-leverage decisions a platform lead can make, because the wrong pattern does not just slow you down — it creates technical debt that is sociological, not just architectural.
Conway's original formulation, "organisations which design systems are constrained to produce designs which are copies of the communication structures of these organisations", was published in Datamation in 1968 [1]. The ML-systems literature has since documented the specific manifestation: Sculley et al.'s 2015 NeurIPS paper on hidden technical debt in ML systems [6] identifies glue code, configuration debt, undeclared consumers, and hidden feedback loops as sources of that debt. These failure modes emerge most readily when the team that trains a model and the team that serves it are not the same and have no well-defined handoff contract.
This article covers four patterns that recur consistently in engineering case studies and the academic literature. For each pattern, the analysis covers: the staffing shape, the consumer experience, the most common failure mode, and the maturity level at which it fits. A comparison table follows. The article closes with the match between org pattern and deployment context — because mismatching an org pattern with the wrong infrastructure strategy reliably produces a ticket queue rather than a platform.
The four patterns at a glance
Before the details: the four patterns are not a maturity ladder where each one supersedes the last. They are different bets about where ML knowledge and operational accountability should sit. The right choice depends on team size, model volume, regulatory posture, and how much infrastructure complexity the platform team is prepared to absorb on behalf of consumers.
Pattern 1 — Centralised MLOps team
Staffing shape
A dedicated team — typically three to eight engineers — owns the full ML lifecycle for all model-producing teams in the organisation. They run the infrastructure, set the standards, operate the pipelines, and often act as embedded consultants on individual projects. Model-producing product teams hand work off to this team; they do not own their own ML infrastructure.
Consumer experience
Consistent. One set of standards, one platform, one team to call when something breaks. For a product team that does not want to think about infrastructure, this is the lowest-friction starting point. The cost of that simplicity is throughput: the central team's capacity is the pipeline bottleneck.
Failure mode
Queue formation. As the number of model-producing teams grows, the central team becomes a single point of failure and a sprint-priority battleground. Product teams learn dependency rather than self-sufficiency. When the two most senior engineers on the central team leave, institutional memory for every production model leaves with them.
Maturity fit
Best at low ML-team count (fewer than five model-producing teams) or where a single accountable operator is a regulatory requirement. Organisations in regulated industries — financial services, healthcare, defence — sometimes mandate a single accountable owner for all model production decisions; the centralised pattern is structurally the best fit for that constraint. It is also the natural starting point before an organisation has enough ML team surface to justify the overhead of Pattern 3.
Pattern 2 — Embedded MLOps engineer
Staffing shape
One or two MLOps engineers sit inside each ML-producing product team, accountable to that team's delivery metrics rather than a central function. A shared platform team may still exist, but the embedded engineers are the primary lifecycle operators.
Consumer experience
Fast. No central bottleneck. The embedded engineer knows the product's models personally, reduces handoff friction to near zero, and can move at the product team's sprint cadence. The team owns its own infrastructure pager — which is an on-call recruiting and retention cost for small teams.
Failure mode
Shadow infrastructure. When five product teams each have embedded engineers and no shared platform, each team builds its own monitoring stack, its own model registry configuration, its own CI gate — and they disagree with each other in subtle and expensive ways. Algorithmia's 2021 enterprise ML survey of 403 organisations found that 64% take a month or longer to deploy a model to production; a significant portion of that friction traces back to teams independently reinventing deployment plumbing rather than sharing it [3].
The structural risk is that embedded engineers, accountable to product delivery, systematically under-invest in platform-quality practices — testing, observability, retraining loops — in favour of shipping features. The pattern works best in combination with Pattern 3, where a shared platform team provides the rails and the embedded engineers stay on them.
Maturity fit
Organisations with large, autonomous product teams and high model volume. The pattern is particularly effective when a platform team provides golden paths that the embedded engineers stay on. Standalone (without shared rails), it is a transitional state that accumulates shadow infrastructure debt as the organisation scales.
Pattern 3 — Federated (platform team owns rails, product teams own paths)
Staffing shape
A platform team of two to six engineers owns and operates the shared infrastructure — the compute substrate, the registry, the serving platform, the CI/CD templates. Product teams own their own model code, pipelines, and deployment configurations. The relationship is explicitly producer-consumer, with a published API surface between them. In Team Topologies terms [2], this is the platform-team topology operating in X-as-a-Service interaction mode — the platform provides capability through a well-defined interface, minimising collaboration overhead on both sides.
Consumer experience
High velocity within the rails, with the cognitive load of understanding those rails. Teams are self-service for the common cases. The platform team is not in the critical path for routine model deployments — only for infrastructure changes. On-call is split: the platform team owns the shared infrastructure pager; product teams own the per-model pager.
Failure mode
Platform built for the average use case. Edge cases force teams off the paved path, and once a team is off-path, the platform team's leverage disappears. The second failure mode is treating the platform as infrastructure (best-effort, with breaking changes shipped without deprecation cycles) rather than as a product with user research, documented APIs, and a feedback loop; the CNCF Platforms Working Group whitepaper [5] identifies this product mindset as the critical differentiator. DORA's 2023 State of DevOps research found that teams with a strong user focus achieve 40% higher organisational performance than teams without it [4]; for a platform team, the users are the developers it serves. The federated pattern either succeeds by internalising this principle or fails by ignoring it.
Maturity fit
Most organisations above approximately ten ML engineers, or with three or more model-producing product teams. This is the pattern described in Skelton and Pais as the platform-team topology, and is the most commonly recommended pattern in the literature for ML teams at scale. It is the natural evolution from Pattern 1 once the central-team queue becomes the primary delivery constraint.
Pattern 4 — Platform team as MLOps (lifecycle as a platform capability)
Staffing shape
The platform team does not just own the infrastructure — it owns the entire lifecycle. Product teams hand off model code and data; the platform team runs training, evaluation, registry promotion, serving, and monitoring as managed services. Product teams consume versioned inference endpoints. Platform team headcount is six to twelve engineers for every ten consuming teams.
Consumer experience
Very low cognitive load for product teams. They own a model API, not a pipeline. They do not need ML infrastructure expertise. Onboarding a new product team is fast — the platform team handles the operational surface entirely.
Failure mode
The platform team becomes the ML department, and product teams lose their ability to experiment. Innovation velocity collapses because every model change requires platform-team scheduling. The platform team owns every production incident, including model quality degradations that require business-domain knowledge the platform team does not have. In practice, Pattern 4 creates an escalation path where the business-domain team and the platform team jointly diagnose failures that neither can fully understand alone.
Maturity fit
Regulated environments where a single accountable operator is mandated, and very small organisations that lack ML expertise in product teams. Financial services and healthcare contexts sometimes structurally require this pattern: the risk function demands a named team accountable for every model in production. In non-regulated contexts, Pattern 4 is rarely the right long-term choice — it is a starting point for organisations that cannot yet staff distributed ML ownership.
The four-pattern comparison table
The table below compares the four patterns across the dimensions that matter most to a platform lead making the initial structure decision.
| Dimension | Centralised | Embedded | Federated | Platform-as-MLOps |
| --- | --- | --- | --- | --- |
| Platform headcount per 10 consuming teams | 4–8 | 1–2 (advisory) | 2–6 | 6–12 |
| Infrastructure on-call | Platform team | Each product team | Split: shared → platform; per-model → team | Platform team |
| Model quality on-call | Platform + product jointly | Product team | Product team | Platform team (with escalation to product) |
| Primary failure mode | Queue formation; single point of failure | Shadow infrastructure; 5 teams build 5 monitoring systems | Platform built for the average case; edge cases produce off-path teams | Platform becomes the ML department; product teams lose ability to experiment |
| Best maturity fit | Initiation (< 5 ML teams) or regulated mandate | Scaling (combined with shared rails) | Scaling to operating; most ML teams above 10 engineers | Initiation (no distributed ML ownership) or regulated mandate |
| Consumer experience | Consistent; slow | Fast; inconsistent | Fast within rails | Low cognitive load; low autonomy |
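The headcount row of the comparison table can be turned into a rough planning calculation. The sketch below is illustrative only: it assumes the per-ten-teams ranges scale linearly with the number of consuming teams, which the table itself does not claim, and the function name and pattern keys are hypothetical.

```python
# Platform headcount per 10 consuming teams, copied from the comparison table.
HEADCOUNT_PER_10_TEAMS = {
    "centralised": (4, 8),
    "embedded": (1, 2),  # advisory platform staff only
    "federated": (2, 6),
    "platform_as_mlops": (6, 12),
}


def _ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-numerator // denominator)


def estimate_platform_headcount(pattern: str, consuming_teams: int) -> tuple[int, int]:
    """Return a (low, high) headcount range, rounded up to whole engineers.

    Assumption (not from the article): the table's ranges scale linearly
    with the number of consuming teams.
    """
    if consuming_teams < 0:
        raise ValueError("consuming_teams must be non-negative")
    low, high = HEADCOUNT_PER_10_TEAMS[pattern]
    return _ceil_div(low * consuming_teams, 10), _ceil_div(high * consuming_teams, 10)


if __name__ == "__main__":
    print(estimate_platform_headcount("federated", 25))
```

Under the linear assumption, 25 consuming teams on the federated pattern gives a range of 5 to 15 platform engineers. Real staffing is unlikely to be this tidy, so treat the output as a starting point for a conversation rather than a budget.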
The on-call question is the diagnostic
None of these four patterns gives a clear default answer to the question: who answers the pager at 2 a.m. for a model quality degradation in production? Naming that answer explicitly, before the incident happens, is more important than which pattern an organisation chooses. A model quality degradation is not a pure infrastructure problem: it requires business-domain knowledge (what does a 5% drop in precision mean for this product?), model knowledge (which feature distribution has drifted?), and infrastructure knowledge (is the pipeline serving stale features, or is there a genuine model regression?). In Patterns 2 and 3 the three knowledge types sit in different teams. In Patterns 1 and 4 the pager sits with the central or platform team, but the business-domain knowledge stays with the product team, which is why both patterns need an explicit escalation path.
Sculley et al. name this as the organisational dimension of ML technical debt [6]: misalignment between the team that trains a model and the team that serves it creates glue code, configuration debt, undeclared consumers, and hidden feedback loops. The on-call answer is a forcing function that reveals this misalignment before it becomes debt.
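One way to name the on-call answer before the incident happens is to write it down as data. The following sketch transcribes the two on-call rows of the comparison table into a lookup; the class, dictionary, and function names are hypothetical, and a real organisation would substitute its own team names and rotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PagerOwners:
    """Who is paged for each incident kind under one org pattern."""
    infrastructure: str
    model_quality: str


# Transcribed from the "Infrastructure on-call" and "Model quality on-call"
# rows of the comparison table.
PAGER_OWNERS = {
    "centralised": PagerOwners("platform team", "platform and product teams jointly"),
    "embedded": PagerOwners("product team", "product team"),
    "federated": PagerOwners(
        "platform team for shared infrastructure; product team per model",
        "product team",
    ),
    "platform_as_mlops": PagerOwners(
        "platform team", "platform team, escalating to product team"
    ),
}


def pager_owner(pattern: str, incident_kind: str) -> str:
    """Return the owner for 'infrastructure' or 'model_quality' incidents."""
    if incident_kind not in ("infrastructure", "model_quality"):
        raise ValueError(f"unknown incident kind: {incident_kind!r}")
    return getattr(PAGER_OWNERS[pattern], incident_kind)


if __name__ == "__main__":
    print(pager_owner("federated", "model_quality"))
```

Keeping the mapping in version control gives the answer a reviewable history, so a change of pager ownership becomes an explicit decision rather than something discovered during an incident.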
Matching the org pattern to the deployment context
The four deployment contexts — pure-cloud, on-prem, hybrid, regulated/air-gapped — change which org pattern is viable. The infrastructure strategy determines the platform's shape; the org pattern determines who runs it. They must be consistent.
In pure-cloud, the managed-service layer reduces the operational burden enough that Pattern 3 (federated) is viable at relatively small team sizes — a two-person platform team can operate a functional shared ML platform on top of managed training, managed registry, and managed serving services. On-prem or hybrid deployments require more platform-team headcount to operate the same capability surface, which pushes earlier-stage organisations toward Pattern 1 (centralised) until they can staff the platform team adequately.
Regulated and air-gapped environments introduce a structural constraint that overrides the maturity-based recommendation: if a single accountable operator is mandated by the regulatory framework, the choice is between Pattern 1 and Pattern 4 regardless of team size. The difference is the degree of product-team autonomy the regulator permits.
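The rules in this section can be summarised as a small decision function. This is a sketch under stated assumptions, not a definitive selector: the function name and the `platform_staffed` flag are hypothetical, and treating an unmandated regulated/air-gapped environment like on-prem is an assumption the text does not make explicitly.

```python
DEPLOYMENT_CONTEXTS = {"pure-cloud", "on-prem", "hybrid", "regulated/air-gapped"}


def candidate_patterns(
    context: str,
    single_operator_mandated: bool,
    platform_staffed: bool,
) -> list[str]:
    """Return the org patterns this section treats as viable.

    platform_staffed is a hypothetical flag meaning the organisation can
    staff a platform team adequately for its infrastructure.
    """
    if context not in DEPLOYMENT_CONTEXTS:
        raise ValueError(f"unknown deployment context: {context!r}")

    # A regulatory mandate overrides the maturity-based recommendation.
    if single_operator_mandated:
        return ["centralised", "platform_as_mlops"]

    # Managed services make the federated pattern viable at small team sizes.
    if context == "pure-cloud":
        return ["federated"]

    # On-prem and hybrid need more platform headcount; until it is staffed,
    # the section points earlier-stage organisations at the centralised pattern.
    # Assumption: regulated/air-gapped without a mandate is treated the same way.
    return ["federated"] if platform_staffed else ["centralised"]


if __name__ == "__main__":
    print(candidate_patterns("hybrid", False, False))
```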
ThoughtWorks' Continuous Delivery for Machine Learning framework [7] names the cross-discipline coordination requirement directly: shipping ML in production requires data engineering, data science, testing, infrastructure engineering, and release engineering to work together. The org pattern is the mechanism that structures that coordination. A mismatch, such as a federated infrastructure run by a centralised MLOps team, produces a ticket queue rather than a platform, because every product team's infrastructure request lands in the central team's backlog.
The maturity-level thresholds
Each pattern is a good fit at a specific maturity level, and a poor fit above or below it. These threshold observations connect forward to the maturity-model article in this series, which covers the five-level progression in detail.
Pattern 1 (Centralised) fits maturity levels 1 and 2 — the initiation and repeatable stages. At level 1, there is no shared vocabulary, no shared tooling, and the central team's ability to impose consistency is a feature, not a constraint. At level 2, the central team has established standards and the pipeline is operational, but the organisation has not yet scaled model production to the point where the queue is the primary bottleneck.
Pattern 3 (Federated) becomes the right answer at maturity levels 3 and 4 — managed and governed. At level 3, the platform has enough surface area to be operated as a product with defined APIs and deprecation contracts. At level 4, the platform is governed: cost attribution, access control, and audit trails are first-class platform capabilities that the federated model surfaces to product teams through self-service interfaces.
Pattern 2 (Embedded) is a durable hybrid at all maturity levels, but only in combination with a shared platform team providing rails. Standalone, it is a transitional state that accumulates shadow infrastructure debt as the organisation matures.
Pattern 4 (Platform-as-MLOps) fits level 1 in non-regulated contexts (organisations that have not yet distributed ML ownership) and is a permanent fit in heavily regulated contexts where the constraint is structural rather than maturity-based.
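The maturity thresholds above can be encoded as a lookup, which makes it easier to check a proposed structure against them. The sketch below is an illustration only: the function name and flags are hypothetical, and level 5 is accepted but left unnamed because this article does not name it.

```python
# Level names as given in this section; level 5 is not named in the text.
LEVEL_NAMES = {1: "initiation", 2: "repeatable", 3: "managed", 4: "governed"}


def patterns_for_level(
    level: int,
    shared_rails: bool = True,
    regulated: bool = False,
) -> list[str]:
    """Return the patterns this section describes as fitting a maturity level."""
    if not 1 <= level <= 5:
        raise ValueError("level must be between 1 and 5")

    fits = []
    if level in (1, 2):
        fits.append("centralised")
    # Embedded fits every level, but only alongside a shared platform team.
    if shared_rails:
        fits.append("embedded")
    if level in (3, 4):
        fits.append("federated")
    # Platform-as-MLOps fits level 1, or any level when regulation requires it.
    if regulated or level == 1:
        fits.append("platform_as_mlops")
    return fits


if __name__ == "__main__":
    for lvl in range(1, 6):
        print(lvl, LEVEL_NAMES.get(lvl, "unnamed"), patterns_for_level(lvl))
```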
What the previous article established, and what comes next
The ML lifecycle article in this series established what work happens at each stage of a production ML system. This article has established who owns that work in four structural arrangements. The next article in Part 2 covers FinOps for AI: once you know who owns the ML lifecycle, the question becomes how to attribute and control the cost of running it — the showback-to-chargeback ladder, the label scheme, and the unit economics of training and inference.
The org pattern also shapes the maturity-model trajectory. An organisation that starts at Pattern 1 and does not plan the transition to Pattern 3 will find the centralised team becomes a permanent bottleneck rather than a temporary scaffold. Planning the transition — naming the signals that trigger it (queue depth, team count, model volume) — is part of the initial org-pattern decision, not an afterthought.
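Naming the transition signals is easier when they are written as explicit thresholds. In the sketch below, only the team-count limit comes from this article (the centralised pattern fits fewer than five model-producing teams); the queue-depth and model-volume limits are hypothetical placeholders that each organisation would need to set for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransitionThresholds:
    """Limits above which a move from Pattern 1 to Pattern 3 should be planned."""
    max_queue_depth: int = 20        # hypothetical: open requests waiting on the central team
    max_model_teams: int = 4         # from the article: centralised fits fewer than five teams
    max_production_models: int = 30  # hypothetical: models the central team can operate


def tripped_signals(
    queue_depth: int,
    model_teams: int,
    production_models: int,
    limits: TransitionThresholds = TransitionThresholds(),
) -> list[str]:
    """Return the names of the transition signals that exceed their limits."""
    checks = [
        ("queue depth", queue_depth, limits.max_queue_depth),
        ("team count", model_teams, limits.max_model_teams),
        ("model volume", production_models, limits.max_production_models),
    ]
    return [name for name, value, limit in checks if value > limit]


if __name__ == "__main__":
    print(tripped_signals(queue_depth=5, model_teams=6, production_models=10))
```

An empty result means no signal has tripped yet. Any non-empty result is a prompt to start the transition plan, not an instruction to reorganise immediately.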
References
[1] Conway, Melvin E. "How Do Committees Invent?" Datamation, Vol. 14, No. 4, April 1968, pp. 28–31.
About the Author

asleekgeek
Senior Developer, Architect, DevOps
Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist