AI Platform Engineering & MLOps Series · Part 4 of 34

Pure-cloud, on-prem, hybrid, air-gapped

The deployment-context spectrum — four contexts defined across five axes, with a standing warning about recommendations that silently assume pure-cloud.


Every recommendation in this series — about team shape, tooling, GPU strategy, or incident response — bends measurably depending on where your workloads actually run. A recommendation that is silent on deployment context almost always implicitly assumes pure-cloud: elastically provisioned compute, managed services for everything, and billing that stops when no jobs are queued. Before the series goes any further, it needs a shared vocabulary for the four contexts the rest of the articles will navigate.

This article defines those four contexts — pure-cloud, on-prem only, hybrid, and regulated / air-gapped — using the same five axes for each: GPU procurement, data residency, operational burden, vendor lock-in, and regulatory fit. It closes with a standing warning that you will see repeated throughout the series: if an article does not name a deployment context, you should assume it has been written for pure-cloud.

The four contexts

Pure-cloud

All AI/ML workloads — training, experiment tracking, model registry, serving, and monitoring — run on compute and managed services provided by a public hyperscaler (AWS, GCP, Azure) or a specialist cloud provider. The organisation owns no GPU hardware.

GPU procurement: On-demand or spot instances. Idle cost can be driven to zero: with scale-to-zero autoscaling, billing stops when no jobs are queued, which eliminates the idle-GPU tax. The higher per-hour rate is the premium you pay for that elasticity.

Data residency: Constrained by the hyperscaler region and the data-processing agreement you hold with the provider. Sufficient for most regulated industries, but inadequate for workloads that must never transit a third-party network.

Operational burden: Lowest of the four contexts. The vendor manages hardware lifecycle, driver updates, and infrastructure upgrades. Your platform team focuses on configuration and product, not operations.

Vendor lock-in: Highest. Managed-service pipeline APIs do not port across hyperscalers without a substantial rewrite: a training pipeline built on one provider's managed ML service is not portable to another's.

Regulatory fit: Strong for most regulated industries when the hyperscaler holds the applicable authorisations, certifications, and agreements (FedRAMP High, ISO 27001, a HIPAA BAA). Structurally inadequate for requirements that demand a physical air-gap or an absolute prohibition on third-party data processing.

Net-new differences vs. on-prem: elasticity and per-hour pricing replace capital expenditure and amortisation; managed-service API surface accrues over time as a portability liability.

On-prem only

All AI/ML workloads run on hardware the organisation owns and operates. No hyperscaler compute is used. This may be an active cost or sovereignty decision, or an inherited constraint from an existing data-centre investment.

GPU procurement: Owned hardware, amortised over a three-to-five-year lifecycle. Idle cost is sunk — unused GPU capacity costs the same as running capacity. This makes GPU utilisation the primary cost lever: a GPU sitting at 20% utilisation is burning capital at the same rate as one at 90%.
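The utilisation point can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes straight-line amortisation, and the purchase price and lifecycle are made-up placeholders, not figures from this article or any vendor.

```python
HOURS_PER_YEAR = 24 * 365  # calendar hours: owned hardware costs money around the clock


def amortised_cost_per_hour(purchase_price: float, lifecycle_years: float) -> float:
    """Straight-line amortisation of owned hardware over its lifecycle."""
    return purchase_price / (lifecycle_years * HOURS_PER_YEAR)


def cost_per_useful_hour(purchase_price: float, lifecycle_years: float,
                         utilisation: float) -> float:
    """Cost of each hour of real work when the GPU is busy `utilisation` of the time."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return amortised_cost_per_hour(purchase_price, lifecycle_years) / utilisation


if __name__ == "__main__":
    price = 30_000.0  # hypothetical per-GPU purchase price
    years = 4         # hypothetical lifecycle inside the three-to-five-year range
    for u in (0.2, 0.9):
        cost = cost_per_useful_hour(price, years, u)
        print(f"{u:.0%} utilised: {cost:.2f} per useful GPU-hour")
```

Under this model the capital burn per calendar hour is identical in both cases, but each useful hour on the 20% GPU costs 4.5 times as much as on the 90% GPU, whatever the purchase price.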

Data residency: Full sovereignty by definition. Training data never leaves a network you control.

Operational burden: High, second only to air-gapped, which adds compliance overhead on top. Your team owns the full stack: Kubernetes distribution, GPU operators, driver lifecycle, storage, registry, serving runtimes, observability, and governance infrastructure. There are no managed-service fallbacks.

Vendor lock-in: Lowest. A Kubernetes-native open-source stack (scheduler, serving runtime, registry, observability) is portable across hardware and distributions.

Regulatory fit: Strong: training data never leaves the building. You provide your own audit trail rather than relying on a vendor's compliance posture.

The breakeven economics are not fixed. Lenovo's 2026 TCO analysis (LP2368) found that an 8×H100 on-premises configuration reaches breakeven against 3-year reserved cloud pricing at approximately 6,800 hours of sustained utilisation — roughly 9.3 months of continuous operation. That figure shifts with cloud pricing changes; the point is not to memorise a number but to recognise that the crossover exists and moves [4].
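To see why the crossover moves, it helps to write the breakeven down. The sketch below is a deliberately simplified linear model, not Lenovo's methodology: the function names and the cost inputs are assumptions introduced for illustration, and only the 6,800-hour figure comes from the cited analysis.

```python
HOURS_PER_MONTH = 24 * 365 / 12  # 730 calendar hours in an average month


def hours_to_months(hours: float) -> float:
    """Convert hours of continuous operation into average calendar months."""
    return hours / HOURS_PER_MONTH


def breakeven_hours(upfront_cost: float, cloud_rate_per_hour: float,
                    onprem_running_cost_per_hour: float) -> float:
    """Hours of sustained use after which owning is cheaper than renting.

    Owned cost after h hours: upfront_cost + onprem_running_cost_per_hour * h
    Cloud cost after h hours: cloud_rate_per_hour * h
    """
    saving_per_hour = cloud_rate_per_hour - onprem_running_cost_per_hour
    if saving_per_hour <= 0:
        raise ValueError("no breakeven: owning never catches up at these rates")
    return upfront_cost / saving_per_hour


if __name__ == "__main__":
    # The cited figure, converted to months of continuous operation.
    print(f"{hours_to_months(6800):.1f} months")
    # Hypothetical inputs: a lower cloud rate pushes the breakeven further out.
    for rate in (120.0, 100.0):
        print(rate, breakeven_hours(680_000.0, rate, 20.0))
```

Dividing 6,800 hours by 730 hours per month gives about 9.3 months, which matches the cited figure. In the model, any fall in the cloud rate shrinks the hourly saving and pushes the breakeven later, which is the sense in which the crossover moves.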

Net-new differences vs. pure-cloud: autoscaling is bounded by physical capacity; GPU utilisation scheduling (queue management, gang scheduling, fractional sharing) becomes essential rather than optional; the time-to-first-model for a new team is longer because there is no managed service to hand them.

Hybrid

AI/ML workloads are split across on-prem hardware and cloud compute according to cost, data-locality, and latency requirements. The most common pattern: training runs on owned GPU hardware (maximising the amortised asset) and inference runs on cloud (elastic, user-proximate, scales to zero between demand spikes).

GPU procurement: Training on owned hardware; inference on on-demand cloud nodes. The economic logic: training workloads are predictable and sustained (high utilisation of capital asset); inference workloads are spiky and user-proximate (elasticity matters more than per-hour rate).

Data residency: Training data stays on-prem. Model weights may cross the network seam to the cloud-side serving cluster. This weight-transit step must be secured and, in regulated contexts, audited.

Operational burden: High. Two infrastructure planes must be maintained. The model registry must be reachable from both sides of the network seam — either a self-hosted registry accessible from both clusters, or a registry-sync pattern using a GitOps controller. Identity must be federated: a single identity provider must issue tokens trusted by both the on-prem and cloud clusters.
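The two seam requirements named here, a shared identity provider and a registry both sides can use, lend themselves to a preflight check. The sketch below is one possible shape for such a check, not a real tool: `ClusterConfig`, `seam_problems`, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical, simplified view of one side of the network seam."""
    name: str
    trusted_issuers: frozenset[str]  # identity-provider issuers whose tokens are accepted
    registry_url: str                # where this cluster pulls models from


def seam_problems(on_prem: ClusterConfig, cloud: ClusterConfig,
                  registry_sync_enabled: bool = False) -> list[str]:
    """Return the seam requirements that the two clusters fail to meet."""
    problems = []
    # Federation: at least one identity provider must be trusted on both sides.
    if not (on_prem.trusted_issuers & cloud.trusted_issuers):
        problems.append("no identity provider is trusted by both clusters")
    # Registry: either one shared registry, or a sync pattern between two.
    if on_prem.registry_url != cloud.registry_url and not registry_sync_enabled:
        problems.append("clusters use different registries and no sync is configured")
    return problems


if __name__ == "__main__":
    on_prem = ClusterConfig("on-prem", frozenset({"https://idp.example.internal"}),
                            "https://registry.example.internal")
    cloud = ClusterConfig("cloud", frozenset({"https://other-idp.example.com"}),
                          "https://registry.cloud.example.com")
    for problem in seam_problems(on_prem, cloud):
        print(problem)
```

A check like this only inspects configuration. It says nothing about whether the network path across the seam is actually up, which needs its own monitoring.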

Vendor lock-in: Medium. The on-prem side remains portable; the cloud-side serving stack accrues per-hyperscaler technical debt over time.

Regulatory fit: Strong for training data. Cloud-side serving depends on the hyperscaler's compliance posture for the regulation in question.

Hybrid is the modal deployment posture in the industry. Flexera's 2026 State of the Cloud Report found that 73% of organisations now operate hybrid cloud estates, a 3-percentage-point increase year over year, based on a survey of 753 cloud decision-makers globally. Regulated industries skew higher toward hybrid and on-prem; the 73% figure is a cross-industry aggregate [1].

Net-new differences vs. on-prem-only: the network seam introduces a new class of failure (identity federation, registry reachability, weight-transit latency); the cost model becomes two-part (capex for training, opex for inference); the number of infrastructure planes the platform team must maintain doubles.

Regulated / air-gapped

AI/ML workloads must satisfy audit obligations or data-residency requirements that prohibit sensitive data — training data, model artefacts, or telemetry — from transiting outside a defined perimeter. The perimeter may be a physical air-gap, a contractual data-residency boundary, or a regulatory classification boundary covering personal health information, defense technical data, or classified systems.

GPU procurement: Entirely on-prem. Cloud burst requires a persistent network path to a cloud provider — that path must itself pass regulatory scrutiny. For workloads subject to ITAR, FedRAMP High, or equivalent frameworks, cloud burst is typically prohibited or requires explicit authorization that is burdensome to obtain.

Data residency: Absolute. Training data never crosses the perimeter. Model weights may cross only if they carry signed provenance, have passed a security scan, and the destination is on an approved allowlist.
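The three conditions on weight export can be read as a single gate in which every condition must hold. The sketch below shows that reading. It assumes the signature check and the security scan happen elsewhere and arrive as booleans; `WeightArtefact`, `export_blockers`, and the destination names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightArtefact:
    """Hypothetical record for a set of model weights awaiting export."""
    name: str
    has_signed_provenance: bool  # result of a signature check performed elsewhere
    passed_security_scan: bool   # result of a scan performed elsewhere


def export_blockers(artefact: WeightArtefact, destination: str,
                    approved_destinations: set[str]) -> list[str]:
    """Return the reasons an export must be refused; an empty list means it may proceed."""
    blockers = []
    if not artefact.has_signed_provenance:
        blockers.append("missing signed provenance")
    if not artefact.passed_security_scan:
        blockers.append("security scan not passed")
    if destination not in approved_destinations:
        blockers.append("destination not on the approved allowlist")
    return blockers


if __name__ == "__main__":
    allowlist = {"serving-enclave-a"}  # hypothetical approved destination
    weights = WeightArtefact("classifier-v3", has_signed_provenance=True,
                             passed_security_scan=False)
    print(export_blockers(weights, "serving-enclave-a", allowlist))
```

Returning every blocker rather than stopping at the first gives the audit trail a complete record of why an export was refused.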

Operational burden: Highest of all four contexts. Every component must appear on an approved software list and be imported through a controlled process. Container images must come from an internal registry. Python packages must come from an internal mirror. Telemetry may not be sent to any SaaS platform. SaaS-based experiment trackers, model registries, and observability backends are typically prohibited.
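Two of these rules, internal registry only and approved list only, are simple enough to check mechanically. The sketch below is one way to audit a list of container image references. The registry host, image names, and the choice to store approved entries as `repository:tag` strings are assumptions made for illustration.

```python
def audit_images(image_refs: list[str], internal_registry: str,
                 approved: set[str]) -> dict[str, str]:
    """Return {image_ref: reason} for every image that violates the policy.

    `approved` holds "repository:tag" strings relative to the internal registry.
    """
    findings = {}
    prefix = internal_registry.rstrip("/") + "/"
    for ref in image_refs:
        if not ref.startswith(prefix):
            findings[ref] = "not pulled from the internal registry"
        elif ref[len(prefix):] not in approved:
            findings[ref] = "not on the approved software list"
    return findings


if __name__ == "__main__":
    registry = "registry.example.internal"  # hypothetical internal registry host
    approved_list = {"ml/trainer:1.4"}      # hypothetical approved entry
    images = [
        "registry.example.internal/ml/trainer:1.4",  # compliant
        "registry.example.internal/ml/trainer:2.0",  # internal, but not approved
        "docker.io/library/python:3.12",             # external source
    ]
    for ref, reason in audit_images(images, registry, approved_list).items():
        print(f"{ref}: {reason}")
```

The same pattern would apply to Python packages and an internal mirror. A production check would more likely pin image digests than tags, which this sketch does not attempt.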

Vendor lock-in: Lowest for regulated data. SaaS-managed services are typically prohibited, so the stack is open-source and self-operated by default.

Regulatory fit: The only context where full compliance with the strictest data-residency requirements is structurally guaranteed. The platform IS the compliance posture.

The regulatory drivers are real and specific. The EU AI Act (Regulation (EU) 2024/1689) requires high-risk AI systems to record events automatically (Article 12), with the resulting logs retained for at least six months (Article 19), and to document data provenance and governance practices for all training, validation, and testing datasets (Article 10). Sector-specific frameworks go further: ITAR requires that defense-related technical data be accessible only to US persons and remain within approved borders; FedRAMP High requires 421 security controls (the Rev. 4 baseline count); HIPAA requires that protected health information be processed only in environments with appropriate safeguards and business-associate agreements [2][3][5].

Net-new differences vs. on-prem-only: every software component requires explicit vetting and import; the approved software list becomes a first-class operational artefact; telemetry perimeter adds an observability constraint (no cloud-SaaS metrics backends); the audit trail is not just useful — it is mandatory and its format may be specified by the regulator.

Context Spectrum Explorer (interactive component, not reproduced here): select one context to explore its profile, or select two to compare them side by side across all five axes. Rows where the two contexts differ are highlighted, and each row expands to the explanation from the text above. The five axes are:

Axis	What it describes
GPU procurement	How GPUs are acquired and costed
Data residency	Where data lives and who controls it
Operational burden	How much your team must maintain
Vendor lock-in	Portability of your stack
Regulatory fit	Suitability for strict compliance

The spectrum is not a maturity ladder

An organisation can operate at the pure-cloud point indefinitely and at high maturity. Moving toward the air-gapped end is a response to regulatory constraints or cost economics, not a sign of sophistication or operational advancement. The most common migration path is pure-cloud to hybrid, triggered when sustained GPU utilisation passes the point at which owned hardware becomes cheaper than per-hour cloud pricing, or when training-data residency requirements tighten.

Each move along the spectrum adds operational burden. Pure-cloud to hybrid adds a network seam, a second infrastructure plane, and a federated identity requirement. Hybrid to on-prem-only removes the elastic cloud fallback. On-prem-only to air-gapped adds a perimeter, an approved software list, and a mandatory audit trail. These are not free upgrades — each transition requires platform team headcount, tooling investment, and process discipline that the previous context did not demand.

The inverse migration, workload repatriation from cloud to on-prem, is economically rational when sustained GPU utilisation exceeds the breakeven threshold. Lenovo's 2026 analysis places that threshold at roughly 6,800 hours for 8×H100 hardware, approximately 9.3 months of continuous operation, measured against 3-year reserved cloud pricing. Below that threshold, cloud procurement is cheaper; above it, owned hardware amortises more cheaply than reserved-instance pricing [4].

Practical Implication: For teams planning a move from pure-cloud to hybrid, the operational cost of the second infrastructure plane (two control planes, federated identity, registry sync) typically amounts to two to three full-time platform engineers, sometimes more. That headcount cost must be included in the TCO calculation alongside the hardware savings.
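A rough way to fold that headcount into the comparison is sketched below. Every monetary figure is a hypothetical placeholder chosen to show the mechanics; only the two-to-three-engineer range comes from the text.

```python
def net_annual_saving(hardware_saving_per_year: float, extra_engineers: float,
                      loaded_cost_per_engineer: float) -> float:
    """Hardware saving from moving training on-prem, minus the cost of the
    platform engineers needed to run the second infrastructure plane."""
    return hardware_saving_per_year - extra_engineers * loaded_cost_per_engineer


if __name__ == "__main__":
    hardware_saving = 500_000.0  # hypothetical yearly saving versus cloud GPU spend
    engineer_cost = 200_000.0    # hypothetical fully loaded yearly cost per engineer
    for engineers in (2, 3):     # the range given in the text
        net = net_annual_saving(hardware_saving, engineers, engineer_cost)
        print(f"{engineers} engineers: net {net:+,.0f} per year")
```

With these placeholder numbers the move saves money at two engineers and loses money at three, which is why the headcount line cannot be left out of the calculation.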

What bends across contexts: a five-axis summary

The table below is the shared reference vocabulary the rest of the series uses. When a later article says “this recommendation changes in on-prem and air-gapped contexts,” these are the axes it is describing.

Axis	Pure-cloud	On-prem only	Hybrid	Air-gapped
GPU procurement	On-demand / spot	Owned, amortised	Split: owned + cloud	All owned, no burst
Data residency	Region-bound	Full sovereignty	Split: training on-prem	Absolute perimeter
Operational burden	Low	High	High (two planes)	Highest
Vendor lock-in	High	Low	Medium	Lowest
Regulatory fit	Strong (most)	Strong	Strong for training	Only guaranteed
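For readers who want the vocabulary in machine-readable form, the table can be transcribed into a small lookup, with a helper that lists the axes on which two contexts differ. This is a sketch: the cell values are copied from the table above, while the dictionary keys and the function name are choices made here.

```python
AXES = ("GPU procurement", "Data residency", "Operational burden",
        "Vendor lock-in", "Regulatory fit")

# Cell values transcribed from the five-axis summary table.
CONTEXTS = {
    "pure-cloud": ("On-demand / spot", "Region-bound", "Low", "High", "Strong (most)"),
    "on-prem": ("Owned, amortised", "Full sovereignty", "High", "Low", "Strong"),
    "hybrid": ("Split: owned + cloud", "Split: training on-prem",
               "High (two planes)", "Medium", "Strong for training"),
    "air-gapped": ("All owned, no burst", "Absolute perimeter",
                   "Highest", "Lowest", "Only guaranteed"),
}


def differing_axes(a: str, b: str) -> dict[str, tuple[str, str]]:
    """Map each axis on which contexts `a` and `b` differ to their two values."""
    return {
        axis: (value_a, value_b)
        for axis, value_a, value_b in zip(AXES, CONTEXTS[a], CONTEXTS[b])
        if value_a != value_b
    }


if __name__ == "__main__":
    for axis, (left, right) in differing_axes("pure-cloud", "on-prem").items():
        print(f"{axis}: {left} -> {right}")
```

The comparison is a plain string match, so "High" and "High (two planes)" count as different. That is intentional here, since the second plane is exactly what separates hybrid from on-prem on that axis.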

The default-to-pure-cloud warning

The majority of publicly available MLOps content — tooling documentation, blog posts, architecture reference guides — is written by and for teams operating in a pure-cloud context. This is not a criticism: pure-cloud is where most new AI/ML work starts, and cloud-hosted managed services genuinely reduce the operational barrier to getting a first model into production.

The problem arises when pure-cloud-written recommendations are applied without adjustment to on-prem, hybrid, or air-gapped contexts. Specific failure patterns that recur:

›Autoscaling guidance assumes scale-to-zero is available. On-prem, capacity is fixed — autoscaling means queue management, not node provisioning.
›Cost optimisation guidance assumes billing is per-unit-time. On-prem, the cost structure is capex plus operational headcount — utilisation, not billing, is the lever.
›Observability guidance assumes you can send telemetry to a managed backend. Air-gapped contexts require a self-hosted observability stack — every SaaS metric backend is outside the perimeter.
›Model registry guidance assumes SaaS registry access. Hybrid and air-gapped contexts require self-hosted registries reachable from both sides of a network seam, or a registry-sync pattern.
›GPU sharing guidance may assume cloud-specific managed node pools. On-prem, fractional GPU scheduling (time-slicing, MIG, HAMi) is a primary utilisation lever, alongside queue management and gang scheduling, and there is no managed equivalent.
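The failure patterns above can be kept as a checklist keyed by deployment context. The sketch below records only the contexts each bullet names explicitly. The labels are paraphrases, and the structure and function name are assumptions made here, not part of the series' tooling.

```python
# Each pure-cloud assumption, mapped to the contexts the list above names as affected.
PURE_CLOUD_ASSUMPTIONS = {
    "Autoscaling assumes scale-to-zero": {"on-prem"},
    "Cost optimisation assumes per-unit-time billing": {"on-prem"},
    "Observability assumes a managed telemetry backend": {"air-gapped"},
    "Model registry assumes SaaS registry access": {"hybrid", "air-gapped"},
    "GPU sharing assumes cloud-managed node pools": {"on-prem"},
}

KNOWN_CONTEXTS = {"pure-cloud", "on-prem", "hybrid", "air-gapped"}


def flagged_assumptions(context: str) -> list[str]:
    """Return the assumptions the list names as breaking in `context`, sorted by label."""
    if context not in KNOWN_CONTEXTS:
        raise ValueError(f"unknown context: {context!r}")
    return sorted(label for label, contexts in PURE_CLOUD_ASSUMPTIONS.items()
                  if context in contexts)


if __name__ == "__main__":
    for ctx in sorted(KNOWN_CONTEXTS):
        print(ctx, flagged_assumptions(ctx))
```

Because air-gapped environments run entirely on owned hardware, a fuller audit would probably carry the on-prem items over to air-gapped as well. The mapping stays literal to the list so that any such extension is an explicit decision.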

Throughout this series, articles that differ meaningfully across deployment contexts will say so explicitly. When an article covers tooling, team structure, GPU strategy, or observability patterns without naming a context, you should read it as written for pure-cloud and apply your own adjustment for the axes above.

Which recommendations break in your context?

The Assumption Auditor (interactive component, not reproduced here) makes the failure patterns concrete. Select your deployment context; it flags which common architecture recommendations silently assume pure-cloud and how they break, drawn directly from the text above. With pure-cloud selected, all six recommendations work as-is.

Recommendations rated ✕ silently assume pure-cloud. Using them in other contexts without adjustment is a primary source of hidden operational debt.

How this vocabulary is used in the rest of the series

The ML lifecycle article (Part 2) notes where lifecycle-stage tooling diverges for on-prem contexts. The GPU scheduling articles (Part 7) are written primarily for on-prem and hybrid contexts, where GPU utilisation is the primary cost lever. The LLMOps articles (Part 5) note where telemetry and observability differ for air-gapped deployments. The governance and responsible AI article (Part 8) uses the EU AI Act as the worked example for air-gapped regulatory obligations.

If you are operating exclusively in pure-cloud and plan to stay there, the four-context vocabulary still matters: it is the mental model that lets you read on-prem-targeted content and correctly identify which parts apply to you (the patterns) and which do not (the operational specifics).

Standing warning for this series: If an article does not name a deployment context, you should assume it has been written for pure-cloud and apply your own adjustment for the five axes above.

References

[1] Flexera. “2026 State of the Cloud Report.” Flexera Research, 2026. Survey of 753 cloud decision-makers globally. 73% hybrid cloud figure.
[2] EU AI Act. “Article 10: Data and Data Governance.” European Parliament and Council, Regulation (EU) 2024/1689. Requires provenance and governance documentation for all training, validation, and testing datasets used in high-risk AI systems.
[3] EU AI Act. “Article 12: Record-Keeping.” European Parliament and Council, Regulation (EU) 2024/1689. High-risk AI systems must support automatic event logging; Article 19 of the same regulation requires those logs to be retained for at least six months.
[4] Lenovo. “On-Premise vs Cloud: Generative AI Total Cost of Ownership (2026 Edition).” Lenovo Press, LP2368, February 2026. Breakeven at ~6,800 hours (~9.3 months) against 3-year reserved cloud pricing for 8×H100-class hardware.
[5] Paramify. “FedRAMP vs. ITAR: Key Differences and Compliance Considerations.” 2024. ITAR data residency and US-persons requirements for defense technical data.
[6] Expanso. “Data Residency Requirements: A Complete Guide for Distributed Teams.” 2024. Overview of HIPAA, GDPR, ITAR, and FedRAMP data-residency constraints.
