Where this goes next — the open problems on a 2026 AI platform

Six gaps that define the frontier of AI platform engineering in 2026
This is the last article in the series. The previous 33 articles built a working vocabulary from first principles — what MLOps is, what an AI platform team owns, how training and serving workloads differ, how LLMOps extends the classical lifecycle, how Kubernetes schedules GPU workloads, and how Dynamic Resource Allocation rewrites the scheduler contract. This article does not introduce new concepts. It names the problems the series could not resolve.
Each gap below is open in a specific sense: a published source establishes the problem, no production-grade general solution exists as of mid-2026, and the gap has direct operational consequences for a platform team. These are not speculative research directions — they are the places where platform teams currently write custom code, ship workarounds, or accept known technical debt. Naming them is more useful than leaving them as undocumented assumptions.
Gap 1 — Eval harnesses do not emit confidence intervals
The article on eval as a test suite established that LLM-as-judge evaluation is structurally probabilistic: the same input can receive different scores on different evaluation runs because judge LLMs are sampled, not deterministic. This creates a requirement for confidence interval tracking on every eval run — not just a mean score, but a mean and a variance.
As of mid-2026, no widely adopted open-source eval harness emits confidence intervals as a first-class output. Tools such as Promptfoo, DeepEval, and Inspect AI report mean scores; they do not natively report variance across multiple judge-evaluation runs for the same test case. The Judge Reliability Harness (RAND, 2026) is an early research prototype that aggregates pass rates, confidence intervals, and cost curves into standardised reports, but it is a validation tool for judges themselves, not a drop-in replacement for a CI eval harness. The practical implication for platform teams: run each evaluation prompt N times (typically three to five), compute the mean and sample standard deviation manually, and accept the additional inference cost. With N this small, the standard deviation describes spread rather than a confidence interval; an interval also needs the standard error and a Student-t critical value.
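The manual workaround can be sketched in a few lines of Python. This is a minimal illustration, not the API of any harness named above: `judge` is a hypothetical callable that returns one numeric score per call, and the 95% Student-t critical values are hard-coded so the sketch needs only the standard library.

```python
import statistics
from typing import Callable, Sequence

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def summarize_scores(scores: Sequence[float]) -> dict:
    """Return mean, sample standard deviation and a 95% t-interval for one test case."""
    n = len(scores)
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch supports between 2 and 10 scores")
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_95[n - 1] * stdev / n ** 0.5  # t * standard error
    return {"n": n, "mean": mean, "stdev": stdev,
            "ci95": (mean - half_width, mean + half_width)}


def evaluate_case(judge: Callable[[str], float], prompt: str, runs: int = 5) -> dict:
    """Call the (hypothetical) judge `runs` times on the same prompt and summarise."""
    return summarize_scores([judge(prompt) for _ in range(runs)])


def gate(summary: dict, threshold: float) -> bool:
    """Pass only if the lower bound of the interval clears the threshold."""
    return summary["ci95"][0] >= threshold
```

Gating on the lower bound rather than the mean is one possible design choice: a test case whose mean clears the bar but whose interval straddles it is reported as a failure instead of a lucky pass.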
The gap has a second dimension. Conformal prediction methods have been proposed to produce distribution-free confidence sets for LLM outputs, but these require held-out calibration sets and are not yet integrated into any major eval framework. Until confidence intervals are a first-class output, eval pass/fail gates in CI are single-sample estimates masquerading as measurements.
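For readers unfamiliar with the technique, split conformal prediction can be sketched as follows. The setup is an assumption made for illustration, not taken from any eval framework: the calibration set holds absolute residuals between judge scores and trusted human labels, and the resulting quantile widens each new judge score into an interval. Under exchangeability of calibration and test cases, that interval covers the true label with probability at least 1 − alpha.

```python
import math
from typing import Sequence, Tuple


def conformal_quantile(residuals: Sequence[float], alpha: float = 0.1) -> float:
    """Split-conformal quantile of calibration residuals |judge score - human label|."""
    n = len(residuals)
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest residual.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        raise ValueError("calibration set too small for this alpha")
    return sorted(residuals)[rank - 1]


def prediction_interval(judge_score: float, q: float) -> Tuple[float, float]:
    """Interval around a new judge score, using the calibrated quantile q."""
    return (judge_score - q, judge_score + q)
```

The `ValueError` branch makes the calibration requirement concrete: with three calibration cases and alpha = 0.1 there is no valid quantile, which is why held-out calibration data is a hard prerequisite rather than an optional refinement.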
Gap 2 — Agent runtime standards are fragmented
The article on the MLOps-to-LLMOps seam named agent runtime as one of the genuinely new LLMOps concerns. The problem is that three distinct protocols now compete for the same runtime surface, and production agent systems must either implement several of them or accept fragmentation.
The Model Context Protocol (MCP; Anthropic, 2024) standardises tool invocation: how an agent discovers and calls external tools through authenticated, schema-based interfaces. The Agent2Agent (A2A) protocol (Google, 2025) standardises agent-to-agent delegation: how one agent instructs another to execute a subtask. The Agent Communication Protocol (ACP; Linux Foundation, 2025) standardises agent state and message passing for longer-running workflows. These protocols are complementary in design but not yet unified in practice: a production multi-agent pipeline must bridge all three if its components originate from different ecosystems.
A 2025 survey of AI agent protocols catalogued the fragmentation and noted that no single protocol covers the full surface of tool invocation, agent delegation, and persistent state management. The convergence path is active research. For platform teams, the practical consequence is that agent observability — tracing a request through a multi-agent graph — requires instrumenting multiple protocol boundaries, each with different tracing conventions.
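One way to instrument several protocol boundaries consistently is to carry a single trace identifier across every hop. The sketch below uses the W3C Trace Context `traceparent` string format (version, 32-hex trace ID, 16-hex parent span ID, flags). That each of the three protocols can carry such a value in its request metadata is an assumption here, not something their specifications are claimed to guarantee, and the function names are hypothetical.

```python
import re
import secrets

# version "00" - 32 hex trace-id - 16 hex parent-id - 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace at the entry point of the agent graph (sampled flag set)."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """Derive the value to send on the next hop: same trace ID, fresh span ID."""
    match = _TRACEPARENT.match(parent)
    if match is None:
        raise ValueError(f"not a valid traceparent: {parent!r}")
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(traceparent: str) -> str:
    """Extract the trace ID so spans from different protocols can be joined."""
    match = _TRACEPARENT.match(traceparent)
    if match is None:
        raise ValueError(f"not a valid traceparent: {traceparent!r}")
    return match.group(1)
```

With this in place, a tool call, a delegation and a state message that belong to the same user request all share one trace ID, whatever tracing convention each boundary uses internally.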
Gap 3 — GPU-side data provenance is not yet attestable end-to-end
The article on supply chain security for ML covered model artefact signing and SBOM generation. The deeper open problem is one level back: the provenance of the data and compute that produced the model artefact, specifically on the GPU-accelerated training side.
The SLSA framework established supply-chain levels for software build provenance, and work such as the Atlas framework (Intel Labs, 2025) proposes extending SLSA and in-toto specifications to ML pipeline operations — capturing verifiable records of which datasets, transformations, and training runs contributed to a given model checkpoint. The gap is that GPU-executed training steps are difficult to attest: the training process is a long-running, stateful computation across many accelerators, not a discrete build step with a deterministic output.
Two sub-problems compound this. First, non-determinism in distributed training (due to floating-point accumulation order, NCCL all-reduce scheduling, and gradient checkpointing decisions) means the same dataset and code do not reproducibly produce the same checkpoint — so a hash of inputs does not reliably predict a hash of outputs. Second, current ML observability tooling captures training metrics (loss curves, throughput) but not the provenance chain regulators and auditors need: which exact dataset version, which data processing code at which commit, which hardware topology, which random seeds. Atlas is a research proposal; as of mid-2026 there is no production-hardened tool that generates SLSA-level attestations for GPU training runs.
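Until such tooling exists, a team can at least capture the fields listed above in a structured record at the end of each run. The sketch below is illustrative only: the field names are hypothetical and the output is not an in-toto or SLSA attestation. Because training is non-deterministic, the checkpoint digest is recorded as an observed output rather than derived from the inputs.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainingProvenance:
    """Hypothetical per-run record of what produced a checkpoint."""
    dataset_version: str           # e.g. a dataset registry tag
    dataset_sha256: str            # digest of the dataset manifest
    code_commit: str               # commit of the data processing and training code
    hardware_topology: str         # e.g. "8 nodes x 8 GPUs"
    random_seeds: Tuple[int, ...]  # one seed per source of randomness
    checkpoint_sha256: str         # observed output digest, not predictable from inputs


def sha256_bytes(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def record_digest(record: TrainingProvenance) -> str:
    """Digest of the record itself, over a canonical JSON serialisation."""
    canonical = json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))
    return sha256_bytes(canonical.encode("utf-8"))
```

The record digest is what a signing step would cover; the signing and verification machinery that an attestation framework adds is deliberately out of scope for this sketch.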
Gap 4 — Regulator-readable lineage does not yet exist as a product
The article on governance, lineage, and model cards described what a mature governance layer looks like. The operational gap is between what regulators require and what current tooling produces.
Article 12 of the EU AI Act (Regulation (EU) 2024/1689) requires providers of high-risk AI systems to implement automatic logging that allows regulators to verify that systems operated in accordance with their intended purpose. A 2026 review of audit trail requirements for LLMs identified a persistent gap: current ML observability tooling produces human-readable dashboards and alert histories, not tamper-evident, machine-readable logs that connect a specific runtime decision to the governing policy and the model version that produced it. Regulators wanting to verify conformity assessments need logs they can query; what most platforms produce is logs they can visualise.
The practical consequence is that regulated-industry platform teams must build a bespoke lineage layer that maps from MLflow run IDs, Kubernetes pod logs, and inference server access logs into a structured, timestamped, append-only record that a compliance team can hand to an auditor. This is hand-rolled at every organisation that has done it. It is the most expensive undifferentiated work in regulated AI platform engineering, and it will remain so until either a framework ships this as a first-class feature or the regulatory community specifies a machine-readable format that tooling vendors can target.
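The core of such a lineage layer is small. The sketch below shows one common way to make an append-only record tamper-evident, a hash chain in which every entry commits to its predecessor. It is a minimal in-memory illustration under stated assumptions: the payload keys are hypothetical, and a real deployment would also need durable storage and an externally anchored head hash, which are not shown.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first entry


def _digest(prev_hash: str, payload: Dict[str, str]) -> str:
    """Hash of an entry: covers both its payload and the previous entry's hash."""
    body = json.dumps({"prev": prev_hash, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class LineageLog:
    """In-memory, append-only, hash-chained log of lineage records."""

    def __init__(self) -> None:
        self.entries: List[dict] = []

    def append(self, payload: Dict[str, str]) -> str:
        """Add a record (e.g. timestamp, MLflow run ID, pod, model version, policy ID)."""
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        entry_hash = _digest(prev_hash, payload)
        self.entries.append({"prev": prev_hash, "payload": dict(payload), "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited, removed or reordered entry breaks it."""
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["payload"]):
                return False
            prev_hash = entry["hash"]
        return True
```

An auditor who holds only the latest hash can confirm that nothing earlier in the log was altered, which is the property dashboards and alert histories do not provide.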
Gap 5 — The device-plugin-to-DRA migration has no production landing yet
The article on DRA and the future of GPU scheduling described the architecture shift from the legacy device plugin model to Dynamic Resource Allocation (DRA). The DRA API reached GA in Kubernetes v1.34 (released in late August 2025). The gap is on the driver side.
NVIDIA's DRA driver for Kubernetes (k8s-dra-driver-gpu) states in its README that GPU allocation features "can be tried out" but "are not yet officially supported". DRA support for NVIDIA Inference Microservices (NIM) is classified as Technology Preview, which means it is not suitable for production deployment. The result is an implementation gap rather than a standards gap: the Kubernetes API is stable and the protocol is defined, but the dominant GPU vendor's production driver does not yet implement it fully.
Platform teams face a choice: migrate to DRA now and accept driver-level risk, or stay on device plugins and accept that the scheduler improvements DRA enables (topology-aware allocation, partitionable devices, dynamic MIG repartitioning) are unavailable. The second option is the dominant one in production clusters as of mid-2026. This gap will close when the vendor driver reaches supported status and the Kueue DRA integration (currently at alpha for DRAExtendedResources) graduates, but the timeline is not fixed. The practical recommendation in earlier articles in this series stands: adopt DRA in staging for new workload types; do not migrate existing device-plugin-based production pipelines until driver support is official.
Gap 6 — Multi-cluster GPU quota federation is not yet closed
The article on multi-tenancy and fairness covered per-cluster quota management with Kueue. The open problem is one level above: borrowing quota across clusters so a job admitted to one cluster can draw on idle GPU capacity in another.
Kueue's MultiKueue feature promoted batch job dispatching to GA in February 2026 — a meaningful step. However, the broader capability of multi-cluster quota borrowing (admitting a job in cluster A against quota headroom in cluster B) remains a 2026 roadmap priority, not a shipped stable feature. The Kueue project's 2026 plans list "improve user experience for MultiKueue" as a high-level priority, which signals active development but not a closed gap.
For hybrid platform architectures — on-prem GPU clusters for long training runs alongside cloud-managed Kubernetes for burst inference — the inability to pool quota across the boundary means each cluster has its own admission controller and its own queue backlog. Unused GPU capacity in one cluster cannot be borrowed by a job waiting in another. Organisations work around this with cluster-level federation (a separate service that watches queue depths and reschedules jobs across clusters), but this is bespoke automation. The gap closes when MultiKueue ships stable cross-cluster quota borrowing with documented failure semantics.
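The decision logic inside such a bespoke federation service is often little more than a placement rule over observed queue state. The sketch below is a hypothetical illustration and uses no Kueue or MultiKueue API: `ClusterState` is assumed to be populated by whatever watches each cluster, and the rule is simply "stay home if possible, otherwise pick the least backlogged cluster with enough idle GPUs".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ClusterState:
    """Hypothetical snapshot of one cluster, as seen by the federation service."""
    name: str
    idle_gpus: int
    pending_jobs: int


def pick_target(job_gpus: int, home: str, clusters: Sequence[ClusterState]) -> Optional[str]:
    """Return the cluster a waiting job should run on, or None to keep it queued."""
    by_name = {c.name: c for c in clusters}
    if by_name[home].idle_gpus >= job_gpus:
        return home  # no need to move the job
    candidates = [c for c in clusters if c.name != home and c.idle_gpus >= job_gpus]
    if not candidates:
        return None  # nowhere has capacity; leave the job in its home queue
    # Prefer the shortest backlog, then the most idle GPUs, then name for determinism.
    best = min(candidates, key=lambda c: (c.pending_jobs, -c.idle_gpus, c.name))
    return best.name
```

What the sketch leaves out is exactly what makes the problem open: it has no notion of quota ownership, preemption, or what happens when the snapshot is stale or a cluster becomes unreachable mid-move.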
The thread connecting these gaps
Each of these six gaps shares a structural feature: the standards layer exists, or is emerging, but the production implementation lags. DRA's Kubernetes API is stable; the GPU driver is not. The EU AI Act's logging requirements are law; the tooling that satisfies them is not. Agent protocols are specified; a unified runtime is not. Eval harnesses are mature; confidence interval reporting is not. This is not an unusual position for a fast-moving platform discipline — it is the position every serious systems discipline occupies at some point in its maturity curve.
The implication for a platform team building in 2026 is not that these areas should be avoided — it is that they should be approached with explicit gap accounting. Where a standard exists and an implementation gap remains, the right response is to instrument around the gap (write the lineage layer, run the eval prompts N times, keep device plugins in production while DRA matures in staging) rather than to wait. The gaps will close. The platform team's job is to build a bridge to that closure, not to wait for it.
Recommended reading
Five works that will reward a reader who has finished this series and wants to go deeper into the open problems:
- Sculley et al., Hidden Technical Debt in Machine Learning Systems (NeurIPS, 2015). The foundational taxonomy of ML-specific system debt. Every gap in this article is a new instantiation of the same structural problem: the boundary between an ML system and its environment is harder to define than the boundary between a software system and its inputs.
- Paleyes, Urma, and Lawrence, Challenges in Deploying Machine Learning: A Survey of Case Studies (ACM Computing Surveys, 2022). The empirical case-study survey across industries that puts deployment failure modes on a systematic footing. The provenance and lineage gaps in this article trace directly to their "data management" and "model monitoring" failure categories.
- Kreuzberger et al., A Multivocal Review of MLOps Practices, Challenges and Open Issues (ACM Computing Surveys, 2025). The most current systematic review of MLOps practices as of mid-2026. Useful as a cross-reference for any of the operational gaps in this article — the review's open-issues section maps directly to the implementation lags named here.
- CNCF Cloud Native AI Working Group, Cloud Native AI Whitepaper (2024). The community's own diagnosis of where cloud-native tooling does not yet meet AI workload requirements. GPU allocation and sharing, observability for AI, and data management at training scale are all named as evolving areas — matching the DRA, eval, and provenance gaps here.
- Spoczynski, Melara, and Szyller (Intel Labs), Atlas: A Framework for ML Lifecycle Provenance and Transparency (arXiv:2502.19567, 2025). The most complete proposal for extending SLSA-level supply-chain attestation to ML training pipelines. Essential reading for anyone attempting to close Gap 3 or Gap 4 in their platform.
References
- Dev, S., Sloan, A., Kavner, J., Kong, N., and Sandler, M. Judge Reliability Harness: Stress Testing the Reliability of LLM Judges. RAND Corporation / arXiv:2603.05399, March 2026.
- Zheng, L. et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. arXiv:2306.05685.
- Wang, L. et al. A Survey of AI Agent Protocols. arXiv:2504.16736, April 2025.
- Model Context Protocol Blog. One Year of MCP: November 2025 Spec Release. Anthropic, November 2025.
- Spoczynski, M., Melara, M. S., and Szyller, S. Atlas: A Framework for ML Lifecycle Provenance and Transparency. Intel Labs / arXiv:2502.19567, February 2025.
- Anonymous authors. Audit Trails for Accountability in Large Language Models. arXiv:2601.20727, January 2026.
- Kubernetes Project. Kubernetes v1.34: DRA has graduated to GA. Kubernetes Blog, September 2025.
- NVIDIA. k8s-dra-driver-gpu README. GitHub, 2025–2026.
- Kubernetes SIGs. MultiKueue Concepts. Kueue Documentation, 2026.
- Sculley, D. et al. Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015, pp. 2503–2511.
- Paleyes, A., Urma, R.-G., and Lawrence, N. D. Challenges in Deploying Machine Learning: A Survey of Case Studies. ACM Computing Surveys, Vol. 55, No. 6, Article 114, December 2022.
- Kreuzberger, D. et al. A Multivocal Review of MLOps Practices, Challenges and Open Issues. ACM Computing Surveys, 2025. DOI: 10.1145/3747346.
- CNCF Cloud Native AI Working Group. Cloud Native AI Whitepaper. CNCF TAG Runtime, March 2024.
About the Author

asleekgeek
Senior Developer, Architect, DevOps
Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist