AI Platform Engineering & MLOps · Part XXXIV of 34

Where this goes next — the open problems on a 2026 AI platform

Six open problems that no production AI platform has fully solved in 2026: eval confidence intervals, agent runtime standards, GPU-side provenance, regulator-readable lineage, the DRA migration, and multi-cluster quota federation.



This is the last article in the series. The previous 33 articles built a working vocabulary from first principles — what MLOps is, what an AI platform team owns, how training and serving workloads differ, how LLMOps extends the classical lifecycle, how Kubernetes schedules GPU workloads, and how Dynamic Resource Allocation rewrites the scheduler contract. This article does not introduce new concepts. It names the problems the series could not resolve.

Each gap below is open in a specific sense: a published source establishes the problem, no production-grade general solution exists as of mid-2026, and the gap has direct operational consequences for a platform team. These are not speculative research directions — they are the places where platform teams currently write custom code, ship workarounds, or accept known technical debt. Naming them is more useful than leaving them as undocumented assumptions.

Gap 1 — Eval harnesses do not emit confidence intervals

The article on eval as a test suite established that LLM-as-judge evaluation is structurally probabilistic: the same input can receive different scores on different evaluation runs because judge LLMs are sampled, not deterministic. This creates a requirement for confidence interval tracking on every eval run — not just a mean score, but a mean and a variance.

As of mid-2026, no widely adopted open-source eval harness emits confidence intervals as a first-class output. Tools such as Promptfoo, DeepEval, and Inspect AI report mean scores; they do not natively report variance across multiple judge-evaluation runs for the same test case. The Judge Reliability Harness (RAND, 2026) is an early research prototype that aggregates pass rates, confidence intervals, and cost curves into standardised reports — but it is a validation tool for judges themselves, not a drop-in replacement for a CI eval harness. The practical implication for platform teams: run each evaluation prompt N times (typically three to five) and compute mean ± standard deviation manually, accepting the additional inference cost.
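The manual workaround can be sketched in a few lines. This is a minimal illustration, not the output format of any of the harnesses named above: the function name `summarize_runs` and the sample scores are hypothetical, and the interval is a Student-t interval that assumes the repeated judge scores are roughly normally distributed, which may not hold for small N.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
# Only the N = 3..5 range mentioned in the text is covered in this sketch.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776}


def summarize_runs(scores):
    """Return mean, sample standard deviation, and a 95% t-interval for one test case."""
    n = len(scores)
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch only covers N = 3, 4, or 5 runs")
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_95[n - 1] * stdev / math.sqrt(n)
    return {
        "n": n,
        "mean": mean,
        "stdev": stdev,
        "ci95": (mean - half_width, mean + half_width),
    }


# Hypothetical judge scores for one prompt, evaluated five times.
summary = summarize_runs([0.80, 0.70, 0.90, 0.80, 0.80])
print(summary)
```

With only three to five runs the interval is wide, which is itself useful information: a gate that compares the lower bound, rather than the mean, against the pass threshold makes the uncertainty visible in CI.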

The gap has a second dimension. Conformal prediction methods have been proposed to produce distribution-free confidence sets for LLM outputs, but these require held-out calibration sets and are not yet integrated into any major eval framework. Until confidence intervals are a first-class output, eval pass/fail gates in CI are single-sample estimates masquerading as measurements.
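To make the calibration-set requirement concrete, here is a minimal sketch of generic split conformal prediction applied to judge scores. It is not the method of any specific proposal referenced above. It assumes a hypothetical held-out set in which each judge score is paired with a human reference score on the same 1 to 10 scale, and it assumes calibration cases and new cases are exchangeable.

```python
import math


def conformal_halfwidth(judge_scores, reference_scores, alpha=0.2):
    """Split-conformal half-width from absolute residuals on a calibration set."""
    residuals = sorted(abs(j - r) for j, r in zip(judge_scores, reference_scores))
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        raise ValueError("calibration set too small for this alpha")
    return residuals[k - 1]


# Hypothetical calibration data: judge score vs. human reference score.
judge = [7, 8, 5, 9, 6, 4, 8, 7, 3]
human = [8, 8, 4, 7, 6, 5, 9, 7, 6]

q_hat = conformal_halfwidth(judge, human, alpha=0.2)

# Interval for a new, unlabelled test case that the judge scored 7.
new_judge_score = 7
interval = (new_judge_score - q_hat, new_judge_score + q_hat)
print(q_hat, interval)
```

The interval is only as good as the calibration set: without human-labelled held-out cases there are no residuals to take a quantile of, which is the integration work the eval frameworks have not yet absorbed.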

Gap 2 — Agent runtime standards are fragmented

The article on the MLOps-to-LLMOps seam named agent runtime as one of the genuinely new LLMOps concerns. The problem is that three distinct protocols now compete for the same runtime surface, and production agent systems must implement multiple of them or accept fragmentation.

The Model Context Protocol (Anthropic, 2024) standardises tool invocation: how an agent discovers and calls external tools through authenticated, schema-based interfaces. The Agent-to-Agent protocol (Google, 2025) standardises agent-to-agent delegation: how one agent instructs another to execute a subtask. The Agent Communication Protocol (Linux Foundation, 2025) standardises agent state and message passing for longer-running workflows. These protocols are complementary in design but not yet unified in practice — a production multi-agent pipeline must bridge across all three if the components originate from different ecosystems.

A 2025 survey of AI agent protocols catalogued the fragmentation and noted that no single protocol covers the full surface of tool invocation, agent delegation, and persistent state management. The convergence path is active research. For platform teams, the practical consequence is that agent observability — tracing a request through a multi-agent graph — requires instrumenting multiple protocol boundaries, each with different tracing conventions.
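One way to picture the instrumentation burden is a single trace identifier that has to survive every protocol boundary. The sketch below uses the W3C Trace Context `traceparent` header layout (version, trace ID, span ID, flags). The hop labels are hypothetical, and where each protocol would carry such a header in its messages is an assumption left open here rather than something those specifications are known to define.

```python
import secrets


def new_traceparent():
    """Start a trace: version 00, 16-byte trace ID, 8-byte span ID, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent):
    """Keep the trace ID and mint a new span ID for the next hop."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Hypothetical hop labels, one per protocol boundary in a multi-agent request.
HOPS = ["tool-invocation", "agent-delegation", "state-message"]


def trace_request(hops):
    """Return (hop, traceparent) pairs that share one trace ID across all boundaries."""
    current = new_traceparent()
    spans = []
    for hop in hops:
        current = child_traceparent(current)
        spans.append((hop, current))
    return spans


spans = trace_request(HOPS)
for hop, header in spans:
    print(hop, header)
```

If any one boundary drops or regenerates the trace ID, the multi-agent graph splits into disconnected traces, which is why each protocol adapter needs its own propagation code today.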

Gap 3 — GPU-side data provenance is not yet attestable end-to-end

The article on supply chain security for ML covered model artefact signing and SBOM generation. The deeper open problem is one level back: the provenance of the data and compute that produced the model artefact, specifically on the GPU-accelerated training side.

The SLSA framework established supply-chain levels for software build provenance, and work such as the Atlas framework (Intel Labs, 2025) proposes extending SLSA and in-toto specifications to ML pipeline operations — capturing verifiable records of which datasets, transformations, and training runs contributed to a given model checkpoint. The gap is that GPU-executed training steps are difficult to attest: the training process is a long-running, stateful computation across many accelerators, not a discrete build step with a deterministic output.

Two sub-problems compound this. First, non-determinism in distributed training (due to floating-point accumulation order, NCCL all-reduce scheduling, and gradient checkpointing decisions) means the same dataset and code do not reproducibly produce the same checkpoint — so a hash of inputs does not reliably predict a hash of outputs. Second, current ML observability tooling captures training metrics (loss curves, throughput) but not the provenance chain regulators and auditors need: which exact dataset version, which data processing code at which commit, which hardware topology, which random seeds. Atlas is a research proposal; as of mid-2026 there is no production-hardened tool that generates SLSA-level attestations for GPU training runs.
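As an illustration of what such a provenance chain would need to capture, the sketch below builds a record from the four items listed above plus the hash of the checkpoint actually produced. The class and field names are hypothetical, the values are placeholders, and the record is neither signed nor a SLSA or in-toto attestation; it only shows that, because training is non-deterministic, the output hash has to be recorded after the fact rather than derived from the inputs.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingProvenance:
    dataset_version: str      # exact dataset version used for the run
    code_commit: str          # commit of the data processing and training code
    hardware_topology: str    # e.g. node count and accelerators per node
    random_seeds: tuple       # every seed the run was started with
    checkpoint_sha256: str    # recorded after training; not predictable from inputs


def record_digest(record):
    """SHA-256 over a canonical JSON encoding of the record."""
    canonical = json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
run_a = TrainingProvenance(
    dataset_version="dataset-v12",
    code_commit="abc1234",
    hardware_topology="4 nodes x 8 GPUs",
    random_seeds=(17, 42),
    checkpoint_sha256=hashlib.sha256(b"checkpoint bytes from run A").hexdigest(),
)
run_b = TrainingProvenance(
    dataset_version="dataset-v12",
    code_commit="abc1234",
    hardware_topology="4 nodes x 8 GPUs",
    random_seeds=(17, 42),
    checkpoint_sha256=hashlib.sha256(b"checkpoint bytes from run B").hexdigest(),
)

print(record_digest(run_a))
print(record_digest(run_b))
```

Two runs with identical inputs yield different record digests once their checkpoints differ, so an auditor needs the full record for each run, not just a hash of the inputs.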

Gap 4 — Regulator-readable lineage does not yet exist as a product

The article on governance, lineage, and model cards described what a mature governance layer looks like. The operational gap is between what regulators require and what current tooling produces.

Article 12 of the EU AI Act (Regulation (EU) 2024/1689) requires providers of high-risk AI systems to implement automatic logging that allows regulators to verify that systems operated in accordance with their intended purpose. A 2026 review of audit trail requirements for LLMs identified a persistent gap: current ML observability tooling produces human-readable dashboards and alert histories, not tamper-evident, machine-readable logs that connect a specific runtime decision to the governing policy and the model version that produced it. Regulators wanting to verify conformity assessments need logs they can query; what most platforms produce is logs they can visualise.

The practical consequence is that regulated-industry platform teams must build a bespoke lineage layer that maps from MLflow run IDs, Kubernetes pod logs, and inference server access logs into a structured, timestamped, append-only record that a compliance team can hand to an auditor. This is hand-rolled at every organisation that has done it. It is the most expensive undifferentiated work in regulated AI platform engineering, and it will remain so until either a framework ships this as a first-class feature or the regulatory community specifies a machine-readable format that tooling vendors can target.
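The core of such a lineage layer can be sketched as a hash-chained, append-only log. This is an illustrative minimum under stated assumptions: the field names (`run_id`, `model_version`, `policy_id`, `decision`) are hypothetical, the log lives in memory, and nothing here is a format that any regulator or framework has specified.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first entry


def _digest(body):
    """SHA-256 over a canonical JSON encoding of one entry body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(log, timestamp, run_id, model_version, policy_id, decision):
    """Append one record that links a runtime decision to its model and policy."""
    body = {
        "timestamp": timestamp,
        "run_id": run_id,
        "model_version": model_version,
        "policy_id": policy_id,
        "decision": decision,
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS,
    }
    log.append({**body, "entry_hash": _digest(body)})


def verify(log):
    """Return True only if every entry is intact and correctly chained."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev or _digest(body) != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True


# Hypothetical entries for illustration.
log = []
append_entry(log, "2026-06-01T10:00:00Z", "run-001", "model-v3", "policy-7", "approved")
append_entry(log, "2026-06-01T10:00:05Z", "run-001", "model-v3", "policy-7", "rejected")
print(verify(log))
```

Editing any earlier entry changes its hash and breaks every later link, which is what makes the record tamper-evident. Making it queryable by an auditor is a separate storage and schema problem that this sketch does not address.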

Gap 5 — The device-plugin-to-DRA migration has no production landing yet

The article on DRA and the future of GPU scheduling described the architecture shift from the legacy device plugin model to Dynamic Resource Allocation (DRA). The DRA API reached GA in Kubernetes v1.34 (September 2025). The gap is on the driver side.

NVIDIA’s DRA driver for Kubernetes (k8s-dra-driver-gpu) states in its README that GPU allocation features “can be tried out” but “are not yet officially supported”. DRA support for NVIDIA Inference Microservices (NIM) is classified as Technology Preview — not suitable for production deployment. The result is a standards gap: the Kubernetes API is stable, the protocol is defined, but the dominant GPU vendor’s production driver does not yet implement it fully.

Platform teams face a choice: migrate to DRA now and accept driver-level risk, or stay on device plugins and accept that the Kubernetes scheduler improvements DRA enables (topology-aware allocation, partitionable devices, MIG dynamic repartition) are unavailable. The second option is the dominant one in production clusters as of mid-2026. This gap will close when the vendor driver reaches supported status and the Kueue DRA integration (currently at alpha for DRAExtendedResources) graduates — but the timeline is not fixed. The practical recommendation in earlier articles in this series stands: adopt DRA in staging for new workload types; do not migrate existing device-plugin-based production pipelines until driver support is official.

Gap 6 — Multi-cluster GPU quota federation is not yet closed

The article on multi-tenancy and fairness covered per-cluster quota management with Kueue. The open problem is one level above: borrowing quota across clusters so a job admitted to one cluster can draw on idle GPU capacity in another.

Kueue’s MultiKueue feature promoted batch job dispatching to GA in February 2026, a meaningful step. However, the broader capability of multi-cluster quota borrowing (admitting a job in cluster A against quota headroom in cluster B) remains a 2026 roadmap priority, not a shipped stable feature. The Kueue project’s 2026 plans list “improve user experience for MultiKueue” as a high-level priority, which signals active development but not a closed gap.

For hybrid platform architectures — on-prem GPU clusters for long training runs alongside cloud-managed Kubernetes for burst inference — the inability to pool quota across the boundary means each cluster has its own admission controller and its own queue backlog. Unused GPU capacity in one cluster cannot be borrowed by a job waiting in another. Organisations work around this with cluster-level federation (a separate service that watches queue depths and reschedules jobs across clusters), but this is bespoke automation. The gap closes when MultiKueue ships stable cross-cluster quota borrowing with documented failure semantics.
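The bespoke federation service described above reduces, at its core, to a planning loop like the following. It is a deliberately naive sketch: `ClusterState` and `plan_moves` are hypothetical names, the greedy policy is an assumption, and it ignores data locality, preemption, and what happens when a target cluster fails mid-move, which are exactly the failure semantics a stable feature would need to document.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    name: str
    idle_gpus: int
    # Pending jobs as (job_name, gpus_requested), oldest first.
    pending_jobs: list = field(default_factory=list)


def plan_moves(clusters):
    """Greedy plan: send each job that cannot fit locally to the cluster with most idle GPUs."""
    idle = {c.name: c.idle_gpus for c in clusters}
    moves = []
    for cluster in clusters:
        for job, gpus in cluster.pending_jobs:
            if idle[cluster.name] >= gpus:
                idle[cluster.name] -= gpus  # fits locally, so no move is planned
                continue
            candidates = [n for n in idle if n != cluster.name and idle[n] >= gpus]
            if candidates:
                target = max(candidates, key=lambda n: idle[n])
                idle[target] -= gpus
                moves.append((job, cluster.name, target))
    return moves


# Hypothetical snapshot: a saturated on-prem cluster and a cloud cluster with headroom.
clusters = [
    ClusterState("onprem", idle_gpus=0, pending_jobs=[("train-a", 8), ("train-b", 4)]),
    ClusterState("cloud", idle_gpus=10),
]
moves = plan_moves(clusters)
print(moves)
```

In this snapshot only the first job can be moved. The second stays queued on-prem because the remaining cloud headroom is too small, and nothing in the plan accounts for quota ownership, which is the part cross-cluster borrowing would formalise.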

The thread connecting these gaps

Each of these six gaps shares a structural feature: the standards layer exists, or is emerging, but the production implementation lags. DRA’s Kubernetes API is stable; the GPU driver is not. The EU AI Act’s logging requirements are law; the tooling that satisfies them is not. Agent protocols are specified; a unified runtime is not. Eval harnesses are mature; confidence interval reporting is not. This is not an unusual position for a fast-moving platform discipline — it is the position every serious systems discipline occupies at some point in its maturity curve.

The implication for a platform team building in 2026 is not that these areas should be avoided; it is that they should be approached with explicit gap accounting. Where a standard exists and an implementation gap remains, the right response is to instrument around the gap (write the lineage layer, run the eval prompts N times, keep device plugins in production while DRA matures in staging) rather than to wait. The gaps will close. The platform team’s job is to build a bridge to that closure.

The six open gaps are each a new instantiation of the structural problem identified by Sculley et al. (NeurIPS 2015): the boundary between an ML system and its environment is harder to define than the boundary between a software system and its inputs. Every gap in this article is that boundary problem in a new domain — eval, agents, provenance, lineage, scheduling, federation. Naming the boundary is the first step to closing it.

References

[1] Dev, S., Sloan, A., Kavner, J., Kong, N., and Sandler, M. Judge Reliability Harness: Stress Testing the Reliability of LLM Judges. RAND Corporation / arXiv:2603.05399, March 2026.
[2] Zheng, L. et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. arXiv:2306.05685.
[3] Wang, L. et al. A Survey of AI Agent Protocols. arXiv:2504.16736, April 2025.
[4] Model Context Protocol Blog. One Year of MCP: November 2025 Spec Release. Anthropic, November 2025.
[5] Spoczynski, M., Melara, M. S., and Szyller, S. Atlas: A Framework for ML Lifecycle Provenance and Transparency. Intel Labs / arXiv:2502.19567, February 2025.
[6] Anonymous authors. Audit Trails for Accountability in Large Language Models. arXiv:2601.20727, January 2026.
[7] Kubernetes Project. Kubernetes v1.34: DRA has graduated to GA. Kubernetes Blog, September 2025.
[8] NVIDIA. k8s-dra-driver-gpu README. GitHub, 2025–2026.
[9] Kubernetes SIGs. MultiKueue Concepts. Kueue Documentation, 2026.
[10] Sculley, D. et al. Hidden Technical Debt in Machine Learning Systems. NeurIPS 2015, pp. 2503–2511.
[11] Paleyes, A., Urma, R.-G., and Lawrence, N. D. Challenges in Deploying Machine Learning: A Survey of Case Studies. ACM Computing Surveys, Vol. 55, No. 6, Article 114, December 2022.
[12] Kreuzberger, D. et al. A Multivocal Review of MLOps Practices, Challenges and Open Issues. ACM Computing Surveys, 2025. DOI: 10.1145/3747346.
[13] CNCF Cloud Native AI Working Group. Cloud Native AI Whitepaper. CNCF TAG Runtime, March 2024.
