Dynamic Resource Allocation: What Changes When Devices Become First-Class

Figure: Device-plugin vs DRA scheduling paths. The allocation model shifts from opaque integer counters to structured, scheduler-visible claims.
For years, the device-plugin model has been how Kubernetes surfaces GPUs to workloads. A DaemonSet on each node registers an integer counter (say, nvidia.com/gpu: 4) and the scheduler deducts from that count when a pod is placed. It is simple to implement, simple to understand, and, at scale, demonstrably insufficient. Four structural ceilings are baked into the model by design, not by accident.
Dynamic Resource Allocation (DRA) is the Kubernetes mechanism designed to replace that model. It reached stable (GA) in Kubernetes v1.34, released in August 2025, as the resource.k8s.io/v1 API, enabled by default with no feature gate required [1].
This article walks through why DRA exists, what the API looks like, how the scheduling path changes, and which parts of the ecosystem have caught up. If you have read the earlier articles in this series on gang scheduling and GPU sharing mechanisms, this is where those pieces connect to the Kubernetes roadmap.
The four ceilings of the device-plugin model
Understanding what DRA replaces is a prerequisite for understanding what it does. The device-plugin model hits four hard limits.
1. Integer-only requests
Extended resources must be whole integers. A pod requests nvidia.com/gpu: 1 or nvidia.com/gpu: 2 — never 0.5 or a memory-bound fraction. Workarounds like MIG slices or time-slicing expose fractional capacity by pre-advertising named integer resources (e.g. nvidia.com/mig-3g.40gb), but the granularity is fixed at configuration time. Workloads cannot negotiate at scheduling time — they must fit one of the pre-cut profiles.
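A toy sketch of what "fit one of the pre-cut profiles" means in practice. The profile table and the helper `smallest_fitting_profile` are illustrative assumptions, not part of any NVIDIA or Kubernetes API; the memory figures are simply read off the profile names.

```python
# Hypothetical pre-advertised MIG resources on one node: name -> memory in GB.
# Names follow the nvidia.com/mig-<compute>g.<memory>gb pattern used above.
PRE_CUT_PROFILES = {
    "nvidia.com/mig-1g.10gb": 10,
    "nvidia.com/mig-3g.40gb": 40,
}


def smallest_fitting_profile(needed_gb, profiles=PRE_CUT_PROFILES):
    """Return (resource_name, wasted_gb) for the smallest profile that fits,
    or None when no pre-cut profile is large enough."""
    candidates = [(gb, name) for name, gb in profiles.items() if gb >= needed_gb]
    if not candidates:
        return None
    gb, name = min(candidates)
    return name, gb - needed_gb


if __name__ == "__main__":
    # A workload needing 20 GB cannot ask for 20 GB; it has to take the 40 GB slice.
    print(smallest_fitting_profile(20))
```

The wasted capacity in the result is the cost of fixing granularity at configuration time: the request is rounded up to whatever the admin happened to pre-cut.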
2. No topology in the request
A pod requesting two GPUs cannot express "give me two GPUs connected by NVLink on the same node." That constraint must be encoded in node labels and affinity rules authored by the workload owner — who has to know the topology of the cluster ahead of time. The device plugin has no path to communicate fabric topology to the scheduler before allocation decisions are made.
3. Opaque allocation
The scheduler sees integer counts in Allocatable but not which physical devices are currently free. Cluster Autoscaler cannot simulate device-plugin allocation to decide whether a pending pod would fit on a hypothetical new node; it has to guess or assume. KEP-4381's motivation section names this as a goal of structured parameters: the scheduler should handle, and Cluster Autoscaler should simulate, claim allocation "themselves without relying on a third-party driver" [2].
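The difference between a counter and per-device data can be shown with a small toy model. Everything here (the `Device` type, the two `fits_*` helpers, the sample node) is hypothetical and only illustrates the information gap; it is not how the scheduler is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    memory_gb: int
    in_use: bool


def fits_by_counter(free_count, requested):
    """Device-plugin view: only an integer is compared."""
    return free_count >= requested


def fits_by_devices(devices, requested, min_memory_gb):
    """Structured view: count only free devices that meet the requirement."""
    matching = [d for d in devices if not d.in_use and d.memory_gb >= min_memory_gb]
    return len(matching) >= requested


# Hypothetical node: two GPUs are free, but only one of them has 80 GB.
NODE = [
    Device("gpu-0", 80, in_use=True),
    Device("gpu-1", 80, in_use=False),
    Device("gpu-2", 40, in_use=False),
    Device("gpu-3", 40, in_use=True),
]

if __name__ == "__main__":
    free = sum(not d.in_use for d in NODE)
    print(fits_by_counter(free, 2))      # the counter says two GPUs are available
    print(fits_by_devices(NODE, 2, 80))  # per-device data says two 80 GB GPUs are not
```

An autoscaler that only has the first function available cannot tell whether adding a node of a given shape would actually satisfy the pending pod.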
4. No dynamic repartitioning
Changing a MIG profile requires draining the node, reconfiguring the partition, and restarting the plugin. Partitioning is static relative to the scheduling loop. A workload that needs a 2g.20gb slice on a node currently partitioned into 1g.10gb slices cannot be accommodated at scheduling time — the cluster admin must proactively match partition strategy to anticipated workload shapes, or lose utilisation.
A short history of two KEPs
The DRA initiative produced two distinct designs. KEP-3063 introduced the original "control-plane controller" model (alpha in Kubernetes 1.26): a third-party driver handled all claim allocation via API callbacks, keeping allocation logic entirely opaque to the scheduler. This preserved Cluster Autoscaler's inability to simulate allocations — the same structural problem as device plugins, moved one layer up. KEP-3063 was withdrawn as the primary path in Kubernetes 1.32 [3].
KEP-4381 ("structured parameters") reversed the roles: the driver publishes structured capability data into ResourceSlice objects in the API server, and the scheduler itself performs allocation using CEL expressions from the claim. No driver callback at scheduling time — the scheduler can reason about devices entirely from first-party API objects. KEP-4381 shipped beta (v1beta1) in Kubernetes 1.32, added v1beta2 in 1.33, and reached stable as resource.k8s.io/v1 in Kubernetes 1.34 [2]. When documentation refers to "DRA" today, it means the KEP-4381 model.
The DRA object model: four API types
Four types compose the resource.k8s.io/v1 surface, each with a distinct role in the allocation lifecycle:
- ResourceSlice: created and maintained by the DRA driver (typically a DaemonSet on each node). Describes the devices a node offers: their attributes and capacities (GPU model, memory, NVLink topology, MIG capability), driver reference, and pool membership. The scheduler reads ResourceSlice objects directly; no callback to a driver is needed to decide whether a device fits a claim.
- DeviceClass: cluster-level object authored by the admin or installed by the driver. Acts as a selector: "any GPU managed by the gpu.nvidia.com driver". Workloads reference a DeviceClass to avoid hard-coding driver names into job manifests.
- ResourceClaim: the actual allocation request. Specifies what is needed (e.g. "one GPU with >= 40 GB memory from the nvidia-gpu DeviceClass") using CEL selector expressions. The scheduler resolves the claim against available ResourceSlices and records the allocated device in the claim's status. A ResourceClaim can be shared: every pod that references the same claim gets access to the same allocated device(s), so sharing suits pods that can run where those devices are reachable.
- ResourceClaimTemplate: a namespaced template from which a per-pod ResourceClaim is generated automatically, mirroring the PersistentVolumeClaim template pattern. Used when each pod needs its own exclusive device allocation.
A minimal DeviceClass and ResourceClaim look like this:
# resource.k8s.io/v1, Kubernetes >= 1.34 (DRA GA, enabled by default)
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
  - cel:
      expression: "device.driver == 'gpu.nvidia.com'"
---
# ResourceClaim: select a GPU by memory instead of by integer count.
# In the v1 API, deviceClassName and selectors sit under `exactly`.
# The capacity name is driver-defined; check the driver's ResourceSlices.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: training-gpu
  namespace: team-research
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: nvidia-gpu
        selectors:
        - cel:
            expression: "device.capacity['gpu.nvidia.com'].memory.compareTo(quantity('40Gi')) >= 0"
---
# Pod referencing the claim: note spec.resourceClaims, not resources.limits.
# The pod must live in the same namespace as the claim it references.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
  namespace: team-research
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimName: training-gpu
  containers:
  - name: trainer
    image: nvcr.io/nvidia/pytorch:25.03-py3
    resources:
      claims:
      - name: gpu-claim

How the scheduling path changes: device plugin vs DRA
The two models differ most at the point where the scheduler decides whether a pod fits a node. In the device-plugin model that decision is made with incomplete information; in DRA it is made with full structured visibility.
Device-plugin path (current):
1. The node's device plugin reports an integer count to the kubelet.
2. The kubelet reflects the count in Node.Status.Allocatable.
3. The scheduler subtracts from the integer counter during filter/score. No device attributes are visible.
4. The kubelet calls the device plugin's Allocate() at pod start; the plugin selects the specific device(s), and the scheduler never learns which physical device was used.
DRA path (Kubernetes >= 1.34):
1. The DRA driver publishes ResourceSlice objects into the API server; each slice lists devices from a node's pool together with their full attribute data.
2. The workload author creates a ResourceClaim with CEL selectors expressing actual needs (memory, topology, MIG capability).
3. The scheduler evaluates the CEL expressions against ResourceSlice data inline, with no driver round-trip, and records the specific device in the ResourceClaim status.
4. The DRA kubelet plugin reads the allocated ResourceClaim and performs device setup (CDI annotations, environment variables) at pod start.
5. Cluster Autoscaler can simulate step 3 against virtual ResourceSlice projections, enabling correct scale-out decisions for pending DRA workloads.
The structural improvement is that allocation decisions and the data needed to make them live in the same API layer. The scheduler is no longer reasoning about a proxy (an integer counter) for a physical reality it cannot see.
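The allocation step of the DRA path can be sketched in a few lines. This is a deliberately simplified model, not the scheduler's implementation: `NodeSlice`, `SliceDevice`, `Claim` and `allocate` are hypothetical names, a Python callable stands in for the CEL expression, and the `memory_gib` attribute is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class SliceDevice:
    name: str
    attributes: Dict[str, object]


@dataclass
class NodeSlice:
    """Stand-in for the ResourceSlice data a driver publishes for one node."""
    node: str
    driver: str
    devices: List[SliceDevice]


@dataclass
class Claim:
    """Stand-in for a ResourceClaim; `selector` plays the role of a CEL expression."""
    name: str
    driver: str
    selector: Callable[[SliceDevice], bool]
    allocation: Optional[Tuple[str, str]] = None  # (node, device) once allocated


def allocate(claim, slices, reserved):
    """Pick the first free device that matches and record it on the claim.

    `reserved` is a set of (node, device) pairs already held by other claims.
    Returns True on success, False if nothing fits (the pod would stay pending).
    """
    for s in slices:
        if s.driver != claim.driver:
            continue
        for dev in s.devices:
            key = (s.node, dev.name)
            if key not in reserved and claim.selector(dev):
                reserved.add(key)
                claim.allocation = key
                return True
    return False


SLICES = [
    NodeSlice("node-a", "gpu.nvidia.com", [SliceDevice("gpu-0", {"memory_gib": 24})]),
    NodeSlice("node-b", "gpu.nvidia.com", [SliceDevice("gpu-0", {"memory_gib": 80})]),
]

if __name__ == "__main__":
    reserved = set()
    claim = Claim("training-gpu", "gpu.nvidia.com",
                  lambda d: d.attributes["memory_gib"] >= 40)
    print(allocate(claim, SLICES, reserved), claim.allocation)
```

The point of the sketch is that `allocate` needs nothing but the published slice data and the claim: there is no call out to a driver, which is what makes the same logic reusable for simulation.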
What is stable in v1.34, what is still maturing
- Stable in v1.34 (GA, enabled by default): ResourceClaim, ResourceClaimTemplate, DeviceClass, ResourceSlice core objects; basic CEL device selectors; per-claim and shared-claim allocation; ResourceClaim reference in pod specs.
- Alpha in v1.34, beta in v1.36: DRAConsumableCapacity (capacity-aware scheduling); KEP-4815 Partitionable Devices (request a sub-partition of a device, e.g. MIG slice, at claim time) [5]; Device Taints (taint individual devices in a ResourceSlice, analogous to node taints) [4].
- GA in v1.36: Admin Access (privileged claim for device-admin workloads like monitoring DaemonSets); Prioritized List (ordered device preference in a single claim request) [4].
- Ecosystem integration — alpha / in-progress as of v1.34: Kueue DRAExtendedResources feature gate (alpha, disabled by default in v1.34) for quota-aware scheduling of DRA workloads [6]. Volcano PodGroup DRA awareness: not yet present in the current Volcano release — queue quota tracking is inaccurate for DRA claims.
The NVIDIA DRA driver: what is supported today
NVIDIA's DRA driver for Kubernetes is hosted at github.com/NVIDIA/k8s-dra-driver-gpu. The driver has two components: a ResourceSlice publisher (DaemonSet per node) that writes device attributes into the API server, and a kubelet plugin that handles device setup at pod start via CDI rather than the legacy device-path mounting used by device plugins [7].
The driver README is explicit: "While some GPU allocation features can be tried out, they are not yet officially supported." The kubelet plugin is disabled by default even when the driver is bundled with recent GPU Operator versions. The GPU Operator's Helm values must explicitly enable the DRA DaemonSet.
One feature that is officially supported in the NVIDIA DRA driver: the ComputeDomain feature for Multi-Node NVLink (MNNVL) topologies — configurations where multiple nodes are connected via NVLink fabric (e.g. GB200 NVL72 racks). This is the first production use-case for DRA on NVIDIA hardware because the device-plugin model has no mechanism to represent cross-node fabric membership at all [7].
CDI (Container Device Interface) is a prerequisite. DRA relies on CDI for device injection into container namespaces rather than legacy device-path bind-mounts. CDI must be enabled in the container runtime configuration, and the NVIDIA driver version must be >= 580. Verify CDI readiness against your GPU Operator release notes before enabling the DRA kubelet plugin — CDI configuration support varies across Kubernetes distributions and operator versions.
Coexistence: DRA and device plugin on the same cluster
DRA and the device plugin can run simultaneously on the same cluster — and on the same node. A node can advertise nvidia.com/gpu via the device plugin (for existing workloads using resources.limits) while also running the DRA kubelet plugin (for workloads using resourceClaims). Migration is per-workload, not per-cluster, which allows incremental adoption.
The integrations that break without modification during migration are predictable:
- Gang-scheduling queue managers (e.g. Volcano PodGroup) count resources.limits[nvidia.com/gpu]; DRA claims in spec.resourceClaims are invisible to the quota engine, so queue quotas will be inaccurate until the scheduler adds DRA awareness.
- Quota-aware admission controllers that model GPU resources as extended resources (e.g. Kueue's ResourceFlavor) need the DRAExtendedResources feature gate enabled and tuned for DRA claims; the alpha feature gate is disabled by default in v1.34.
- Helm charts with resources.limits["nvidia.com/gpu"] must be updated to use resourceClaims; the two request styles are not interchangeable.
- nvidia-smi inside containers works with CDI device injection, but requires NVIDIA Driver >= 580 and CDI enabled in containerd. Legacy device-path mounting used by the device plugin is not used by the DRA driver.
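The quota blind spot is easy to reproduce with a toy accounting function. The dictionaries below are simplified, hypothetical stand-ins for pod specs, and `gpu_usage_from_limits` is an illustration of limit-based counting in general, not Volcano's or Kueue's actual code.

```python
def gpu_usage_from_limits(pods):
    """Sum nvidia.com/gpu limits across pods, as an extended-resource quota engine would."""
    total = 0
    for pod in pods:
        for container in pod.get("containers", []):
            limits = container.get("resources", {}).get("limits", {})
            total += int(limits.get("nvidia.com/gpu", 0))
    return total


def pods_with_claims(pods):
    """Names of pods that request devices through spec.resourceClaims instead."""
    return [p["name"] for p in pods if p.get("resourceClaims")]


PODS = [
    {   # legacy style: counted by the quota engine
        "name": "legacy-worker",
        "containers": [{"resources": {"limits": {"nvidia.com/gpu": 2}}}],
    },
    {   # DRA style: holds a GPU through a claim and contributes 0 to the sum
        "name": "training-worker",
        "resourceClaims": [{"name": "gpu-claim", "resourceClaimName": "training-gpu"}],
        "containers": [{"resources": {"claims": [{"name": "gpu-claim"}]}}],
    },
]

if __name__ == "__main__":
    print(gpu_usage_from_limits(PODS))  # GPUs the quota engine believes are in use
    print(pods_with_claims(PODS))       # pods whose devices it never sees
```

During a migration, listing the pods returned by the second function is a cheap way to find workloads whose GPU usage is missing from queue accounting.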
Where DRA sits in the scheduling stack
The one place DRA genuinely changes the stack is at the Cluster Autoscaler. Autoscaler scale-out decisions for GPU workloads have historically required heuristics or custom expanders because device-plugin allocation was opaque. With ResourceSlice data available in the API, Autoscaler gains the same structured visibility the scheduler has — enabling correct bin-packing simulation when deciding whether to provision a new GPU node.
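The question the autoscaler has to answer can be reduced to a small feasibility check. This is only the shape of the problem under strong simplifying assumptions (one device per claim, memory as the only requirement, a single hypothetical node template); `would_fit` is an invented helper, not Cluster Autoscaler code.

```python
def would_fit(pending_claims, template_devices):
    """Return True if every pending claim can get its own device on one
    hypothetical new node built from `template_devices`.

    pending_claims: list of minimum-memory requirements in GiB, one per claim.
    template_devices: list of per-device memory in GiB for the node template.
    """
    free = sorted(template_devices)
    # Largest requirement first, each taking the smallest device that still fits.
    for need in sorted(pending_claims, reverse=True):
        match = next((m for m in free if m >= need), None)
        if match is None:
            return False
        free.remove(match)
    return True


if __name__ == "__main__":
    # Two pending claims against a template node with one 40 GiB and one 80 GiB GPU.
    print(would_fit([80, 40], [40, 80]))
    print(would_fit([80, 80], [40, 80]))
```

With only an integer counter, both calls would look identical (two GPUs requested, two GPUs on the template); the per-device data is what lets the second case be rejected before a node is provisioned.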
When to adopt: a pragmatic guide
DRA is GA in the API, but ecosystem readiness is not uniform. A useful frame for adoption decisions:
Hold: cluster on Kubernetes < 1.34, or production queue accounting depends on Volcano/Kueue
If your cluster runs a Kubernetes release older than 1.34, or production GPU workloads are managed through Volcano queues or Kueue quotas, stay on the device-plugin path. The stable nvidia.com/gpu extended-resource path is mature, the observability integration (DCGM exporter → Prometheus) works cleanly with it, and introducing DRA before the queue managers gain DRA awareness creates quota accounting blind spots that are hard to debug.
Pilot: cluster on Kubernetes >= 1.34, evaluating NVLink topologies or inference bin-packing
Pilot DRA when your cluster runs Kubernetes >= 1.34 and one of the following applies:
- You are evaluating multi-node NVLink topologies (e.g. GB200 NVL72) where DRA's ComputeDomain feature — the first officially supported NVIDIA DRA feature — addresses a gap the device plugin cannot fill.
- You are running inference workloads where memory-bound fractional requests would meaningfully improve bin-packing over fixed MIG profiles — and Kubernetes >= 1.36's Partitionable Devices beta is available.
- You are building a new workload class with no existing Helm-chart debt and can design for resourceClaims from the start.
In any pilot: run the device plugin in parallel on the same nodes, monitor quota accounting gaps in your queue manager, and verify CDI configuration on your container runtime before enabling the DRA kubelet plugin.
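For teams that want the guide as a checklist, the hold/pilot rules above can be written down as a small function. The function and parameter names are hypothetical, and the function returns None for situations the guide does not cover rather than guessing.

```python
def parse_minor(version):
    """Turn '1.34' or 'v1.34.2' into a comparable (major, minor) tuple."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def adoption_stance(k8s_version, queue_accounting_depends_on_volcano_or_kueue,
                    evaluating_mnnvl=False, fractional_inference=False,
                    greenfield_workload=False):
    """Return 'hold', 'pilot', or None when the guide above does not decide."""
    version = parse_minor(k8s_version)
    if version < (1, 34) or queue_accounting_depends_on_volcano_or_kueue:
        return "hold"
    # The fractional-inference case also needs Partitionable Devices (beta in 1.36).
    fractional_ready = fractional_inference and version >= (1, 36)
    if evaluating_mnnvl or fractional_ready or greenfield_workload:
        return "pilot"
    return None


if __name__ == "__main__":
    print(adoption_stance("1.32", False))
    print(adoption_stance("1.34", False, evaluating_mnnvl=True))
```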
Honest caveats: what DRA does not yet do
- No on-demand MIG repartitioning at claim time. DRA advertises current MIG profiles in ResourceSlice, but dynamically creating a new MIG partition in response to an incoming claim is not part of the stable v1.34 API. Partitionable Devices (KEP-4815) is the mechanism that will eventually enable this; it was beta (enabled by default) in v1.36. MIG profiles still must be pre-configured at the node level for the stable path [5].
- GPU allocation is not yet production-supported by NVIDIA. The NVIDIA DRA driver README states explicitly that general GPU allocation features are not yet officially supported. ComputeDomain for MNNVL is the exception. Monitor the driver's release notes — not blog posts — before committing production workloads [7].
- Queue manager integration is partial. Kueue's DRA integration (DRAExtendedResources) was alpha and disabled by default in v1.34. Volcano's PodGroup quota accounting does not count DRA claims in the current release. Running DRA workloads through these queue managers without explicit DRA integration produces quota accounting gaps.
- DCGM observability is not DRA-aware. DCGM exporter reports per-device GPU utilisation metrics regardless of how the device was allocated — device plugin or DRA. However, correlating a DCGM metric to a specific ResourceClaim (rather than to a pod via extended-resource labels) requires updated dashboards and metric labelling. The existing Prometheus integration works, but claim-level attribution needs manual work.
References
[1] Kubernetes Blog. "Kubernetes v1.34: DRA has graduated to GA" (September 2025).
[2] Kubernetes SIG-Node. KEP-4381: DRA Structured Parameters README (motivation, design, graduation criteria). kubernetes/enhancements on GitHub.
[3] Kubernetes SIG-Node. KEP-3063: Dynamic Resource Allocation (control-plane controller model, withdrawn as primary path in 1.32). kubernetes/enhancements issue tracker.
[4] Kubernetes Blog. "Kubernetes v1.36 Release" (April 2026). DRA Admin Access GA, Prioritized List GA, Partitionable Devices beta, Device Taints beta, Consumable Capacity beta noted in release highlights.
[5] Kubernetes SIG-Node. KEP-4815: Partitionable Devices (alpha in 1.33, beta in 1.36). kubernetes/enhancements issue tracker.
[6] Kueue. "Dynamic Resource Allocation" concepts page. kueue.sigs.k8s.io. (DRAExtendedResources feature gate, integration architecture.)
[7] NVIDIA. k8s-dra-driver-gpu repository. GitHub. (ComputeDomain official support; GPU allocation not yet officially supported; kubelet plugin disabled by default.)
[8] Kubernetes Documentation. "Dynamic Resource Allocation" concept page (updated for v1 API in Kubernetes 1.34).
About the Author

asleekgeek
Senior Developer, Architect, DevOps
Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist