The GPU scheduling stack: queue admission, gang scheduling, and hardware abstraction in three layers

9 min read · asleekgeek
The three-layer GPU scheduling stack: quota admission, gang scheduling, hardware abstraction.

Three components appear repeatedly in GPU-capable Kubernetes platforms: a job-queue admission controller, a gang-aware batch scheduler, and a GPU operator. Each is documented in isolation. The documentation rarely makes the responsibility boundary between them explicit — which layer decides whether a job runs, which layer decides where, and which layer decides how many GPU units exist in the first place.

This article maps each layer to a single question, traces a job through all three, and identifies which layer fails first in the three most common production failure modes. It is Part 7, Article 25 of the AI Platform Engineering & MLOps series.

One sentence per layer

Before the detail, the boundary in one sentence each:

  • Kueue asks: does this job fit within the team's quota right now?
  • Volcano asks: can all pods in this job be placed simultaneously?
  • The GPU Operator answers: how many schedulable GPU units does each node advertise to the kubelet?

Each layer is ignorant of the others' internals. Kueue counts resource units; it does not know whether those units come from MIG partitions or time-slicing. Volcano places pods; it does not know whether the admission quota has already been checked. The GPU Operator exposes resources; it does not know about queues at all. This loose coupling is the design — but it also means a misconfiguration in any layer is invisible to the other two.

Layer 1 — Kueue: quota admission

Kueue is a Kubernetes-native job queueing system developed under SIG Scheduling, introduced to the Kubernetes ecosystem in October 2022. It does not replace the kube-scheduler — it sits in front of it, deciding whether a job is allowed to run at all before the scheduler ever sees the pods.

Four primitives

Kueue's data model has four core primitives, documented in the Kueue concepts reference:

  • ResourceFlavor — names a class of capacity (e.g. a node pool carrying A100 GPUs vs one carrying L40S GPUs). A ResourceFlavor maps to node labels and tolerations.
  • ClusterQueue — the quota gate at cluster scope. A ClusterQueue owns a slice of the ResourceFlavor's capacity: e.g. team-a may use up to 8 nvidia.com/gpu units.
  • LocalQueue — a namespace-scoped pointer into a ClusterQueue. Teams submit jobs to their namespace's LocalQueue; the LocalQueue forwards admission requests to the ClusterQueue.
  • Workload — the admission object wrapping a Job, JobSet, PyTorchJob, or similar. Kueue holds the Workload in a Pending state (with the underlying job suspended, so its pods are not yet created) until the ClusterQueue has capacity. When capacity is available, Kueue unsuspends the job and the scheduler sees the pods for the first time.
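
As a rough sketch of how the flavor and the namespace queue might be declared, the manifests below point at the team-a-queue ClusterQueue used in the configuration sketch later in this article. The flavor name gpu-a100 matches that fragment; the LocalQueue name team-a-local and the node label value are assumptions for illustration, so check which labels GPU Feature Discovery actually sets on your nodes.

```yaml
# Sketch only: names and label values are illustrative assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-a100
spec:
  nodeLabels:
    # Assumed label value; inspect your nodes for the real product string.
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-local          # hypothetical queue name
  namespace: team-a
spec:
  clusterQueue: team-a-queue  # the cluster-scoped quota gate
```

The Workload object is not written by hand in this flow; Kueue creates it when a job that names the LocalQueue is submitted.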

What Kueue does not do

Kueue does not place pods. It holds or releases a job; once released, placement is entirely the scheduler's concern. This is stated explicitly in the ClusterQueue concepts documentation. The implication: if you need all pods of a distributed job to land simultaneously (gang scheduling), you need an additional mechanism at the placement layer — Kueue's release of a Workload does not guarantee atomic placement.

Layer 2 — Volcano: gang scheduling and placement

Volcano is a CNCF incubating batch scheduling project (CNCF projects page). It extends Kubernetes with a batch scheduling framework that adds two capabilities the default kube-scheduler does not provide: gang scheduling and topology-aware placement.

Why gang scheduling matters for distributed training

A distributed training job running PyTorch DDP or DeepSpeed requires all worker processes to start before any of them can begin the first communication round. Without gang scheduling, a partial allocation is possible: some workers start, others cannot schedule because the remaining nodes are occupied, and the running workers hold their GPU allocations while waiting for peers that never arrive. This is a starvation deadlock — documented in the Volcano project documentation. Volcano's response is the PodGroup: all members of the group must be satisfiable before any pod is bound to a node.

Key primitives

  • PodGroup — the atomic scheduling unit. The minMember field sets how many pods must be satisfiable before any are bound. Setting minMember lower than the job's actual worker count defeats the gang guarantee.
  • Queue — a Volcano-level fair-share queue with configurable weights. Multiple teams can share a cluster-level capacity pool; the Queue CRD carries a capability ceiling and a reclaimable flag (allowing other teams to borrow idle quota).
  • TopologyPolicy — routes workers to nodes that share NVLink or InfiniBand, reducing inter-node communication latency on multi-node training. The preferSingleSocket value is the practical default for most GPU node configurations. The preferSingleNUMANode value may cause jobs to wait indefinitely on nodes where GPUs span multiple NUMA domains (a common topology on multi-socket servers with more than four GPUs); verify your node NUMA topology with numactl --hardware before using this value in production.
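
A minimal sketch of the Queue primitive, using the queue name that the PodGroup fragment later in this article refers to. The weight and capability values are assumptions chosen for illustration, not recommendations.

```yaml
# Sketch only: weight and capability values are illustrative assumptions.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a-volcano-queue
spec:
  weight: 2             # relative fair-share weight among queues
  reclaimable: true     # resources held above the fair share can be reclaimed
  capability:
    nvidia.com/gpu: 8   # hard ceiling for this queue
```

If Kueue already owns the quota contract, the capability ceiling here would typically sit at or above the Kueue quota so that the two layers do not disagree about admission.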

How Kueue and Volcano compose

The two systems compose without conflict. Kueue manages quota — whether a job is allowed to consume resources at all. Volcano manages placement — whether all the pods of an admitted job can land simultaneously. The handoff point is the Workload unsuspend event: when Kueue determines quota is available and lifts the suspension, the pods become visible to the scheduler; if the scheduler is Volcano, it then applies PodGroup gang semantics before binding any pod.

Running two independent quota systems simultaneously — one at the Kueue layer and one at the Volcano Queue layer — creates conflicting admission signals and should be avoided. The recommended pattern is: Kueue owns the quota contract with teams; Volcano Queue weights provide fair-share placement ordering within the admitted set.

Layer 3 — the GPU Operator: hardware abstraction

The NVIDIA GPU Operator is a Kubernetes operator that bundles the full software stack a GPU node needs: the driver, NVIDIA Container Toolkit, device plugin, Node Feature Discovery, GPU Feature Discovery, DCGM exporter for metrics, and MIG manager. One Helm release replaces a stack of host-level installs and independent DaemonSets.

What the device plugin does

The NVIDIA device plugin registers nvidia.com/gpu as an extended resource with the kubelet on each node. Once registered, pods can request it like any other resource in a container spec. The count exposed depends entirely on how the operator is configured: a node with 8 physical A100s might advertise 8 units (whole-GPU mode), 64 units (time-slicing with 8× replication per card), or a set of MIG slice units (e.g. 56 units under the 1g.10gb profile across 8 GPUs).
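
The 64-unit figure comes from a device plugin sharing configuration. The following is a minimal sketch in the ConfigMap format described in the GPU Operator time-slicing documentation; the ConfigMap name, the namespace, and the key name any are assumptions chosen for illustration.

```yaml
# Sketch only: 8 replicas per physical GPU, so 8 cards advertise 8 x 8 = 64 units.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config     # hypothetical name
  namespace: gpu-operator       # assumed install namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 8
```

The operator would then be pointed at this ConfigMap through its devicePlugin.config.name and devicePlugin.config.default values. Time-sliced replicas share a card without memory or fault isolation, so the 64 units that Kueue and Volcano count are not equivalent to 64 GPUs.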

DCGM exporter and observability

The bundled DCGM exporter exposes GPU metrics on an endpoint that Prometheus scrapes. The metrics most useful for platform utilisation tracking are DCGM_FI_DEV_GPU_UTIL (SM utilisation percentage), DCGM_FI_DEV_FB_USED (framebuffer memory used), and DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (the ratio of cycles in which the tensor pipe is active, which is the most direct signal for whether a training job is doing useful compute). Without DCGM exporter, GPU utilisation is invisible to the platform; the first sign of a problem is usually a job that runs for far longer than expected.
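
One way to turn these metrics into a signal is an alert rule. The sketch below assumes the Prometheus Operator is installed and already scraping the exporter; the rule name, namespace, and thresholds are hypothetical and would need tuning.

```yaml
# Sketch only: thresholds, names, and namespace are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilisation-alerts   # hypothetical name
  namespace: gpu-operator        # assumed install namespace
spec:
  groups:
    - name: gpu-utilisation
      rules:
        - alert: GpuUnderutilised
          # SM utilisation averaged over 30 minutes stays under 10 percent
          expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: GPU utilisation has stayed below 10 percent
```

As written, the rule also fires for GPUs that are idle because nothing is scheduled on them. Restricting it to allocated GPUs depends on which pod labels your exporter attaches to each series.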

Driver installation strategy

The operator's driver component is appropriate for bare-metal nodes where no driver is pre-installed. On managed Kubernetes distributions that ship a pre-installed GPU driver (a common pattern on managed cloud node pools), driver.enabled=false is the safe configuration: it allows the operator to provide the device plugin, DCGM exporter, and MIG manager without conflicting with the already-loaded driver module. Mixing operator-managed and pre-installed drivers on the same node can leave the node without schedulable GPUs, and in the worst case NotReady, due to module-load conflicts.
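
For the pre-installed driver case, the Helm values might look roughly like this. It is a sketch that assumes only the driver is pre-installed; if the node image also ships the container toolkit, toolkit.enabled would need the same treatment.

```yaml
# Sketch only: Helm values for nodes whose image already ships an NVIDIA driver.
driver:
  enabled: false        # leave the host's pre-installed driver in place
toolkit:
  enabled: true         # assumes the container toolkit is not pre-installed
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
migManager:
  enabled: true
```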

A job flowing through all three layers

Following a PyTorchJob submission from a data science team makes the layer interactions concrete:

  1. The team submits a PyTorchJob to their namespace. The Kueue webhook intercepts the creation and marks the job as suspended, so no worker pods are created yet. Kueue wraps the job in a Workload object and queues it against the team's LocalQueue.
  2. Kueue checks the team's ClusterQueue. If the requested nvidia.com/gpu count fits within the remaining quota (including any borrowable capacity from a cohort), Kueue unsuspends the Workload.
  3. The unsuspended pods are visible to the scheduler. If the cluster uses Volcano as its scheduler (via schedulerName: volcano in the pod spec), Volcano checks the PodGroup associated with the job. The PodGroup's minMember value must be satisfiable — all workers placeable at the same time — before any pod is bound.
  4. Once all pods are placeable, Volcano binds them atomically to GPU nodes. The kubelet on each target node consults the device plugin's resource registry to allocate the exact GPU units (whole cards, MIG slices, or time-sliced replicas) to the container.
  5. The NVIDIA Container Toolkit injects the allocated GPU devices and driver libraries into the container. The training process starts with access to the allocated GPU units: exclusive for whole cards and MIG slices, shared for time-sliced replicas.
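
The hand-off in steps 1 and 3 can be sketched on a plain batch Job; a PyTorchJob carries the same queue label and scheduler name in its own spec. The job name, LocalQueue name, image, and worker count are hypothetical, and the sketch assumes the PodGroup for the job is created separately or by the job operator.

```yaml
# Sketch only: a 4-worker batch Job handed to Kueue, then placed by Volcano.
apiVersion: batch/v1
kind: Job
metadata:
  name: ddp-train                 # hypothetical job name
  namespace: team-a
  labels:
    kueue.x-k8s.io/queue-name: team-a-local   # hypothetical LocalQueue
spec:
  suspend: true                   # Kueue flips this to false on admission
  parallelism: 4
  completions: 4
  completionMode: Indexed
  template:
    spec:
      schedulerName: volcano      # hand placement to Volcano
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/team-a/trainer:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1   # one advertised GPU unit per worker
```

The queue label is what brings the job under Kueue's control, and schedulerName is what routes the resulting pods to Volcano instead of the default scheduler.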

A stack overview (diagram)

The responsibility boundary at each layer:

three-layer-gpu-scheduling-stack.mermaid
flowchart TB
    subgraph L1["Layer 1 — KUEUE: Quota admission"]
        direction LR
        K1["ResourceFlavor"] --> K2["ClusterQueue"]
        K2 --> K3["LocalQueue"]
        K3 --> K4["Workload\n(held Pending until quota\navailable, then unsuspended)"]
    end

    subgraph L2["Layer 2 — VOLCANO: Gang scheduling + placement"]
        direction LR
        V1["PodGroup\n(minMember)"] --> V2["Queue\n(fair-share weights)"]
        V2 --> V3["TopologyPolicy\n(NVLink / InfiniBand affinity)"]
    end

    subgraph L3["Layer 3 — NVIDIA GPU OPERATOR: Hardware abstraction"]
        direction TB
        G1["Device Plugin\n(nvidia.com/gpu count)"] --> G2["Whole-card mode\n8 GPUs → 8 units"]
        G1 --> G3["Time-slicing\n8 GPUs × 8 replicas → 64 units"]
        G1 --> G4["MIG Manager\n(A100/H100 hardware partitions)"]
        G1 --> G5["MPS Server\n(shared CUDA context)"]
        G6["DCGM Exporter"] --> G7["Prometheus\n(utilisation metrics)"]
    end

    L1 -->|"unsuspends job when quota clears"| L2
    L2 -->|"binds pods to GPU nodes atomically"| L3

    style L1 fill:#1a365d,stroke:#4299e1,color:#fff
    style L2 fill:#1a3a1a,stroke:#48bb78,color:#fff
    style L3 fill:#3d1a1a,stroke:#fc8181,color:#fff

Three common failure modes: which layer fails first

Understanding the layer boundaries makes it faster to diagnose production problems. Each failure mode has a clear first-failure layer:

Failure mode 1: jobs queue indefinitely despite available nodes

First-failure layer: Kueue (quota admission). A job visible to the cluster but not progressing is held at the Workload level. The diagnostic is: check the Workload object's status conditions. If the condition is QuotaReserved: False with reason Pending, the ClusterQueue has no available quota. The node availability is irrelevant — Kueue will not release the job until the quota condition is met, regardless of how many GPU nodes are idle.

Common causes: another team's jobs are holding their allocated quota and have not completed; the ClusterQueue's nominalQuota is set too low for the job's resource request; a borrowing cohort is exhausted.
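
For orientation, the relevant part of a held Workload looks roughly like the fragment below. It is a sketch of the shape only, with every other field omitted.

```yaml
# Sketch only: the shape of a Workload held by Kueue (other fields omitted).
status:
  conditions:
    - type: QuotaReserved
      status: "False"
      reason: Pending
      # The accompanying message field usually explains which flavor or quota did not fit.
```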

Failure mode 2: training jobs stall after partial pod placement

First-failure layer: Volcano (gang scheduling) — or its absence. If some pods of a distributed job start and others remain Pending, a partial placement has occurred. This is the starvation deadlock pattern: the running pods are waiting for communication partners that are stuck behind other jobs in the queue. The running pods hold their GPU allocations; the pending pods cannot schedule because the nodes are occupied.

Root cause: the PodGroup is missing or has minMember set below the job's actual replica count. Without a valid PodGroup, Volcano schedules the job as individual pods with no gang guarantee. The fix: verify the PodGroup is created alongside the job and that minMember equals the total worker count.

Failure mode 3: GPU nodes show Ready but no GPUs are schedulable

First-failure layer: GPU Operator (hardware abstraction). If nodes are Ready but kubectl describe node shows zero nvidia.com/gpu capacity, the device plugin has not registered the resource. Common causes: the driver container did not start (check the nvidia-driver DaemonSet pods); the device plugin DaemonSet pod is in CrashLoopBackOff; or an operator driver installation conflicted with a pre-existing driver on the host (common when driver.enabled is left true on a node image that already ships a driver).

The diagnostic path: check the nvidia-operator-validator pod. A validator stuck in Init is the canonical signal that the GPU stack is unhealthy at the hardware abstraction layer. The Kueue and Volcano layers are healthy but have nothing to schedule — their resource counts show zero because the device plugin reports zero.
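
As a point of comparison, this is roughly what the node status of a healthy 8-GPU node in whole-card mode contains. It is a sketch showing only the GPU entries.

```yaml
# Sketch only: what a healthy 8-GPU node reports in whole-card mode.
status:
  capacity:
    nvidia.com/gpu: "8"
  allocatable:
    nvidia.com/gpu: "8"
# On an unhealthy node the nvidia.com/gpu key is absent or "0" in both maps.
```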

Configuration sketch

The following fragments sketch the three-layer wiring. Tune namespaces, resource counts, and GPU SKU labels to your environment.

kueue-cluster-queue.yaml
# Layer 1: Kueue — quota admission
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-queue
spec:
  namespaceSelector: {}
  cohort: shared-gpu-pool
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: gpu-a100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: "8"       # guaranteed share
              borrowingLimit: "4"     # may borrow up to 4 extra
  preemption:
    reclaimWithinCohort: Any          # reclaim guaranteed share if borrowed

volcano-podgroup.yaml
# Layer 2: Volcano — gang scheduling
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pytorchjob-workers
  namespace: team-a
spec:
  minMember: 4            # ALL 4 workers must be placeable before any are bound
  minResources:
    nvidia.com/gpu: "4"
  queue: team-a-volcano-queue
  priorityClassName: gpu-training

gpu-operator-helm-values.yaml
# Layer 3: GPU Operator — hardware abstraction
# Bare-metal / self-managed node example (driver.enabled=true)
driver:
  enabled: true           # set false on nodes with a pre-installed driver
toolkit:
  enabled: true
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
  serviceMonitor:
    enabled: true         # scrape by Prometheus Operator
migManager:
  enabled: true           # A100/H100 MIG partitioning
nodeStatusExporter:
  enabled: true
# Let NFD workers run on tainted GPU nodes so the node labels matched by a
# Kueue ResourceFlavor's nodeLabels get published
node-feature-discovery:
  worker:
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

Quota borrowing and preemption

Kueue's cohort model allows multiple ClusterQueues to share a borrowing pool. A team with a nominalQuota of 8 GPUs can borrow additional capacity from another team's idle allocation, up to a configured borrowingLimit. When the lending team submits a new job and needs its quota back, preemption reclaims it. This is the mechanism that makes a fixed GPU pool feel larger than its physical count — idle quota does not sit unused while other teams' jobs wait.
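
To make the borrowing arithmetic concrete, the sketch below adds a second, hypothetical ClusterQueue to the same cohort as the team-a-queue fragment in the configuration sketch. With team-a at nominalQuota 8 and borrowingLimit 4, and team-b at nominalQuota 8, the cohort holds 16 units; team-a can reach 12 while team-b uses 4 or fewer, and drops back towards 8 once team-b reclaims its share.

```yaml
# Sketch only: a second queue in the same cohort as team-a-queue.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-b-queue              # hypothetical lending team
spec:
  namespaceSelector: {}
  cohort: shared-gpu-pool         # same cohort, so idle quota can be lent
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: gpu-a100
          resources:
            - name: nvidia.com/gpu
              nominalQuota: "8"   # team-b's guaranteed share
  preemption:
    reclaimWithinCohort: Any      # evict borrowers when team-b needs its share back
```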

The Kueue metrics to monitor for this behaviour are kueue_pending_workloads (a rising count indicates queue pressure), kueue_admitted_workloads_total (rate of successful admission), and kueue_cluster_queue_resource_usage compared against kueue_cluster_queue_nominal_quota (current utilisation per ClusterQueue; both are optional metrics that must be enabled in the Kueue configuration).
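
Queue pressure can be turned into an alert along these lines. The sketch assumes the Prometheus Operator is scraping Kueue's metrics endpoint; the rule name, namespace, and the one-hour window are hypothetical.

```yaml
# Sketch only: names, namespace, and the waiting window are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kueue-queue-pressure     # hypothetical name
  namespace: kueue-system        # assumed install namespace
spec:
  groups:
    - name: kueue-queue-pressure
      rules:
        - alert: KueueWorkloadsPendingTooLong
          # At least one workload has been waiting in this ClusterQueue for an hour
          expr: sum by (cluster_queue) (kueue_pending_workloads) > 0
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: Workloads have been pending in a ClusterQueue for over an hour
```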

Alternatives to this stack composition

The three-layer stack (Kueue + Volcano + GPU Operator) is one composition. Others are viable depending on workload mix:

  • Kueue + default scheduler + GPU Operator: appropriate for single-pod or small-scale jobs that do not require gang semantics. Lower operational overhead; loses topology-aware placement and gang guarantees.
  • Kueue coscheduling plugin + GPU Operator: the Kubernetes SIG Scheduling coscheduling plugin provides gang admission as a scheduler plugin rather than a separate scheduler. Less operationally heavy than Volcano; less mature for topology-aware placement.
  • Apache YuniKorn + GPU Operator: an alternative batch scheduler with gang scheduling and capacity management. Broader workload support (Spark, Flink natively), less adoption in the Kubernetes MLOps ecosystem as of 2026.
  • Standalone device plugin DaemonSet (no GPU Operator): possible but requires managing the driver, container toolkit, and DCGM exporter separately. Increases operational surface at the hardware abstraction layer.

What comes next: the GPU-sharing decision tree

The stack described above treats each nvidia.com/gpu unit as a whole-card allocation by default. Whether to partition those cards — and which mechanism to use (time-slicing, MPS, MIG, or fractional GPU virtualization) — is a separate decision that belongs at the GPU Operator layer. That decision has a significant effect on the isolation and latency jitter profile seen by jobs. The next article in this series, the GPU-sharing decision tree (Article 27), provides a structured framework for that choice — starting from workload type (training vs inference) and hardware generation (MIG-capable vs not).

References

  1. "Introducing Kueue." Kubernetes Blog, SIG Scheduling, October 2022. https://kubernetes.io/blog/2022/10/04/introducing-kueue/
  2. Kueue Concepts (ResourceFlavor, ClusterQueue, LocalQueue, Workload). Kueue project documentation, kubernetes-sigs/kueue. https://kueue.sigs.k8s.io/docs/concepts/
  3. Kueue Workload concept. Kueue project documentation. https://kueue.sigs.k8s.io/docs/concepts/workload/
  4. Kueue ClusterQueue concept (admission, placement boundary). Kueue project documentation. https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/
  5. "Cloud Native Batch System Volcano moves to the CNCF Incubator." CNCF Blog, April 2022. https://www.cncf.io/blog/2022/04/07/cloud-native-batch-system-volcano-moves-to-the-cncf-incubator/
  6. Volcano project page. CNCF (Cloud Native Computing Foundation). https://www.cncf.io/projects/volcano/
  7. Volcano documentation (PodGroup, Queue, gang scheduling mechanics). Volcano project. https://volcano.sh/en/docs/
  8. NVIDIA GPU Operator documentation (components: driver, toolkit, device plugin, DCGM exporter, MIG manager, NFD, GFD). NVIDIA Corporation. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
  9. NVIDIA Kubernetes Device Plugin (nvidia.com/gpu extended resource registration). NVIDIA Corporation, GitHub. https://github.com/NVIDIA/k8s-device-plugin
  10. DCGM Exporter documentation (GPU metrics for Prometheus: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_PROF_PIPE_TENSOR_ACTIVE). NVIDIA Corporation. https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html

Tags

#gpu #kueue #volcano #scheduling #series:ai-platform-mlops #series-order/25

About the Author

asleekgeek

Senior Developer, Architect, DevOps

Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist
