Training workloads on Kubernetes — operators, gang scheduling, and checkpointing

9 min read · asleekgeek
Figure: Gang-scheduled training pods across a multi-node GPU cluster

Distributed model training is the most infrastructure-intensive workload in the AI platform stack. A single multi-node job can hold a large block of GPUs for days, fail silently if one pod loses its network neighbour, and leave the cluster in a resource deadlock if the scheduler does not understand collective-communication semantics. Getting this right is an infrastructure problem first: the framework (PyTorch, JAX, MXNet) is largely irrelevant until the cluster can schedule, network, and checkpoint correctly.

This article covers the three dominant Kubernetes CRDs for training jobs, why gang scheduling is not a performance optimisation but a correctness guarantee, how to choose a checkpoint cadence from first principles, and what the data path looks like when you scale beyond a single node.

Three CRDs, three tradeoffs

The Kubeflow Training Operator provides two of the three dominant CRDs; KubeRay provides the third. Each reflects a different distributed-training paradigm.

PyTorchJob

PyTorchJob is the default CRD for PyTorch distributed training on Kubernetes, maintained by the Kubeflow Training Operator. It manages a set of pods, one per rank, and injects the rendezvous environment variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) that PyTorch DDP and FSDP consume at startup. Choose PyTorchJob when your codebase already uses the standard PyTorch distributed runtime: it is the most direct path from a local multi-GPU script to a multi-node cluster job, and the Kubeflow training project has joined the official PyTorch ecosystem.
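As a rough illustration of what the training script sees at startup, the sketch below parses the rendezvous variables from the environment and sanity-checks them. The `Rendezvous` dataclass, the `read_rendezvous_env` helper, and the fallback port of 29500 are hypothetical names and values chosen for this example, not part of the operator.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendezvous:
    master_addr: str
    master_port: int
    world_size: int
    rank: int


def read_rendezvous_env(env=None, default_port=29500):
    """Parse the rendezvous variables a PyTorchJob pod is expected to carry.

    default_port is an arbitrary fallback for this sketch (an assumption).
    """
    env = os.environ if env is None else env
    missing = [k for k in ("MASTER_ADDR", "WORLD_SIZE", "RANK") if k not in env]
    if missing:
        raise RuntimeError(f"missing rendezvous variables: {missing}")
    rdzv = Rendezvous(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env.get("MASTER_PORT", default_port)),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
    )
    # A rank outside [0, world_size) can never complete the rendezvous.
    if not 0 <= rdzv.rank < rdzv.world_size:
        raise RuntimeError(f"rank {rdzv.rank} outside world of {rdzv.world_size}")
    return rdzv
```

In a real job this parsing is usually unnecessary: `torch.distributed.init_process_group(backend="nccl", init_method="env://")` reads the same variables itself. A check like this is mainly useful for failing fast with a readable error when a pod is launched outside the operator.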

MPIJob

MPIJob wraps a standard MPI launcher/worker topology. It is the right choice when your training code relies on MPI collectives directly — common in legacy HPC code ported to deep learning, or in frameworks such as Horovod that sit on top of MPI. The operator provisions an MPI launcher pod alongside worker pods and handles SSH key distribution, hostfile generation, and cleanup. If your team is already using Horovod and does not want to rewrite communication primitives, MPIJob avoids a framework migration.

RayJob

RayJob, provided by the KubeRay operator, uses Ray's actor model rather than MPI or PyTorch collective primitives. Choose RayJob when you need heterogeneous task graphs (for example, a preprocessing stage on CPU workers feeding a training stage on GPU workers) or when your team is already invested in the Ray ecosystem for data processing and wants a single scheduling primitive across the pipeline. Ray's elastic training capabilities (it can continue with fewer workers if one fails) also make it a natural fit for spot-instance pools. MPIJob treats loss of a rank as fatal, as does PyTorchJob unless the job is launched in elastic mode through its elasticPolicy.

Gang scheduling: a correctness requirement, not a performance hint

Distributed training with NCCL, Gloo, or MPI requires all ranks to rendezvous before any collective operation (AllReduce, AllGather, ReduceScatter) can proceed. Without gang scheduling — the guarantee that all pods in a job are scheduled simultaneously — two classes of failure emerge, both confirmed in Kubernetes upstream documentation and batch-scheduling design notes.

Wasted compute: seven of a job's eight pods are scheduled and begin holding GPUs, but the rendezvous blocks waiting for the eighth rank. The job burns GPU-hours and produces nothing until it times out.

Resource deadlock: two jobs each requiring 4 GPUs compete for a 6-GPU pool. Job A gets 3 GPUs and waits; Job B gets 3 GPUs and waits. Neither can proceed, and neither will release what it holds. The cluster stalls until an operator intervenes.

Gang scheduling solves both by treating the N pods of a job as an atomic unit: all-or-nothing admission. The Volcano scheduler implements this via its PodGroup abstraction, where minMember declares the quorum required before any pod is bound to a node. Kueue (the upstream Kubernetes batch-scheduling project) enforces the same guarantee at the queue level. The Kubeflow Training Operator integrates with both schedulers. When evaluating scheduler options, confirm that your chosen batch scheduler exposes a min-member or equivalent all-or-nothing admission guarantee before deploying multi-node training.
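The deadlock example above can be reproduced with a toy model. The sketch below is not how Volcano or Kueue are implemented; it assumes a simplified scheduler that binds one pod at a time, round-robin across jobs, and compares it with all-or-nothing admission. All function names are hypothetical.

```python
def greedy_admit(free_gpus, jobs):
    """Bind pods one at a time, round-robin across jobs, with no gang awareness.

    Returns job name -> GPUs held. A job never releases a partial allocation.
    """
    held = {name: 0 for name in jobs}
    progress = True
    while free_gpus > 0 and progress:
        progress = False
        for name, need in jobs.items():
            if free_gpus > 0 and held[name] < need:
                held[name] += 1
                free_gpus -= 1
                progress = True
    return held


def gang_admit(free_gpus, jobs):
    """Admit a job only if its whole request fits (all-or-nothing)."""
    held = {name: 0 for name in jobs}
    for name, need in jobs.items():
        if need <= free_gpus:
            held[name] = need
            free_gpus -= need
    return held


def runnable(held, jobs):
    """Jobs whose full gang is bound and can therefore pass rendezvous."""
    return [name for name, need in jobs.items() if held[name] == need]


jobs = {"job-a": 4, "job-b": 4}  # two 4-GPU jobs, as in the example above
greedy = greedy_admit(6, jobs)
gang = gang_admit(6, jobs)
print("greedy:", greedy, "runnable:", runnable(greedy, jobs))
print("gang:  ", gang, "runnable:", runnable(gang, jobs))
```

With six free GPUs, the greedy policy leaves each job holding three GPUs and nothing runnable, while the gang policy admits one job in full and keeps the other queued with two GPUs still free for smaller work.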

Gang-schedule lifecycle

gang-schedule-lifecycle.mermaid
sequenceDiagram
    participant U as User (kubectl apply)
    participant S as Batch Scheduler
    participant PG as PodGroup
    participant N1 as Node A (GPU)
    participant N2 as Node B (GPU)

    U->>S: Submit PyTorchJob (replicas=8)
    S->>PG: Create PodGroup (minMember=8)
    S->>S: Accumulate resources — wait for 8×GPU headroom
    Note over S: All-or-nothing admission gate
    S->>N1: Bind pods 0-3
    S->>N2: Bind pods 4-7
    N1-->>PG: Ranks 0-3 Running
    N2-->>PG: Ranks 4-7 Running
    PG-->>S: PodGroup Ready (minMember met)
    N1->>N2: NCCL rendezvous (all 8 ranks)
    Note over N1,N2: Training begins

Checkpoint cadence: deriving the interval from first principles

Checkpointing too often wastes I/O bandwidth and stalls forward progress; checkpointing too rarely means long recompute windows when a node fails. The optimal interval was formalised by Young (1974) and later refined by Daly (2006), and is now standard in HPC fault-tolerance literature.

Young's first-order approximation gives the optimal checkpoint interval as T_opt ≈ √(2 × C × μ), where C is the time to write one checkpoint and μ is the mean time between failures (MTBF) of the job. Daly's higher-order refinement is T_opt ≈ √(2μC) × [1 + (1/3)√(C/(2μ)) + (1/9)(C/(2μ))] − C for C < 2μ, and T_opt = μ otherwise; the correction matters when the checkpoint write time is no longer small relative to the MTBF. In Daly's model the restart time R scales the expected total run time but does not move the optimum. A 2022 survey of both formulas (Benoit et al.) is available for practitioners needing implementation detail.
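For readers who want to see where the square root comes from, here is a sketch of the first-order argument. It assumes failures are rare relative to the interval T and that a failure loses, on average, half an interval of work; the symbol W for the fraction of time wasted is introduced here for illustration.

```latex
% Fraction of wall-clock time lost with checkpoint interval T:
%   C/T         time spent writing checkpoints
%   T/(2\mu)    expected rework: half an interval lost once per MTBF
W(T) \approx \frac{C}{T} + \frac{T}{2\mu}

% Minimise over T:
\frac{dW}{dT} = -\frac{C}{T^{2}} + \frac{1}{2\mu} = 0
\quad\Longrightarrow\quad
T_{\mathrm{opt}} = \sqrt{2 C \mu}

% Overhead at the optimum (the two terms are equal there):
W(T_{\mathrm{opt}}) = 2\sqrt{\frac{C}{2\mu}} = \sqrt{\frac{2C}{\mu}}
```

The two terms pull in opposite directions, which is why the optimum sits at the geometric middle: halving C shrinks the optimal interval by a factor of √2 rather than by half.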

Translating this to a concrete cluster: suppose checkpointing a 7B-parameter model takes 3 minutes to an NFS PVC (C = 3 min), and the job's historical MTBF is 8 hours (μ = 480 min). Note that μ is the MTBF of the whole job, not of a single node: with N nodes failing independently, it is roughly the per-node MTBF divided by N. Young's formula gives T_opt ≈ √(2 × 3 × 480) ≈ 54 minutes, approximately one checkpoint per hour. On a faster object-store path (C = 30 seconds = 0.5 min), the interval tightens to T_opt ≈ √(2 × 0.5 × 480) ≈ 22 minutes. If no one can give you an MTBF figure, use your ticketing system's last 30-day P0 node-failure rate as a proxy.
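The worked numbers above are easy to reproduce. The sketch below implements Young's formula and a first-order estimate of the time lost at a given interval (C/T spent writing checkpoints plus T/(2μ) of expected rework). The function names are hypothetical, and the overhead estimate holds only while the interval is short compared with the MTBF.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order optimum: sqrt(2 * C * mu). Units must match."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_overhead(interval, checkpoint_cost, mtbf):
    """Approximate fraction of time lost: C/T to writes plus T/(2*mu) to rework."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


MTBF_MIN = 480.0  # 8 hours, as in the example above
for label, cost in (("NFS PVC", 3.0), ("object store", 0.5)):
    t_opt = young_interval(cost, MTBF_MIN)
    lost = first_order_overhead(t_opt, cost, MTBF_MIN)
    print(f"{label}: T_opt = {t_opt:.1f} min, estimated overhead = {lost:.1%}")
```

Under these assumptions the NFS path loses roughly 11% of wall-clock time to checkpointing and rework, and the object-store path under 5%, which is the quantitative case for reducing C before tuning anything else.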

Two practical notes fall out of the formula. First, object storage is almost always the better checkpoint target than a PVC: lower C means more frequent checkpoints with less overhead, and the checkpoint is decoupled from the lifecycle of the pods that wrote it. Second, if your job uses asynchronous checkpointing (writing the checkpoint in a background thread while training continues), C in the formula shrinks toward the serialisation latency rather than the full write latency — a meaningful improvement for large models.
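To make the asynchronous case concrete, the sketch below snapshots state on the training thread and hands the write to a background thread. It is a minimal illustration using only the standard library: `copy.deepcopy` stands in for the device-to-host copy and `pickle` for the framework's serialiser, and the `AsyncCheckpointer` class is hypothetical.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointer:
    """Snapshot state on the training thread, write it on a background thread."""

    def __init__(self, directory):
        self.directory = directory
        self._writer = None
        os.makedirs(directory, exist_ok=True)

    def save(self, state, step):
        # Block only if the previous write is still in flight, so at most
        # one checkpoint is being written at any time.
        self.wait()
        snapshot = copy.deepcopy(state)  # the only part that stalls training
        path = os.path.join(self.directory, f"step-{step:08d}.pkl")
        self._writer = threading.Thread(target=self._write, args=(snapshot, path))
        self._writer.start()
        return path

    @staticmethod
    def _write(snapshot, path):
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, path)  # atomic rename: readers never see a partial file

    def wait(self):
        """Join the in-flight write, if any. Call before exiting the job."""
        if self._writer is not None:
            self._writer.join()
            self._writer = None
```

The training loop pays for the in-memory copy, not the storage write, so the value of C that belongs in the formula is the copy time. The trade-off is extra host memory for the snapshot and the need to call `wait()` before the process exits.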

Multi-node networking: NCCL and the fabric hierarchy

NCCL implements the collective primitives (AllReduce, AllGather, Broadcast, Reduce, ReduceScatter) used by PyTorch DDP, FSDP, and DeepSpeed. Its performance is strongly topology-dependent: intra-node communication over NVLink is substantially faster than inter-node communication over InfiniBand or RoCE, which in turn outperforms TCP/IP by a significant margin at the bandwidths required for gradient synchronisation. Network topology therefore determines whether your multi-node scaling is communication-bound or compute-bound.

On a Kubernetes cluster, NCCL auto-detects its transport, and environment variables such as NCCL_IB_DISABLE and NCCL_SOCKET_IFNAME override that choice. The NVIDIA technical blog on scaling with NCCL documents GPUDirect RDMA achieving approximately 11 GB/s with InfiniBand EDR or RoCE at 100 GbE. Meta's SIGCOMM 2024 paper on RoCE at scale documents the operational choices required to make Ethernet-based RDMA reliable at training-cluster scale: explicit congestion notification (ECN), priority flow control (PFC), and careful buffer tuning at the ToR switch layer.

For a platform engineer who does not own the fabric, the actionable checklist is: (1) confirm that worker pods have access to the RDMA device (via a CNI plugin or device plugin that exposes the IB/RoCE interface); (2) set NCCL transport environment variables explicitly rather than relying on auto-detection; (3) provision pods on nodes within the same rack or spine where possible, using affinity rules or a topology-aware scheduler plugin — cross-rack AllReduce at TCP speeds makes multi-node training impractical for most model sizes.
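Item (2) of the checklist can be sketched as a small helper that builds the NCCL settings explicitly. `NCCL_IB_HCA` and `NCCL_DEBUG` are additional NCCL variables not mentioned above; the helper names and the interface and device names in the usage note (`eth0`, `mlx5_0`) are placeholders that depend on your nodes.

```python
import os


def nccl_env(use_rdma, socket_ifname, ib_hca=None, debug=False):
    """Build an explicit NCCL transport configuration as environment variables."""
    env = {
        # "1" forces NCCL off the IB/RoCE verbs transport and onto sockets.
        "NCCL_IB_DISABLE": "0" if use_rdma else "1",
        # Interface used for bootstrap and for socket-based traffic.
        "NCCL_SOCKET_IFNAME": socket_ifname,
    }
    if use_rdma and ib_hca:
        env["NCCL_IB_HCA"] = ib_hca  # restrict NCCL to the named RDMA device(s)
    if debug:
        env["NCCL_DEBUG"] = "INFO"  # logs which transport NCCL actually selected
    return env


def apply_env(env):
    """Export the settings; must run before the process group is initialised."""
    os.environ.update(env)
```

For example, `apply_env(nccl_env(True, "eth0", ib_hca="mlx5_0", debug=True))` at the top of the training script, or the same key-value pairs in the pod spec's `env` list. Running one job with `NCCL_DEBUG=INFO` is a cheap way to confirm that the transport in use is the one you intended.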

Data path: object store vs PVC

Training data access falls into two patterns, and the choice affects both job startup time and steady-state throughput.

Object store (S3-compatible): datasets live in object storage and are streamed or prefetched by the training process. Startup latency is low (no bulk copy before training begins), the dataset is not duplicated per-node, and the path scales naturally. The tradeoff is that random-access patterns perform poorly: tokenised datasets packed into formats that support sequential reads (e.g. WebDataset tar shards or Parquet files) are better suited to this path than unstructured file trees.

PVC (ReadWriteMany): the dataset is pre-staged onto a shared volume. Every worker reads from the same namespace, which is convenient for heterogeneous file layouts. The constraint is that ReadWriteMany PVCs backed by NFS or similar network filesystems introduce contention at high worker counts — a 64-worker job all opening the same file tree simultaneously will saturate the NFS server's metadata path before it saturates the GPUs. For large-scale jobs, a caching layer (e.g. a distributed cache sidecar per node that prefetches from object storage) is the standard mitigation.

A practical heuristic: use object storage for datasets larger than 100 GB and for jobs with more than 8 workers. Use a PVC for smaller datasets where the convenience of a POSIX file tree outweighs the scalability ceiling, or when your data pipeline tooling requires filesystem semantics.

Resumability and the operator's lifecycle contract

A training job is resumable if it can restart from a checkpoint rather than from epoch zero when a node fails or is preempted. Resumability requires two things to be true simultaneously: the training code must write checkpoints (model weights, optimiser state, RNG state, step counter) to durable storage, and the operator must be configured to restart the job rather than mark it failed.

In PyTorchJob, the restartPolicy field on the replica spec controls this. Setting it to OnFailure (pod-level) or ExitCode (distinguishing preemption exit codes from training errors) determines whether the operator re-queues the failed pods or transitions the job to a terminal Failed state. Integrating this with the gang scheduler adds a subtlety: the restart must re-acquire the full gang, not just the failed rank — confirm that your scheduler plugin handles re-admission of partially-running jobs.
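The ExitCode policy relies on a convention for interpreting the container's exit status. The sketch below assumes the split described in the Training Operator documentation (1 to 127 treated as a permanent error, 128 to 255 as retryable) and the usual shell convention that a process killed by signal N exits with 128 + N. Both helper names are hypothetical; verify the ranges against the operator version you run.

```python
import signal


def classify_exit(code):
    """Classify a container exit code under the assumed ExitCode convention."""
    if code == 0:
        return "succeeded"
    if 1 <= code <= 127:
        return "permanent"  # training error: restarting would fail again
    if 128 <= code <= 255:
        return "retryable"  # typically killed by a signal, e.g. preemption
    raise ValueError(f"exit code out of range: {code}")


def exit_code_for_signal(signum):
    """Conventional exit code for a process terminated by a signal."""
    return 128 + int(signum)


# A pod evicted with SIGTERM exits with 143 and should be re-queued.
print(classify_exit(exit_code_for_signal(signal.SIGTERM)))
# An uncaught Python exception exits with 1 and should not be retried.
print(classify_exit(1))
```

The practical consequence for training code is to exit with a low, non-zero code on genuine errors such as a bad config or NaN loss, so that the operator does not restart a job that is guaranteed to fail again.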

Checkpoint format matters for cross-run compatibility. Saving only the model weights (e.g. with torch.save(model.state_dict(), path)) is insufficient for resumption: the optimiser state (Adam momentum and variance estimates) is roughly twice the size of the weights at the same precision and must be saved alongside them. For FSDP jobs, each rank saves its own shard; the checkpoint directory must accommodate at least N files for an N-rank job. APIs such as PyTorch Distributed Checkpoint (torch.distributed.checkpoint) provide a unified sharded-save interface that avoids the rank-0-only bottleneck of naive torch.save.
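A framework-free sketch of what a resumable checkpoint has to contain is shown below. Plain Python objects and the standard `random` module stand in for the model, the optimiser, and the framework's RNG; the function names are hypothetical. In a PyTorch job the corresponding pieces would come from `model.state_dict()`, `optimizer.state_dict()`, and `torch.get_rng_state()`.

```python
import pickle
import random


def save_training_state(path, weights, optimizer_state, step):
    """Persist everything needed to continue the run, not just the weights."""
    state = {
        "weights": weights,
        "optimizer": optimizer_state,
        "rng": random.getstate(),  # so shuffling and dropout replay identically
        "step": step,
    }
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_training_state(path):
    """Restore the saved state and reinstate the RNG as a side effect."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    random.setstate(state["rng"])
    return state["weights"], state["optimizer"], state["step"]
```

A quick way to test resumability is to save, draw a few random numbers, reload, and confirm that the same numbers come out again. If they differ, the resumed run will see a different data order from the one it would have seen without the interruption.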

Common failure modes

NCCL timeout without gang scheduling. Workers that start without their full peer set will hang at the rendezvous barrier. The symptom is a job that appears Running in kubectl but whose GPU utilisation reads near zero on every node — compute has not started because collective communication has not initialised.

Data-path starvation. GPU utilisation oscillates between 0% (waiting for a batch) and 100% (computing) rather than running at a sustained high level. The diagnostic is to check DataLoader worker count and prefetch depth: if the GPU is faster than the data pipeline, increasing num_workers and prefetch_factor is often the fastest fix. On an object-store data path, the bandwidth ceiling of the node's network interface may be the root cause rather than CPU or DataLoader concurrency.

Checkpoint I/O stalls. If the checkpoint write path is synchronous and contends for the same NFS or object-store bandwidth as the data path, the training loop pauses while the checkpoint flushes. The symptom is periodic GPU idle spikes at regular intervals matching your checkpoint cadence. Async checkpointing or a dedicated checkpoint storage tier (separate from training-data storage) eliminates this contention.

Rank-0-only checkpoint bottleneck. In naive implementations, rank 0 writes the entire checkpoint while the other ranks idle; with sharded models it must first gather every parameter. For models above a few billion parameters, this serialised write inflates C, and by Young's formula a larger C pushes the optimal interval out, effectively flooring checkpoint frequency. Sharded checkpoint APIs that distribute the write across all ranks avoid this bottleneck at the cost of a more complex checkpoint directory structure.

What the cluster sees

A platform engineer supporting training workloads without running them should be able to classify an incoming job by its observable cluster signature:

  • Long-running pods (hours to days) on GPU nodes, not cycling on a request-response cadence.
  • A matching PodGroup (Volcano) or Workload (Kueue) admission object alongside the pods.
  • High inter-node bandwidth consumption (AllReduce traffic) during the training phase.
  • Periodic write spikes to object storage or a PVC at intervals matching the checkpoint cadence.
  • A PyTorchJob, MPIJob, or RayJob CRD in the namespace, not a bare Deployment.

If you see long-running GPU pods without a batch CRD and without periodic checkpoint writes, you are likely looking at an inference workload or an interactive session — not a training job. The checkpoint cadence is the clearest distinguishing signal.

References

1. Young, J.W. "A first order approximation to the optimum checkpoint interval." Communications of the ACM, Vol. 17, No. 9, pp. 530–531, 1974.

2. Daly, J.T. "A higher order estimate of the optimum checkpoint interval for restart dumps." Future Generation Computer Systems, Vol. 22, Issue 3, pp. 303–312, 2006.

3. Benoit et al. "Checkpointing à la Young/Daly: An Overview." Proceedings of the 14th International Conference on Contemporary Computing (IC3), 2022.

4. NVIDIA NCCL GitHub repository. NVIDIA Corporation, 2024.

5. NVIDIA Developer Blog. "Scaling Deep Learning Training with NCCL." NVIDIA Corporation.

6. Meta / Stanford. "RDMA over Ethernet for Distributed AI Training at Meta Scale." ACM SIGCOMM, 2024.

7. Kubernetes Documentation. "Gang Scheduling." The Linux Foundation, 2024.

8. Volcano Project Documentation. "PodGroup." CNCF Volcano, 2024.

9. Kubeflow Project. "Job Scheduling." Kubeflow Training Operator Documentation, 2024.

10. Ray Project. "Ray on Kubernetes (KubeRay)." Anyscale / Ray Documentation, 2024.

11. PyTorch Blog. "PyTorch on Kubernetes: Kubeflow Trainer Joins the PyTorch Ecosystem." Meta AI / PyTorch Foundation, 2024.


About the Author

asleekgeek

Senior Developer, Architect, DevOps

Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist
