The five canonical AI/ML workload shapes

Five canonical AI/ML workload shapes and their scheduling primitives
A platform engineer is regularly asked to support a workload they have not personally run — "the data team wants to fine-tune a 13B model; what do they need from us?" Without a taxonomy, the answer is bespoke every time. With one, you can place a new workload into a known shape and answer from the playbook for that shape.
This article defines five canonical AI/ML workload shapes. For each shape, it covers five facets: resource profile, latency budget, lifecycle, primary failure modes, and scheduling fit on Kubernetes. The article closes with a decision table and a five-question classification heuristic. The next three articles in this series apply the playbook to training, inference, and agent workloads respectively.
The five canonical shapes
Before the details, here is a short orientation.
- Training — large, bursty, GPU-heavy, fault-tolerant via checkpointing. Multi-node is common; runtimes run from hours to days.
- Fine-tuning — training-shaped but smaller and more frequent. LoRA / QLoRA on a single GPU is the common path; full fine-tunes need multi-GPU.
- Batch inference — throughput-bound, async over object storage, schedulable, can run in off-hours.
- Online inference — latency-bound, 24×7, autoscaling on traffic or queue depth, GPU-sharing for cost.
- Agent / LLM — latency-bound but with retrieval and tool-use spikes. Multi-step, harder to capacity-plan than single-call inference.
The one-sentence version: training is big and slow and resumable; fine-tuning is training but smaller and more frequent; batch inference is throughput on a schedule; online inference is latency 24×7; agent workloads are online inference with extra steps, and the extra steps change almost every operational assumption.
Training
A from-scratch or large-scale training run. Production examples: pre-training an in-house foundation model, training a vision model on a large image corpus, training a recommender on a multi-billion-row event table.
Resource profile
GPU-heavy (8–64+ GPUs is typical for serious training runs), with multi-node jobs being common at scale. CPU and RAM requirements are moderate relative to GPU. Storage reads are large (datasets from object store or a parallel file system); checkpoint writes are large and must be durable. For multi-node runs, the inter-node fabric is a first-class resource: with GPUDirect RDMA over InfiniBand or RoCE, NCCL can run all-reduce operations at roughly 11 GB/s per InfiniBand EDR or 100GbE adapter [1]. Without RDMA, NCCL falls back to TCP sockets and bandwidth is capped at roughly half the PCIe bus bandwidth, which on large clusters can stretch a two-week training run into months.
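A back-of-envelope estimate shows why fabric bandwidth matters so much. The sketch below assumes a ring all-reduce, in which each rank transfers about 2(N-1)/N times the gradient payload, and it ignores latency, compute overlap, and hierarchical topologies. The 7B-parameter model, fp16 gradients, 16 ranks, and the 1.2 GB/s comparison link (roughly 10GbE line rate) are illustrative assumptions, not figures from the cited source.

```python
def ring_allreduce_seconds(param_count: int, bytes_per_param: int,
                           num_ranks: int, bandwidth_gb_per_s: float) -> float:
    """Estimate the time for one ring all-reduce of the full gradient buffer."""
    if num_ranks < 2:
        return 0.0  # a single rank has nothing to exchange
    payload_bytes = param_count * bytes_per_param
    # Ring all-reduce: each rank sends and receives 2 * (N - 1) / N of the payload.
    traffic_bytes = 2 * (num_ranks - 1) / num_ranks * payload_bytes
    return traffic_bytes / (bandwidth_gb_per_s * 1e9)


# Illustrative inputs (assumptions): 7B parameters, fp16 gradients, 16 ranks.
fast = ring_allreduce_seconds(7_000_000_000, 2, 16, 11.0)
slow = ring_allreduce_seconds(7_000_000_000, 2, 16, 1.2)
print(f"RDMA-class link: {fast:.2f} s per all-reduce")
print(f"slower link:     {slow:.2f} s per all-reduce")
print(f"slowdown factor: {slow / fast:.1f}x")
```

Because this cost is paid on every optimiser step, even a modest per-step difference multiplies across the whole run.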
Latency budget
None on individual steps. Wall-clock matters for iteration speed — a team that finishes an experiment in 12 hours can run twice as many experiments as one that takes 24. There is no user-facing SLA.
Lifecycle
Submit job → gang-scheduled (all N pods allocated or none) → all pods ready → training begins. Execution is resumable from checkpoint; multi-hour to multi-day runtimes are normal. Completion produces a model artifact and metrics written to a model registry.
Failure modes
Single-node hardware failure on a 16-node job halts the entire run; OOM mid-epoch (gradient accumulation or batch size misconfigured); data path saturation (the storage tier cannot feed the GPUs); NCCL communication hangs (mismatched collective timeout, topology mismatch, or RoCE misconfiguration); preemption by a higher-priority workload. Resumability from a recent checkpoint is what makes any of this tractable — without it, every failure restarts from epoch zero.
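Resumability can be sketched in a few lines. The example below is a minimal illustration, not a training framework: the "training step" is a stand-in calculation, the checkpoint is a small JSON file, and the function names and `fail_at` parameter are hypothetical. A real job would checkpoint model and optimiser state to durable storage, but the shape is the same: write atomically, then resume from the last saved step.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, checkpoint_every: int,
          fail_at: Optional[int] = None) -> dict:
    """Run (or resume) a toy training loop; fail_at simulates a node failure."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated node failure at step {step}")
        # Stand-in for a real forward/backward/optimiser step.
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

With `checkpoint_every=5`, a failure at step 7 loses two steps of work rather than seven; the checkpoint interval is the knob that trades write cost against lost work.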
Scheduling fit
Gang scheduling is a correctness requirement for synchronous distributed training: the job needs all N pods simultaneously. If only some pods launch, they hold their GPUs idle while waiting for peers, and two partially placed jobs can deadlock each other [2]. The Kubernetes gang scheduling concept (a PodGroup plus a min-available count) enforces all-or-nothing semantics; implementations include Volcano and the upstream gang scheduling work tracked in KEP-4671 [7]. Quota-aware queuing (e.g. Kueue) adds fairness across tenants and supports cohort borrowing when one team's quota sits idle. GPU node selectors pin the job to the correct SKU.
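The deadlock is easy to reproduce in a toy model. The sketch below is not how any real scheduler is implemented; it assumes one GPU per pod and compares pod-by-pod placement with all-or-nothing admission. The `Job` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    pods: int  # pods required; one GPU per pod is an assumption of this sketch


def admit_pod_by_pod(jobs: list, free_gpus: int) -> dict:
    """Place one pod at a time, round-robin, with no gang awareness."""
    placed = {job.name: 0 for job in jobs}
    progress = True
    while free_gpus > 0 and progress:
        progress = False
        for job in jobs:
            if free_gpus > 0 and placed[job.name] < job.pods:
                placed[job.name] += 1
                free_gpus -= 1
                progress = True
    return placed


def admit_gang(jobs: list, free_gpus: int) -> dict:
    """Admit a job only if every one of its pods fits; otherwise place none."""
    placed = {}
    for job in jobs:
        if job.pods <= free_gpus:
            placed[job.name] = job.pods
            free_gpus -= job.pods
        else:
            placed[job.name] = 0
    return placed


def runnable(jobs: list, placed: dict) -> list:
    """A synchronous job can start only when all of its pods are placed."""
    return [job.name for job in jobs if placed[job.name] == job.pods]


jobs = [Job("a", 8), Job("b", 8)]
naive = admit_pod_by_pod(jobs, free_gpus=12)
gang = admit_gang(jobs, free_gpus=12)
print("pod-by-pod:", naive, "runnable:", runnable(jobs, naive))
print("gang:      ", gang, "runnable:", runnable(jobs, gang))
```

With 12 GPUs and two 8-pod jobs, pod-by-pod placement leaves both jobs partially placed and neither able to start, while gang admission runs one job and leaves the other queued.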
Fine-tuning
Fine-tuning is training-shaped but smaller. Production examples: per-customer LoRA adapter on a base LLM, retraining a classifier on the most recent week of data, domain-adapting an embedding model after corpus drift.
Resource profile
Often single-GPU. Parameter-efficient methods like LoRA and QLoRA fit a fine-tune of a 7B-parameter model onto a single 24–80 GB GPU, because the trainable adapter is a small fraction of the base model's parameter count. Full fine-tunes of larger models (13B+) need multi-GPU. The memory budget is still tight on a single device because the frozen base model, the adapter weights, the adapter's gradients and optimiser state, and the activations all compete for VRAM simultaneously.
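The single-GPU claim can be checked with rough arithmetic. The sketch below uses the common mixed-precision Adam rule of thumb of 16 bytes per trainable parameter and excludes activations, so treat the results as lower bounds. The 1% adapter fraction is an illustrative assumption, and the function names are hypothetical.

```python
def weights_gb(param_count: float, bits_per_param: int) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return param_count * bits_per_param / 8 / 1e9


def trainable_state_gb(param_count: float) -> float:
    """Rule of thumb for mixed-precision Adam: fp16 weights (2 bytes)
    + fp16 gradients (2) + fp32 master weights (4) + two fp32 moments (8)
    = 16 bytes per trainable parameter. Activations are excluded."""
    return param_count * 16 / 1e9


def adapter_finetune_gb(param_count: float, base_bits: int,
                        adapter_fraction: float) -> float:
    """Frozen base weights plus full training state for the adapter only."""
    return weights_gb(param_count, base_bits) + trainable_state_gb(
        param_count * adapter_fraction)


params = 7e9
print(f"full fine-tune:      {trainable_state_gb(params):.1f} GB")
print(f"LoRA (fp16 base):    {adapter_finetune_gb(params, 16, 0.01):.1f} GB")
print(f"QLoRA (4-bit base):  {adapter_finetune_gb(params, 4, 0.01):.1f} GB")
```

Under these assumptions the full fine-tune of a 7B model exceeds any single 80 GB GPU before activations are counted, while the adapter-based variants leave room on a 24 GB card.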
Latency budget
None. Cadence matters — fine-tunes are often triggered by a drift signal or on a schedule (daily, weekly). Run time is typically minutes to a few hours, which makes retries cheap relative to training.
Lifecycle
Triggered (drift signal or schedule) → executed (minutes to hours) → evaluated against a held-out validation set → registered in the model registry → promoted via the registry lifecycle states (staging → champion).
Failure modes
Proportionally less painful than training failures because runs are shorter. The most common operational failure is reusing hyperparameters that worked when training the base model: they often break the fine-tune, and learning-rate sensitivity is much higher with LoRA than with a full fine-tune. Gradient explosion on a poorly configured adapter is also common.
Scheduling fit
Gang scheduling is not required for single-GPU fine-tunes. Quota-queued via a batch queue scheduler. GPU-sharing (MIG partitions or time-slicing) is often appropriate because the fine-tune is small and a full GPU would be underutilised — using a smaller partition frees the rest for concurrent online inference.
Batch inference
Scoring a large input set offline. Production examples: nightly enrichment of a customer record table, weekly relevance scoring of a content corpus, batch image classification of a backlog, embedding generation for a new document collection.
Resource profile
Throughput-bound. GPU-heavy when the model is heavy (LLMs, large vision models); CPU-only when the model is light (classical ML, small classifiers). Storage dominates non-GPU cost: large input reads from object store, large output writes back. Parallelism comes from sharding the input — each worker processes a slice of the corpus independently.
Latency budget
None on individual requests. Total wall-clock matters — the batch must complete within its maintenance window (e.g. before the morning report runs, before the next day's model training ingests the enriched features).
Lifecycle
Scheduled or triggered (a workflow orchestrator fires on a cron, a completion event, or a data-arrival signal) → workers spin up, each consuming one shard → outputs written back to object store → workers spin down → downstream pipeline reads the outputs.
Failure modes
Shard failure mid-batch — requires idempotent shard processing so the failed shard can be retried without double-writing. Output store saturation (object store throttling or a full PVC). GPU OOM on an outlier-sized input (a single very long document can blow the batch buffer). Silent partial completion: the batch appears done but some shards were skipped — output validation after completion is mandatory.
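Idempotent shard processing and post-run validation can be sketched as follows. This is a minimal illustration in which a local directory stands in for the object store and a trivial function stands in for model inference; the file naming scheme and function names are hypothetical.

```python
import json
import os
import tempfile


def _shard_path(out_dir: str, shard_id: int) -> str:
    return os.path.join(out_dir, f"shard-{shard_id:05d}.json")


def process_shard(shard_id: int, records: list, out_dir: str) -> bool:
    """Score one shard. Returns False if output already exists, so a retry is a no-op."""
    final_path = _shard_path(out_dir, shard_id)
    if os.path.exists(final_path):
        return False
    # Stand-in for real model inference.
    scored = [{"input": r, "score": len(str(r))} for r in records]
    # Write to a temp file first, then rename: readers never see a partial shard.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(scored, f)
    os.replace(tmp_path, final_path)
    return True


def missing_shards(expected_counts: dict, out_dir: str) -> list:
    """Return shard ids whose output is absent or has the wrong record count."""
    bad = []
    for shard_id, expected in sorted(expected_counts.items()):
        path = _shard_path(out_dir, shard_id)
        if not os.path.exists(path):
            bad.append(shard_id)
            continue
        with open(path) as f:
            if len(json.load(f)) != expected:
                bad.append(shard_id)
    return bad
```

Running `missing_shards` against a manifest of expected record counts after the batch finishes is what catches silent partial completion; a non-empty result should fail the pipeline rather than let downstream jobs read incomplete output.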
Scheduling fit
Default Kubernetes scheduler with a GPU node selector. Quota-aware queuing for fairness. Workflow orchestration (e.g. Argo Workflows, Ray Data) manages the shard fan-out and fan-in. Schedule for off-peak hours to use GPUs that would otherwise sit idle between training and serving peaks; this is the primary utilisation argument for batch inference.
Online inference
Real-time scoring of user-facing requests. Production examples: a recommendation API, a fraud-detection scorer, an LLM chat endpoint, an image moderation pipeline.
Resource profile
GPU or CPU depending on model. For LLM serving, memory is dominated by model weights and the KV cache: the intermediate attention key/value tensors that grow with sequence length. Efficient KV cache management, as in the PagedAttention approach introduced in the vLLM paper [3], cuts the share of KV cache memory wasted to fragmentation and over-reservation from roughly 60–80% to under 4%, directly enabling larger effective batch sizes and higher throughput on the same hardware.
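To make the KV cache pressure concrete, the sketch below estimates cache size per token and how many tokens fit in a fixed memory budget at different waste levels. The model dimensions (32 layers, 32 KV heads, head dimension 128, fp16 values) and the 20 GB budget are illustrative assumptions, roughly the shape of a 7B-class decoder without grouped-query attention; the function names are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(budget_gb: float, bytes_per_token: int,
                      usable_fraction: float) -> int:
    """Tokens that fit when only usable_fraction of the budget holds real data."""
    return int(budget_gb * 1e9 * usable_fraction // bytes_per_token)


per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(f"KV cache per token: {per_token / 2**20:.2f} MiB")

# Same 20 GB budget, two waste levels: 70% wasted versus 4% wasted.
fragmented = max_cached_tokens(20, per_token, usable_fraction=0.30)
paged = max_cached_tokens(20, per_token, usable_fraction=0.96)
print(f"tokens cached at 70% waste: {fragmented}")
print(f"tokens cached at 4% waste:  {paged}")
```

The ratio between the two token counts is the headroom that better cache management converts into larger batches or longer contexts on unchanged hardware.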
Latency budget
Tight. The convention used in this series: under 100ms p99 is "online"; under 1s is "interactive"; multi-second is "batch". LLM streaming endpoints have a different shape: time-to-first-token (TTFT) matters more than end-to-end latency because the user perceives the stream starting. An elevated TTFT typically signals queuing delay or KV cache pressure.
Lifecycle
Long-running deployment (days to months, not minutes). Autoscaling on traffic (HPA on RPS) or on queue depth and KV-cache utilisation — the latter is the correct signal for LLM serving because CPU utilisation is almost never the bottleneck [4]. The vllm:num_requests_waiting per-replica metric, exposed via Prometheus and consumed by KEDA, is the idiomatic autoscaling trigger. Rollouts via canary, shadow, or A/B deployment.
Failure modes
Cold-start latency on scale-up: model weight loading typically takes tens of seconds to several minutes depending on model size and storage tier. A 70B model loaded from object storage at typical cluster throughput can take two minutes or more before the replica is ready to serve, which breaks any sub-second SLA if the autoscaler triggers too late [6]. Traffic-driven OOM from KV cache exhaustion under long-context requests. GPU-share contention when a co-located training job takes memory. Upstream model registry unavailability blocking new replicas from loading weights during a scale-out event.
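The cold-start figure follows from simple arithmetic. The sketch below assumes fp16 weights (2 bytes per parameter) and a sustained read throughput of 1 GB/s; both are illustrative assumptions, as are the function names and the example arrival rates.

```python
def weight_load_seconds(param_count: float, bytes_per_param: int,
                        throughput_gb_per_s: float) -> float:
    """Time to pull model weights from storage at a sustained throughput."""
    return param_count * bytes_per_param / (throughput_gb_per_s * 1e9)


def backlog_during_cold_start(arrival_rps: float, capacity_rps: float,
                              load_seconds: float) -> float:
    """Requests that queue up while the new replica is still loading."""
    excess = max(0.0, arrival_rps - capacity_rps)
    return excess * load_seconds


load = weight_load_seconds(70e9, 2, throughput_gb_per_s=1.0)
print(f"70B fp16 at 1 GB/s: {load:.0f} s before the replica can serve")
print(f"backlog at 12 rps arriving vs 10 rps capacity: "
      f"{backlog_during_cold_start(12, 10, load):.0f} requests")
```

The practical consequence is that the scale-up trigger has to fire well before existing replicas saturate, or weights need to come from a faster tier such as a local cache.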
Scheduling fit
Default Kubernetes scheduler with a GPU node selector. KEDA for queue-driven or custom-metric autoscaling — the idiomatic pattern for vLLM is a KEDA ScaledObject targeting the vllm:num_requests_waiting Prometheus metric, as documented in the Red Hat Developer guide for KServe+KEDA [6]. Serving runtimes optimised for LLM workloads (e.g. vLLM, Triton with TensorRT-LLM) handle continuous batching and KV cache management internally. Model serving platforms (e.g. KServe, Seldon, BentoML) add lifecycle management, canary routing, and a standard inference API.
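The scaling decision behind a queue-depth trigger can be approximated in a few lines. The sketch below assumes an average-value target, where the desired replica count is the total number of waiting requests divided by the per-replica target, rounded up and clamped to configured bounds. It is a simplified model of the calculation, not KEDA's or the HPA's actual code, and it omits stabilisation windows and cooldowns; the function name and example numbers are hypothetical.

```python
import math


def desired_replicas(total_waiting: float, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Approximate an average-value autoscaling decision on queue depth."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(total_waiting / target_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, raw))


# Illustrative: 23 requests waiting across the fleet, target of 5 per replica.
print(desired_replicas(23, 5, min_replicas=1, max_replicas=10))
```

Keeping `min_replicas` at one or more avoids scale-to-zero cold starts on latency-sensitive endpoints, at the cost of an always-on GPU.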
Agent workloads
An agent is an LLM-based system that dynamically directs its own processes and tool usage across multiple steps [5]. Unlike a single-call inference endpoint, the agent decides at runtime which tools to call, in what order, and when to stop — the control flow is not pre-determined. Production examples: a customer-support copilot that queries a knowledge base and a ticketing API, a coding assistant that iterates on a tool-use loop, a research assistant that plans then retrieves.
Resource profile
Mixed and asymmetric. The orchestrator (routing logic, prompt assembly, state management) is CPU-bound and lightweight. The LLM calls are GPU-bound or hosted-API-bound. Retrieval steps are vector-DB-bound or search-service-bound. Memory: KV cache for the active context window, retrieval cache for repeated lookups, and session state if the agent maintains conversation history across turns.
Latency budget
Loose on total wall-clock (multi-second to tens of seconds is common and accepted for complex tasks); tight on time-to-first-token for streaming responses; tight on individual tool call round-trips (ten sequential 500ms tool calls add 5s on their own). Step-level latency budgeting is what matters: the total is the sum of the steps, so every slow step adds directly to what the user waits for.
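Step-level budgeting is straightforward to express in code. The sketch below sums sequential step latencies and flags the steps that exceed a per-step budget; the step names, timings, and the 400ms budget are illustrative assumptions.

```python
def chain_latency_ms(steps: list) -> float:
    """Total latency of a sequential chain of (name, milliseconds) steps."""
    return sum(ms for _, ms in steps)


def over_budget(steps: list, per_step_budget_ms: float) -> list:
    """Names of the steps that exceed the per-step budget."""
    return [name for name, ms in steps if ms > per_step_budget_ms]


# Illustrative session: retrieval, then alternating LLM and tool calls.
session = [
    ("retrieve", 120),
    ("llm_plan", 900),
    ("tool_ticket_lookup", 500),
    ("llm_reason", 700),
    ("tool_kb_search", 350),
    ("llm_answer", 1100),
]
print(f"total: {chain_latency_ms(session)} ms")
print(f"tool steps over 400 ms: "
      f"{over_budget([s for s in session if s[0].startswith('tool_')], 400)}")
```

Applying separate budgets per step type (LLM call, tool call, retrieval) tends to be more useful than a single end-to-end threshold, because it points at the step to fix.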
Lifecycle
Long-running serving deployment. Each user request spawns a session that may span multiple LLM calls, tool calls, and retrieval steps. The agent must acquire ground truth from the environment at each step — tool call results, retrieved documents, code execution outputs — to assess its progress and decide next actions [5]. Some sessions last seconds; others (autonomous research tasks) last minutes. Per-session state may need to persist across HTTP boundaries.
Failure modes
A tool call that fails silently or returns malformed output (the agent continues on bad state). Retrieval returning no results (the agent hallucinates instead of surfacing the gap). The LLM generating a syntactically invalid tool invocation. The agent looping without converging — infinite retries exhaust the context window and run up token cost. These are primarily observability failures: the system runs but produces wrong answers, and without per-step tracing you cannot tell where the chain broke.
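Several of these failures can be contained by the loop itself. The sketch below is a minimal tool-use loop with a hard step cap, validation of the model's tool invocation, and a per-step trace. The `next_action` callable stands in for an LLM call and is hypothetical, as is the JSON action format (`{"tool": ..., "args": ...}` or `{"final": ...}`); no real agent framework's API is implied.

```python
import json


def run_agent(next_action, tools: dict, max_steps: int = 8) -> dict:
    """Drive a tool-use loop; every step is recorded so failures are traceable."""
    trace = []
    for step in range(max_steps):
        raw = next_action(trace)  # hypothetical LLM call returning a JSON string
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            trace.append({"step": step, "error": "invalid_json"})
            continue
        if not isinstance(action, dict):
            trace.append({"step": step, "error": "invalid_action"})
            continue
        if action.get("final") is not None:
            return {"status": "done", "answer": action["final"], "trace": trace}
        name = action.get("tool")
        tool = tools.get(name) if isinstance(name, str) else None
        if tool is None:
            trace.append({"step": step, "error": "unknown_tool"})
            continue
        try:
            result = tool(**action.get("args", {}))
        except Exception as exc:  # surface tool failures instead of hiding them
            trace.append({"step": step, "tool": name, "error": repr(exc)})
            continue
        trace.append({"step": step, "tool": name, "result": result})
    # The cap stops a non-converging agent from burning context and tokens.
    return {"status": "step_limit", "answer": None, "trace": trace}
```

Returning an explicit `step_limit` status, rather than a best-effort answer, lets the caller distinguish a converged session from one that was cut off, and the trace shows which step went wrong.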
Scheduling fit
Default Kubernetes scheduler. KEDA on session count or request queue depth. Network latency between the orchestrator and the model-serving endpoint matters at every step — co-locate them or ensure sub-millisecond intra-cluster networking. Observability requirements are higher than for any other workload shape: you need per-step traces, token counts per call, tool-call success rates, and session-level error aggregation.
Decision table: shape → scheduling primitive and runtime
For a workload that fits a named shape, this is the default Kubernetes scheduling primitive and serving runtime. On-prem is a fixed-GPU cluster with sunk hardware cost; cloud-managed is an elastic cluster where you pay per GPU-hour.
| Shape | Scheduler | Queue / fairness | Runtime | Cluster preference |
|------------------------|---------------------|-----------------------|------------------------------------------|-------------------------------|
| Training (multi-node) | Gang (all-or-none) | Quota-aware (Kueue) | Training operator (PyTorchJob, MPIJob) | On-prem (sunk GPU cost) |
| Training (single-node) | Default | Quota-aware (Kueue) | PyTorchJob or direct Pod | On-prem |
| Fine-tuning (LoRA) | Default | Quota-aware (Kueue) | Training job or workflow DAG step | On-prem (or shared GPU) |
| Batch inference (GPU) | Default | Quota-aware (Kueue) | Workflow orchestrator + custom container | On-prem default / cloud burst |
| Batch inference (CPU) | Default | Optional | Workflow orchestrator or Ray Data | Either |
| Online inference | Default | None (HPA / KEDA) | Serving runtime (vLLM, Triton, KServe) | Where latency budget allows |
| Agent workload | Default | None (KEDA / HPA) | Custom orchestrator + LLM runtime | Co-locate orchestrator+model |

The dominant pattern: training and fine-tuning prefer an on-prem cluster where GPU cost is sunk; online inference goes wherever the latency budget is met; batch inference goes where the data lives unless burst capacity is cheaper. Agent workloads are unusual in that the orchestrator and the model serving tier should be co-located, because each additional network hop adds latency that compounds across steps.
How to classify a workload you haven't seen before
Five questions, asked in order:
- Does it produce a new model artifact? If yes → training or fine-tuning. If the output artifact is the same architecture as the input base → fine-tuning. Otherwise → training.
- Is there a tight per-request latency budget? If yes → online inference or agent. If no → batch inference.
- Does the request fan out into tool calls, retrieval, or multi-step reasoning? If yes → agent. If no → online inference.
- What is the resource shape — GPU-heavy and multi-hour, or GPU-light and short? Multi-hour + GPU-heavy disambiguates training from fine-tuning; duration + GPU intensity disambiguates batch from online.
- What is the failure mode that hurts most — wall-clock time, throughput, or latency? This identifies the scheduling primitive (gang, queue, default) and the observability dimension that matters.
If you are still unsure after five questions, default to treating the workload as batch inference: queue it, give it a GPU node selector, write idempotent shard logic, and add per-shard output validation. Batch inference is the safest default because it makes no availability guarantees (so you won't break an SLA) and its scheduling needs are minimal (so you won't mis-configure a gang scheduler). Upgrade the classification once you observe the actual failure modes in staging.
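The first three questions and the fallback can be written down as a small decision function. The sketch below is one possible encoding, with hypothetical field names; questions four and five are left out because they confirm and refine a classification rather than produce one. An unanswered question is modelled as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workload:
    # None means "not known yet"; field names are illustrative.
    produces_model_artifact: Optional[bool] = None
    same_architecture_as_base: Optional[bool] = None
    tight_per_request_latency: Optional[bool] = None
    fans_out_to_tools_or_retrieval: Optional[bool] = None


def classify(w: Workload) -> str:
    """Apply questions 1 to 3 in order; fall back to batch inference when unsure."""
    if w.produces_model_artifact is True:
        return "fine-tuning" if w.same_architecture_as_base else "training"
    if w.tight_per_request_latency is True:
        return "agent" if w.fans_out_to_tools_or_retrieval else "online inference"
    # No tight latency budget, or the answers are unknown: the safe default.
    return "batch inference"


print(classify(Workload(produces_model_artifact=True, same_architecture_as_base=True)))
print(classify(Workload(produces_model_artifact=False, tight_per_request_latency=True,
                        fans_out_to_tools_or_retrieval=True)))
print(classify(Workload()))
```

Encoding the order explicitly matters: a workload that both produces an artifact and serves requests is classified by the first question that matches, which mirrors asking the questions in sequence.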
References
[1] NVIDIA Technical Blog, "Scaling Deep Learning Training with NCCL", NVIDIA, 2019. developer.nvidia.com
[2] Kubernetes.io, "Gang Scheduling" (concept documentation), Kubernetes project, 2025. kubernetes.io
[3] Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", SOSP 2023 (Best Paper). arxiv.org/abs/2309.06180
[4] KServe project, "Autoscaler" (documentation), KServe, 2025. kserve.github.io
[5] Anthropic, "Building Effective Agents", December 2024. anthropic.com/research/building-effective-agents
[6] Red Hat Developer, "How to set up KServe autoscaling for vLLM with KEDA", 2025. developers.redhat.com
[7] Kubernetes SIG-Scheduling, KEP-4671 "Gang Scheduling", kubernetes/enhancements, 2024–2025. github.com/kubernetes/enhancements
About the Author

asleekgeek
Senior Developer, Architect, DevOps
Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist