AI Platform Engineering & MLOps Series · Part 9 of 34

The five canonical AI/ML workload shapes

A platform engineer asked to support any unfamiliar workload needs a taxonomy. This article defines five canonical shapes — training, fine-tuning, batch inference, online inference, and agent — and provides the playbook for each.



A platform engineer is regularly asked to support a workload they have not personally run — “the data team wants to fine-tune a 13B model; what do they need from us?” Without a taxonomy, the answer is bespoke every time. With one, you can place a new workload into a known shape and answer from the playbook for that shape.

This article defines five canonical AI/ML workload shapes. For each shape, it covers five facets: resource profile, latency budget, lifecycle, primary failure modes, and scheduling fit on Kubernetes. The article closes with a decision table and a five-question classification heuristic. The next three articles in this series apply the playbook to training, inference, and agent workloads respectively.

The five canonical shapes

Before the details: here is the one-paragraph orientation.

Training — Large, bursty, GPU-heavy, fault-tolerant via checkpointing. Multi-node is common; runtimes run from hours to days.
Fine-tuning — Training-shaped but smaller and more frequent. LoRA / QLoRA on a single GPU is the common path; full fine-tunes need multi-GPU.
Batch inference — Throughput-bound, async over object storage, schedulable, can run in off-hours.
Online inference — Latency-bound, 24×7, autoscaling on traffic or queue depth, GPU-sharing for cost.
Agent / LLM — Latency-bound but with retrieval and tool-use spikes. Multi-step, harder to capacity-plan than single-call inference.

In short: training is big, slow, and resumable; fine-tuning is training but smaller and more frequent; batch inference is throughput on a schedule; online inference is latency 24×7; agent workloads are online inference with extra steps, and the extra steps change almost every operational assumption.


Training

A from-scratch or large-scale training run. Production examples: pre-training an in-house foundation model, training a vision model on a large image corpus, training a recommender on a multi-billion-row event table.

Resource profile

GPU-heavy (8–64+ GPUs is typical for serious training runs), with multi-node jobs being common at scale. CPU and RAM requirements are moderate relative to GPU. Storage reads are large (datasets from object store or a parallel file system); checkpoint writes are large and must be durable. For multi-node runs, the inter-node fabric is a first-class resource: NCCL uses GPUDirect RDMA over InfiniBand or RoCE to execute all-reduce operations at roughly 11 GB/s per InfiniBand EDR or 100GbE adapter [1]. Without RDMA, bandwidth is capped at roughly half the PCIe bus bandwidth, and because every training step waits on the all-reduce, that slowdown is paid on every step of a multi-week run.
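To see why fabric bandwidth matters on every step, a back-of-the-envelope sketch can help. It assumes a ring all-reduce, in which each rank sends roughly 2(N-1)/N times the gradient payload, and it ignores latency and compute overlap. The model size, rank count, and the halved bandwidth figure are illustrative assumptions, not measurements.

```python
def ring_allreduce_seconds(payload_bytes: float, ranks: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce (latency terms ignored)."""
    if ranks < 2:
        return 0.0  # nothing to exchange on a single rank
    traffic = 2 * (ranks - 1) / ranks * payload_bytes
    return traffic / bandwidth_bytes_per_s


GB = 1e9
grads = 7e9 * 2  # hypothetical 7B-parameter model with FP16 gradients

fast = ring_allreduce_seconds(grads, ranks=16, bandwidth_bytes_per_s=11 * GB)
slow = ring_allreduce_seconds(grads, ranks=16, bandwidth_bytes_per_s=5.5 * GB)  # assumed halved bandwidth
print(f"{fast:.2f} s vs {slow:.2f} s per all-reduce")
```

Real jobs overlap communication with the backward pass and use hierarchical collectives, so treat the output as an upper bound on exposed communication time rather than a prediction.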

Latency budget

None on individual steps. Wall-clock matters for iteration speed — a team that finishes an experiment in 12 hours can run twice as many experiments as one that takes 24. There is no user-facing SLA.

Lifecycle

Submit job → gang-scheduled (all N pods allocated or none) → all pods ready → training begins. Execution is resumable from checkpoint; multi-hour to multi-day runtimes are normal. Completion produces a model artifact and metrics written to a model registry.

Failure modes

Single-node hardware failure on a 16-node job halts the entire run; OOM mid-epoch (gradient accumulation or batch size misconfigured); data path saturation (the storage tier cannot feed the GPUs); NCCL communication hangs (mismatched collective timeout, topology mismatch, or RoCE misconfiguration); preemption by a higher-priority workload. Resumability from a recent checkpoint is what makes any of this tractable — without it, every failure restarts from epoch zero.
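The resumability requirement can be sketched in a few lines. This is a minimal illustration using only the standard library: JSON stands in for a real framework checkpoint, and the `train`, `save_checkpoint`, and `load_checkpoint` names and the `fail_at` parameter are hypothetical, introduced only to simulate a node failure.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never corrupts the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic rename on POSIX filesystems


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def train(path: str, total_steps: int, every: int, fail_at: Optional[int] = None) -> int:
    state = load_checkpoint(path)  # resume where the last run stopped
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        # ... one real training step would run here ...
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["step"]
```

With a checkpoint every 10 steps, a failure at step 37 loses at most the work since step 30; the checkpoint interval is the knob that trades write cost against lost work.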

Scheduling fit

Gang scheduling is a correctness requirement for synchronous distributed training: the job needs all N pods simultaneously, or the pods that do launch sit idle holding their GPUs while waiting for peers, producing deadlock [2]. Gang scheduling on Kubernetes (a PodGroup plus a min-available count) enforces all-or-nothing semantics; implementations include Volcano and the upstream gang scheduling work tracked in KEP-4671 [7]. Quota-aware queuing (Kueue) adds fairness across tenants and supports cohort borrowing when one team’s quota sits idle. GPU node selectors pin the job to the correct SKU.
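The all-or-nothing rule is easy to state in code. The sketch below is a toy model of the admission decision only; it is not how Volcano or the Kubernetes scheduler implements it, and the `gang_place` function and its inputs are hypothetical.

```python
from typing import Optional


def gang_place(free_gpus: dict[str, int], pods: int,
               gpus_per_pod: int) -> Optional[dict[str, int]]:
    """Return {node: pods_placed} covering every pod, or None if the full gang does not fit."""
    placement: dict[str, int] = {}
    remaining = pods
    for node, free in free_gpus.items():
        fit = min(free // gpus_per_pod, remaining)
        if fit:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            return placement
    return None  # partial fit: admit nothing, so no pod holds GPUs while waiting


print(gang_place({"node-a": 8, "node-b": 8}, pods=2, gpus_per_pod=8))
print(gang_place({"node-a": 8, "node-b": 4}, pods=2, gpus_per_pod=8))
```

The second call returns None even though one pod would fit: placing it alone is exactly the partial allocation that gang scheduling exists to prevent.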

Fine-tuning

Fine-tuning is training-shaped but smaller. Production examples: per-customer LoRA adapter on a base LLM, retraining a classifier on the most recent week of data, domain-adapting an embedding model after corpus drift.

Resource profile

Often single-GPU. Parameter-efficient methods like LoRA and QLoRA fit a 7B-parameter model fine-tune onto a single 24–80 GB GPU, because the adapter represents a small fraction of the base model’s parameter count. Full fine-tunes of larger models (13B+) need multi-GPU. Memory budget is tighter than from-scratch training because the adapter weights, the frozen base model, and the optimiser state all compete for VRAM simultaneously.
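How small is "a small fraction"? LoRA replaces the update to a d_out × d_in weight matrix with two low-rank factors, so the adapter holds rank × (d_out + d_in) parameters for that matrix. The dimensions and rank below are illustrative assumptions, not values from a specific model.

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one weight matrix: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


d = 4096   # assumed hidden size of one square projection matrix
rank = 8   # assumed LoRA rank

full = d * d
adapter = lora_params(d, d, rank)
fraction = adapter / full
print(f"{adapter:,} adapter params vs {full:,} frozen ({fraction:.2%})")
```

Only the adapter parameters need gradients and optimiser state, which is the main reason the fine-tune fits on one GPU while the frozen base weights still occupy most of the memory.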

Latency budget

None. Cadence matters — fine-tunes are often triggered by a drift signal or on a schedule (daily, weekly). Run time is typically minutes to a few hours, which makes retries cheap relative to training.

Lifecycle

Triggered (drift signal or schedule) → executed (minutes to hours) → evaluated against a held-out validation set → registered in the model registry → promoted via the registry lifecycle states (staging → champion).

Failure modes

Proportionally less painful than training failures because runs are shorter. The most common operational failure is hyperparameters that worked for the base model breaking on the fine-tuned variant; learning-rate sensitivity is much higher with LoRA than with a full fine-tune. Gradient explosion on a poorly configured adapter is also common.

Scheduling fit

Gang scheduling is not required for single-GPU fine-tunes. Quota-queued via a batch queue scheduler. GPU-sharing (MIG partitions or time-slicing) is often appropriate because the fine-tune is small and a full GPU would be underutilised — using a smaller partition frees the rest for concurrent online inference.

Batch inference

Scoring a large input set offline. Production examples: nightly enrichment of a customer record table, weekly relevance scoring of a content corpus, batch image classification of a backlog, embedding generation for a new document collection.

Resource profile

Throughput-bound. GPU-heavy when the model is heavy (LLMs, large vision models); CPU-only when the model is light (classical ML, small classifiers). Storage dominates non-GPU cost: large input reads from object store, large output writes back. Parallelism comes from sharding the input — each worker processes a slice of the corpus independently.

Latency budget

None on individual requests. Total wall-clock matters — the batch must complete within its maintenance window (e.g. before the morning report runs, before the next day’s model training ingests the enriched features).

Lifecycle

Scheduled or triggered (a workflow orchestrator fires on a cron, a completion event, or a data-arrival signal) → workers spin up, each consuming one shard → outputs written back to object store → workers spin down → downstream pipeline reads the outputs.

Failure modes

Shard failure mid-batch — requires idempotent shard processing so the failed shard can be retried without double-writing. Output store saturation (object store throttling or a full PVC). GPU OOM on an outlier-sized input (a single very long document can blow the batch buffer). Silent partial completion: the batch appears done but some shards were skipped — output validation after completion is mandatory.
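Idempotent shard processing and post-batch validation can be sketched as follows. A local directory stands in for the object store here (object stores generally lack an atomic rename, so a real pipeline would rely on its own commit convention), and the scoring step is a placeholder; `process_shard` and `missing_shards` are hypothetical names.

```python
import json
import os
import tempfile


def shard_path(out_dir: str, shard_id: int) -> str:
    return os.path.join(out_dir, f"shard-{shard_id:05d}.json")


def process_shard(shard_id: int, records: list[str], out_dir: str) -> str:
    """Score one shard; safe to retry because the output path is deterministic."""
    final = shard_path(out_dir, shard_id)
    if os.path.exists(final):
        return final  # already published by an earlier attempt, do not write twice
    scored = [{"input": r, "score": len(r)} for r in records]  # placeholder for model inference
    fd, tmp = tempfile.mkstemp(dir=out_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(scored, f)
    os.replace(tmp, final)  # publish only complete files
    return final


def missing_shards(expected: int, out_dir: str) -> list[int]:
    """Post-batch validation: shard ids that have no published output."""
    return [i for i in range(expected) if not os.path.exists(shard_path(out_dir, i))]
```

Running `missing_shards` after the fan-in step is what catches silent partial completion: the batch is only done when it returns an empty list.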

Scheduling fit

Default Kubernetes scheduler with a GPU node selector. Quota-aware queuing for fairness. Workflow orchestration (e.g. Argo Workflows, Ray Data) manages the shard fan-out and fan-in. Schedule for off-peak hours to use GPUs that would otherwise sit idle between training and serving peaks; this is the primary utilisation argument for batch inference.

Online inference

Real-time scoring of user-facing requests. Production examples: a recommendation API, a fraud-detection scorer, an LLM chat endpoint, an image moderation pipeline.

Resource profile

GPU or CPU depending on model. For LLM serving, memory is dominated by model weights and the KV cache, the intermediate attention key/value tensors that grow with sequence length. Efficient KV cache management, such as the PagedAttention approach introduced in the vLLM paper [3], reduces KV cache memory waste from roughly 60–80% to under 4%, directly enabling larger effective batch sizes and higher throughput on the same hardware.
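A quick estimate shows why the KV cache dominates. The sketch assumes standard multi-head attention (no grouped-query attention) and FP16 values; the layer count and hidden size are illustrative 13B-class dimensions, not a statement about any particular deployment.

```python
def kv_cache_bytes_per_token(layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Keys plus values, for every layer, for one token."""
    return 2 * layers * hidden_size * bytes_per_value


per_token = kv_cache_bytes_per_token(layers=40, hidden_size=5120)  # assumed dimensions
per_sequence_gib = per_token * 2048 / 2**30                        # one 2048-token sequence
print(f"{per_token / 1024:.0f} KiB per token, {per_sequence_gib:.2f} GiB per 2048-token sequence")
```

At these dimensions a handful of long concurrent sequences consumes as much memory as a large share of the weights, which is why reserving the maximum length for every request up front is so wasteful.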

Latency budget

Tight. The convention used in this series: under 100 ms p99 is “online”; under 1 s is “interactive”; multi-second is “batch”. LLM streaming endpoints have a different shape: time-to-first-token (TTFT) matters more than end-to-end latency because the user perceives the stream starting. An elevated TTFT typically signals queuing delay or KV cache pressure.
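The series convention can be applied mechanically to measured latencies. The sketch below uses a nearest-rank p99 and, as a simplifying assumption, labels everything at or above 1 s as "batch"; the function names are hypothetical.

```python
def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def latency_class(p99_ms: float) -> str:
    """Map a p99 latency onto the convention used in this series."""
    if p99_ms < 100:
        return "online"
    if p99_ms < 1000:
        return "interactive"
    return "batch"


mostly_fast = [20.0] * 99 + [400.0]   # one slow request in a hundred
tail_heavy = [20.0] * 98 + [400.0] * 2
print(latency_class(p99(mostly_fast)), latency_class(p99(tail_heavy)))
```

The two samples differ by a single slow request, yet they land in different classes, which is the reason the budget is stated at p99 rather than as a mean.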

Lifecycle

Long-running deployment (days to months, not minutes). Autoscaling on traffic (HPA on RPS) or on queue depth and KV-cache utilisation — the latter is the correct signal for LLM serving because CPU utilisation is almost never the bottleneck [4]. The vllm:num_requests_waiting per-replica metric, exposed via Prometheus and consumed by KEDA, is the idiomatic autoscaling trigger. Rollouts via canary, shadow, or A/B deployment.
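The arithmetic behind queue-depth autoscaling is simple. The sketch follows the formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × currentMetric / targetMetric), and ignores its tolerance band and stabilisation windows; the target of 4 waiting requests per replica and the replica bounds are assumptions for illustration.

```python
import math


def desired_replicas(current: int, waiting_per_replica: float, target_per_replica: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """HPA-style scaling decision on an average per-replica queue depth."""
    raw = math.ceil(current * waiting_per_replica / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(current=2, waiting_per_replica=12, target_per_replica=4))
print(desired_replicas(current=4, waiting_per_replica=0, target_per_replica=4))
```

Choosing the target is the real design decision: set it too high and requests queue before new replicas are requested, set it too low and the deployment holds idle GPUs.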

Failure modes

Cold-start latency on scale-up (model weight loading typically takes tens of seconds to several minutes depending on model size and storage tier — a 70B model loaded from object storage can take two minutes or more before the replica is ready to serve, which breaks any sub-second SLA if the autoscaler triggers too late [6]). Traffic-driven OOM from KV cache exhaustion under long-context requests. GPU-share contention when a co-located training job takes memory. Upstream model registry unavailability blocking new replicas from loading weights during a scale-out event.
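The cold-start figure follows from simple arithmetic on weight size and read throughput. The 1 GB/s and 5 GB/s throughput values below are assumptions for illustration; measure your own storage tier before relying on them.

```python
def weight_load_seconds(params_billions: float, bytes_per_param: float,
                        read_gb_per_s: float) -> float:
    """Lower bound on time to read model weights (ignores deserialisation and warm-up)."""
    size_gb = params_billions * bytes_per_param  # 1e9 params x bytes per param / 1e9 bytes per GB
    return size_gb / read_gb_per_s


object_store = weight_load_seconds(70, 2, read_gb_per_s=1.0)  # 70B FP16, assumed 1 GB/s
local_cache = weight_load_seconds(70, 2, read_gb_per_s=5.0)   # same model, assumed 5 GB/s
print(f"{object_store:.0f} s from object storage, {local_cache:.0f} s from a faster tier")
```

Because this time is paid before the replica serves its first request, the autoscaler has to trigger at least that far ahead of the load, or a warm pool has to absorb the gap.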

Scheduling fit

Default Kubernetes scheduler with a GPU node selector. KEDA for queue-driven or custom-metric autoscaling — the idiomatic pattern for vLLM is a KEDA ScaledObject targeting the vllm:num_requests_waiting Prometheus metric, as documented in the Red Hat Developer guide for KServe+KEDA [6]. Serving runtimes optimised for LLM workloads (e.g. vLLM, Triton with TensorRT-LLM) handle continuous batching and KV cache management internally.

Agent workloads

An agent is an LLM-based system that dynamically directs its own processes and tool usage across multiple steps [5]. Unlike a single-call inference endpoint, the agent decides at runtime which tools to call, in what order, and when to stop — the control flow is not pre-determined. Production examples: a customer-support copilot that queries a knowledge base and a ticketing API, a coding assistant that iterates on a tool-use loop, a research assistant that plans then retrieves.

Resource profile

Mixed and asymmetric. The orchestrator (routing logic, prompt assembly, state management) is CPU-bound and lightweight. The LLM calls are GPU-bound or hosted-API-bound. Retrieval steps are vector-DB-bound or search-service-bound. Memory: KV cache for the active context window, retrieval cache for repeated lookups, and session state if the agent maintains conversation history across turns.

Latency budget

Loose on total wall-clock (multi-second to tens of seconds is common and accepted for complex tasks); tight on time-to-first-token for streaming responses; tight on individual tool call round-trips (a 10-step chain in which every step makes a 500 ms tool call spends 5 s on tool calls alone). Step-level latency budgeting is what matters: the total is the sum of the steps, so every slow step adds directly to what the user waits for.
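Step-level budgeting amounts to summing per-step latencies and flagging the steps that exceed their allowance. The step names and the 300 ms per-step budget below are hypothetical.

```python
def total_latency_ms(steps: list[tuple[str, float]]) -> float:
    """Sequential chain: the session latency is the sum of its steps."""
    return sum(ms for _, ms in steps)


def over_budget(steps: list[tuple[str, float]], per_step_budget_ms: float) -> list[str]:
    """Names of the steps that exceed the per-step budget."""
    return [name for name, ms in steps if ms > per_step_budget_ms]


chain = [(f"tool_call_{i}", 500.0) for i in range(10)]  # ten 500 ms tool calls
print(total_latency_ms(chain) / 1000, "s total")
print(len(over_budget(chain, per_step_budget_ms=300.0)), "steps over budget")
```

Budgeting per step rather than per session tells you which tool or retrieval call to fix, not just that the session was slow.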

Lifecycle

Long-running serving deployment. Each user request spawns a session that may span multiple LLM calls, tool calls, and retrieval steps. The agent must acquire ground truth from the environment at each step — tool call results, retrieved documents, code execution outputs — to assess its progress and decide next actions [5]. Some sessions last seconds; others (autonomous research tasks) last minutes. Per-session state may need to persist across HTTP boundaries.

Failure modes

A tool call that fails silently or returns malformed output (the agent continues on bad state). Retrieval returning no results (the agent hallucinates instead of surfacing the gap). The LLM generating a syntactically invalid tool invocation. The agent looping without converging — infinite retries exhaust the context window and run up token cost. These are primarily observability failures: the system runs but produces wrong answers, and without per-step tracing you cannot tell where the chain broke.
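Several of these failure modes can be contained with a hard step cap, validation of each tool invocation, and a per-step trace. The loop below is a minimal sketch, not a real agent framework: `next_action` is a hypothetical callable standing in for the LLM, and the JSON shape with "tool" and "args" keys is an assumed convention.

```python
import json
from typing import Callable


def run_agent(next_action: Callable[[list], str],
              tools: dict[str, Callable[[dict], str]],
              max_steps: int = 8) -> dict:
    """Drive a tool-use loop with a step cap and a trace record for every step."""
    trace: list[dict] = []
    for step in range(max_steps):
        raw = next_action(trace)  # hypothetical LLM call returning a JSON string
        try:
            action = json.loads(raw)
            name = action["tool"]
            args = action.get("args") or {}
            if not isinstance(name, str) or not isinstance(args, dict):
                raise TypeError("tool must be a string and args an object")
        except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
            trace.append({"step": step, "error": "invalid tool invocation", "raw": raw})
            continue  # record the bad call instead of acting on it
        if name == "finish":
            return {"status": "done", "answer": args.get("answer"), "trace": trace}
        if name not in tools:
            trace.append({"step": step, "error": f"unknown tool {name}"})
            continue
        try:
            result = tools[name](args)
            trace.append({"step": step, "tool": name, "ok": True, "result": result})
        except Exception as exc:  # a failing tool is recorded, not swallowed
            trace.append({"step": step, "tool": name, "ok": False, "error": str(exc)})
    return {"status": "step_limit", "answer": None, "trace": trace}
```

A session that ends with status "step_limit" is the non-converging loop made visible, and the trace shows which step first went wrong.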

Scheduling fit

Default Kubernetes scheduler. KEDA on session count or request queue depth. Network latency between the orchestrator and the model-serving endpoint matters at every step — co-locate them or ensure sub-millisecond intra-cluster networking. Observability requirements are higher than for any other workload shape: you need per-step traces, token counts per call, tool-call success rates, and session-level error aggregation.

Decision table: shape → scheduling primitive and runtime

For a workload that fits a named shape, this is the default Kubernetes scheduling primitive and serving runtime. On-prem is a fixed-GPU cluster with sunk hardware cost; cloud-managed is an elastic cluster where you pay per GPU-hour.

Shape	Scheduler	Queue / fairness	Runtime	Cluster preference
Training (multi-node)	Gang (all-or-none)	Quota-aware (Kueue)	PyTorchJob, MPIJob	On-prem (sunk GPU cost)
Training (single-node)	Default	Quota-aware (Kueue)	PyTorchJob or direct Pod	On-prem
Fine-tuning (LoRA)	Default	Quota-aware (Kueue)	Training job or workflow DAG step	On-prem (or shared GPU)
Batch inference (GPU)	Default	Quota-aware (Kueue)	Workflow orchestrator + custom container	On-prem default / cloud burst
Batch inference (CPU)	Default	Optional	Workflow orchestrator or Ray Data	Either
Online inference	Default	None (HPA / KEDA)	vLLM, Triton, KServe	Where latency budget allows
Agent workload	Default	None (KEDA / HPA)	Custom orchestrator + LLM runtime	Co-locate orchestrator + model

The dominant pattern: training and fine-tuning prefer an on-prem cluster where GPU cost is sunk; online inference goes wherever the latency budget is met; batch inference goes where the data lives unless burst capacity is cheaper. Agent workloads are unusual in that the orchestrator and the model serving tier should be co-located — each additional network hop adds latency that compounds across steps.

How to classify a workload you haven’t seen before

Five questions, asked in order:

1. Does it produce a new model artifact? If yes → training or fine-tuning. If the output artifact is the same architecture as the input base → fine-tuning. Otherwise → training.
2. Is there a tight per-request latency budget? If yes → online inference or agent. If no → batch inference.
3. Does the request fan out into tool calls, retrieval, or multi-step reasoning? If yes → agent. If no → online inference.
4. What is the resource shape: GPU-heavy and multi-hour, or GPU-light and short? Multi-hour + GPU-heavy disambiguates training from fine-tuning; duration + GPU intensity disambiguates batch from online.
5. What is the failure mode that hurts most: wall-clock time, throughput, or latency? This identifies the scheduling primitive (gang, queue, default) and the observability dimension that matters.

Safe default: If you are still unsure after five questions, default to treating the workload as batch inference: queue it, give it a GPU node selector, write idempotent shard logic, and add per-shard output validation. Batch inference is the safest default because it makes no availability guarantees (so you won’t break an SLA) and its scheduling needs are minimal (so you won’t mis-configure a gang scheduler). Upgrade the classification once you observe the actual failure modes in staging.
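The first three questions form a small decision tree, sketched below. Questions 4 and 5 refine the answer rather than change the branch, so they are left out; the `Workload` fields are hypothetical names for the answers to each question.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    produces_model_artifact: bool
    same_architecture_as_base: bool = False
    tight_per_request_latency: bool = False
    fans_out_to_tools_or_retrieval: bool = False


def classify(w: Workload) -> str:
    """Questions 1 to 3 of the heuristic, asked in order."""
    if w.produces_model_artifact:
        return "fine-tuning" if w.same_architecture_as_base else "training"
    if not w.tight_per_request_latency:
        return "batch inference"  # also the safe default when unsure
    return "agent" if w.fans_out_to_tools_or_retrieval else "online inference"


print(classify(Workload(produces_model_artifact=True, same_architecture_as_base=True)))
print(classify(Workload(produces_model_artifact=False, tight_per_request_latency=True,
                        fans_out_to_tools_or_retrieval=True)))
```

Note that a workload with no model artifact and no stated latency budget falls through to batch inference, which matches the safe default above.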


References

[1] NVIDIA Technical Blog, “Scaling Deep Learning Training with NCCL”, NVIDIA, 2019. developer.nvidia.com
[2] Kubernetes.io, “Gang Scheduling” (concept documentation), Kubernetes project, 2025. kubernetes.io
[3] Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention”, SOSP 2023 (Best Paper). arxiv.org/abs/2309.06180
[4] KServe project, “Autoscaler” (documentation), KServe, 2025. kserve.github.io
[5] Anthropic, “Building Effective Agents”, December 2024. anthropic.com/research/building-effective-agents
[6] Red Hat Developer, “How to set up KServe autoscaling for vLLM with KEDA”, 2025. developers.redhat.com
[7] Kubernetes SIG-Scheduling, KEP-4671 “Gang Scheduling”, kubernetes/enhancements, 2024–2025. github.com/kubernetes/enhancements

Continue the Journey

Training workloads on Kubernetes — operators, gang scheduling, and checkpointing

Inference workloads — batch vs online, latency budgets, and where the serving runtime sits

The GPU scheduling stack: queue admission, gang scheduling, and hardware abstraction