AI Platform Engineering & MLOps · Part XI of 34

Inference workloads — batch vs online, latency budgets, and where the serving runtime sits

Batch and online inference have opposite optimisation functions. This article covers resource shape, latency metrics, autoscaling signals, and how to size GPU capacity for a target p95.


[Figure: a shared GPU bank feeding two lanes. Batch lane: wide, queue-fed, preemptible. Online lane: narrow, low-latency, warm-pool.]

Every model you train eventually has to answer a request. How it answers that request — at what speed, at what cost, under what scheduling constraints — is the inference problem. Inference is not a single workload shape: it splits into two families with opposite optimisation functions. Batch inference maximises throughput and minimises cost per token. Online inference minimises latency and maximises availability. Getting the two confused is one of the more expensive mistakes a platform team can make, because the infrastructure decisions that are correct for one are often wrong for the other.

This article covers how to tell them apart, what resource shapes they need, which autoscaling signals are reliable and which are not, and how to work backwards from a latency SLO to a GPU count.

The dividing line

Two questions settle the classification:

Is there a per-request latency budget a human or synchronous system will notice? If yes → online.
Is the full input corpus known before the job starts? If yes → batch.

A practical taxonomy: under 100 ms p99 is online (real-time API), under 1 second is interactive (chatbot, autocomplete), multi-second is batch-tolerant. Streaming LLM generation is a special case: what matters is time-to-first-token (TTFT), not total response time, because the user sees the stream start. A TTFT under one second is the online bar for streaming endpoints.

Batch vs online: a comparison across six dimensions

The table below covers the six dimensions that matter operationally:

Dimension	Batch inference	Online inference
Optimisation target	Throughput (tokens/sec, samples/hour)	Latency (TTFT, p95 end-to-end)
Input availability	Full dataset known before job starts	Requests arrive one at a time, asynchronously
Scheduling	Quota-queued, schedulable off-peak, preemptible	24×7, non-preemptible, warm-pool required
Cost profile	GPU runs at high utilisation for a bounded window; ~5–10× cheaper per token vs online at typical utilisation [6]	GPU runs continuously; cost dominated by idle capacity between request bursts
Result delivery	Written to object storage / database when complete	Returned in the HTTP response or streamed token-by-token
Autoscaling signal	Input queue depth (items pending)	Request queue depth, KV-cache saturation — not GPU utilisation [4]

The per-token latency model for LLM serving

Classical model serving has one latency number: end-to-end response time. LLM serving has three. The metrics below are the ones commonly used in LLM serving benchmarks [2]; the PagedAttention paper by Kwon et al. (SOSP 2023) [1] addresses the memory management that constrains them:

Time to first token (TTFT): the gap between the request arriving and the first output token being emitted. This is latency as the user perceives it in a streaming interface. TTFT is dominated by the prefill phase, which scales with prompt length and model size.
Time per output token (TPOT): inter-token delay during the decode phase. This is the streaming smoothness metric — a high TPOT means tokens arrive in bursts. TPOT is constrained by the memory bandwidth available to move KV-cache data per decode step.
End-to-end latency: TTFT + (output_tokens × TPOT). This is the number that matters for non-streaming callers and for SLO contracts.

PagedAttention addresses the memory-management inefficiency that constrains both TPOT and throughput. Earlier LLM serving engines allocated KV-cache memory as a contiguous tensor per request. The allocation happened at the start, based on the maximum possible output length, which left 60–80% of reserved GPU memory unused for typical requests [1]. PagedAttention borrows the virtual-memory model from operating systems: the KV cache is split into fixed-size pages that are allocated and deallocated on demand. The Kwon et al. paper reports a 2–4× throughput improvement over prior systems (FasterTransformer, Orca) on the same hardware, driven primarily by this reduction in memory waste.
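The paging idea can be illustrated with a toy block allocator. This is a minimal sketch of on-demand block allocation in general, not vLLM's implementation; the class name, the 16-token block size, and the 2048-token maximum length are all assumptions chosen for illustration.

```python
import math

BLOCK_TOKENS = 16  # tokens per KV-cache block (illustrative value, an assumption)


class PagedKVCache:
    """Toy allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, total_blocks: int) -> None:
        self.free_blocks = list(range(total_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.token_counts: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Record one more token, taking a new block only when needed."""
        count = self.token_counts.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if count == len(table) * BLOCK_TOKENS:  # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.token_counts[seq_id] = count + 1

    def release(self, seq_id: str) -> None:
        """Return every block of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.token_counts.pop(seq_id, None)


def contiguous_blocks_reserved(max_len: int) -> int:
    """Blocks a contiguous allocator reserves up front for max_len tokens."""
    return math.ceil(max_len / BLOCK_TOKENS)


cache = PagedKVCache(total_blocks=256)
for _ in range(100):  # a request that ends up holding 100 tokens
    cache.append_token("req-1")
paged_used = len(cache.block_tables["req-1"])
reserved = contiguous_blocks_reserved(max_len=2048)
print(f"paged: {paged_used} blocks, contiguous reservation: {reserved} blocks")
```

In this toy setup the paged allocator holds only the blocks the sequence has actually filled, while the contiguous reservation is sized for the worst case regardless of how short the response turns out to be.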

Continuous batching compounds the gain. Traditional request-level batching waits for all sequences in a batch to finish before starting new ones — long requests block short ones. Continuous batching admits new requests at the token level: after each decode step, completed sequences are evicted and new sequences join [1][3]. The result is that throughput scales with concurrency rather than plateauing at the longest-request latency. The serving runtime implementations that use these techniques (e.g. vLLM, Text Generation Inference) expose this behaviour by default.
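The difference between the two batching strategies shows up even in a toy simulation. The sketch below assumes every active sequence emits exactly one token per decode step and ignores prefill entirely; the function names and the request lengths are hypothetical, and real schedulers are considerably more involved.

```python
from collections import deque


def static_batching_steps(lengths: list[int], slots: int) -> int:
    """Decode steps when each batch runs until its longest sequence ends."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths: list[int], slots: int) -> int:
    """Decode steps when finished sequences are replaced after every step."""
    waiting = deque(lengths)
    active: list[int] = []  # remaining tokens for each running sequence
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:  # admit into free slots
            active.append(waiting.popleft())
        steps += 1  # one decode step for every active sequence
        active = [r - 1 for r in active if r > 1]  # evict completed ones
    return steps


# Two long requests mixed with short ones, four batch slots.
lengths = [200, 10, 10, 10, 200, 10, 10, 10]
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(f"static: {static_steps} steps, continuous: {continuous_steps} steps")
```

With these inputs the static scheduler serialises the two long requests into separate batches, while the continuous scheduler overlaps them by filling slots as soon as the short requests finish.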

The builder below decomposes a p95 end-to-end budget across the four stages of a serving path — gateway, queue wait, prefill, and decode — and shows whether your configuration fits the budget and which latency class it lands in.

Latency Budget Builder

Decompose a p95 end-to-end budget across the four stages of an LLM serving path. Decode time is computed as output tokens × TPOT; TTFT is everything before the first token.

Inputs: total p95 budget 3000 ms, gateway overhead 15 ms, queue wait 20 ms, prefill 300 ms, output tokens 200, TPOT 10 ms/tok.

[Chart: stacked latency bar on a 0 ms to 3.00 s axis, with a white marker at the 3.00 s budget.]

Gateway — 15 ms. Auth, rate-limiting, and routing at the inference gateway.

Queue wait — 20 ms. Time spent waiting for a free slot in the serving runtime batch.

Prefill (TTFT) — 300 ms. Prompt processing — dominates time-to-first-token.

Decode (tokens × TPOT) — 2.00 s. Token-by-token generation, constrained by memory bandwidth.

Within budget — end-to-end p95 is 2.33 s against a 3.00 s budget (665 ms headroom).

Latency class: Batch-tolerant (multi-second).

TTFT is 335 ms — meets the under-one-second online bar for streaming endpoints.
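The builder's arithmetic can be reproduced in a few lines. The sketch below is a minimal illustration rather than the component's actual implementation: the class and field names are hypothetical, and the latency-class thresholds are taken from the taxonomy given earlier in the article.

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    budget_ms: float
    gateway_ms: float
    queue_ms: float
    prefill_ms: float
    output_tokens: int
    tpot_ms: float

    @property
    def ttft_ms(self) -> float:
        """Everything that happens before the first token is emitted."""
        return self.gateway_ms + self.queue_ms + self.prefill_ms

    @property
    def decode_ms(self) -> float:
        """Output tokens multiplied by time per output token."""
        return self.output_tokens * self.tpot_ms

    @property
    def total_ms(self) -> float:
        return self.ttft_ms + self.decode_ms

    @property
    def headroom_ms(self) -> float:
        """Positive when the configuration fits inside the budget."""
        return self.budget_ms - self.total_ms


def latency_class(total_ms: float) -> str:
    """Map an end-to-end latency onto the article's three classes."""
    if total_ms < 100:
        return "online"
    if total_ms < 1000:
        return "interactive"
    return "batch-tolerant"


b = LatencyBudget(budget_ms=3000, gateway_ms=15, queue_ms=20,
                  prefill_ms=300, output_tokens=200, tpot_ms=10)
print(f"TTFT {b.ttft_ms:.0f} ms, decode {b.decode_ms:.0f} ms, "
      f"total {b.total_ms:.0f} ms, headroom {b.headroom_ms:.0f} ms, "
      f"class {latency_class(b.total_ms)}")
```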

Resource shape

Online inference is GPU-memory-bound, not GPU-compute-bound. The primary constraint is fitting the model weights and the KV cache into GPU VRAM. A rough sizing formula:

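VRAM required ≈ (model_params × bytes_per_param) + (2 × batch_size × context_length × layers × kv_heads × head_dim × bytes_per_element)

The factor of 2 covers the separate key and value tensors, and kv_heads × head_dim is the width of each of those tensors per layer and per token.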

For a 7-billion-parameter model in FP16 (2 bytes per parameter), the weights alone consume roughly 14 GB of VRAM. An A100 80 GB GPU has headroom for both weights and a generous KV-cache allocation. A 7B FP16 model on an A100 80 GB achieves approximately 80–100 tokens per second at batch sizes of 8–16 with short sequences, based on published benchmarks using TensorRT-LLM [8]. Read as a per-stream decode rate, that corresponds to a TPOT of roughly 10–12.5 ms, so a 200-token response needs about 2–2.5 s of decode time. A target p95 of 250 ms end-to-end for the same 200-token response would imply a TPOT budget of roughly 1.25 ms, which is not achievable at that throughput level.
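As a rough check on the sizing formula, the sketch below computes the weight and KV-cache footprints separately. The KV-cache term counts every KV head in every layer (kv_heads × head_dim per token, doubled for keys and values). The architecture values used in the example (32 layers, 32 KV heads, head dimension 128) are assumptions typical of a 7B-class decoder, and the batch size and context length are illustrative; activation memory and runtime overhead are ignored.

```python
def weights_bytes(params: float, bytes_per_param: int) -> float:
    """Memory taken by the model weights."""
    return params * bytes_per_param


def kv_cache_bytes(batch_size: int, context_length: int, layers: int,
                   kv_heads: int, head_dim: int, bytes_per_element: int) -> int:
    """Keys and values (factor 2) for every token, layer, and KV head."""
    return (2 * batch_size * context_length * layers
            * kv_heads * head_dim * bytes_per_element)


GB = 1e9  # decimal gigabytes, matching how GPU memory is usually quoted

w = weights_bytes(params=7e9, bytes_per_param=2)  # FP16 weights
kv = kv_cache_bytes(batch_size=16, context_length=4096, layers=32,
                    kv_heads=32, head_dim=128, bytes_per_element=2)
total_gb = (w + kv) / GB
print(f"weights {w / GB:.1f} GB, KV cache {kv / GB:.1f} GB, "
      f"total {total_gb:.1f} GB, fits in 80 GB: {total_gb < 80}")
```

Under these assumptions the KV cache is larger than the weights, which is why batch size and context length, not parameter count alone, tend to decide whether a configuration fits.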

Worked example: suppose you need to sustain 500 concurrent streaming users, each generating 200-token responses, with a p95 TTFT under 800 ms and p95 TPOT under 30 ms. A TPOT of 30 ms means each stream needs at least 1 / 0.030 ≈ 33 tokens/sec. If a replica sustains 90 tokens/sec in aggregate, it can serve about 2.7 concurrent streams within the TPOT budget, so 500 concurrent users need roughly 190 replicas (250 if each replica is capped at two whole streams), i.e. roughly 190–250 A100 GPUs at one replica per GPU. In practice, actual concurrency peaks drive that number: if peak concurrency is around ten times the average (an assumption to check against your own traffic), a steady-state pool of about 20–25 replicas with autoscaling up to the peak count is the practical answer. These numbers vary significantly with quantisation, sequence length, batch size, and hardware generation; treat them as an order-of-magnitude planning input, not a deployment specification.
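The replica arithmetic can be sketched as a small calculation. It assumes the quoted tokens/sec figure is aggregate decode throughput per replica, shared evenly across streams and independent of batch size, which is a simplification; the function names are hypothetical. If your benchmark figure is a per-stream rate instead, the result changes substantially.

```python
import math


def streams_per_replica(replica_tokens_per_s: float,
                        tpot_budget_ms: float) -> float:
    """Concurrent streams one replica can feed within the TPOT budget."""
    per_stream_rate = 1000.0 / tpot_budget_ms  # tokens/sec each stream needs
    return replica_tokens_per_s / per_stream_rate


def replicas_needed(concurrent_streams: int, replica_tokens_per_s: float,
                    tpot_budget_ms: float) -> int:
    """Replica count at peak, rounding up to whole replicas."""
    per_replica = streams_per_replica(replica_tokens_per_s, tpot_budget_ms)
    return math.ceil(concurrent_streams / per_replica)


per_replica = streams_per_replica(replica_tokens_per_s=90, tpot_budget_ms=30)
peak = replicas_needed(concurrent_streams=500, replica_tokens_per_s=90,
                       tpot_budget_ms=30)
print(f"{per_replica:.1f} streams per replica, {peak} replicas at peak")
```

Keeping the inputs explicit makes it easy to rerun the estimate with a measured throughput figure from your own runtime and hardware.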

Batch inference has a different profile. VRAM requirement is the same (the model still lives in GPU memory), but the batch is assembled from a pre-staged corpus and runs at maximum GPU utilisation until the corpus is exhausted. CPU and storage bandwidth are the secondary constraints — reading the input corpus fast enough to keep the GPU fed. Network bandwidth matters if the corpus lives in object storage rather than a local PVC.

Autoscaling signals: what to use, what to avoid

The central mistake is autoscaling GPU inference pods on CPU or GPU utilisation. GPU utilisation tends to be bimodal under load: it reads near zero between requests and near 100% during a forward pass. Because GPU forward passes are short relative to the interval at which the autoscaling pipeline samples and averages metrics (the HPA syncs every 15 seconds by default, and metric scrape intervals of 30–60 seconds are common), the utilisation metric arrives late and triggers scale events after the latency damage is done. KServe’s documentation and Google’s GKE best practices both recommend against CPU/memory as primary autoscaling indicators for GPU inference workloads [4].

Reliable signals, in priority order:

1. Request queue depth: the number of requests waiting for a GPU. For runtimes that expose it (e.g. vLLM exports vllm:num_requests_waiting [4]), this is the earliest available signal that capacity is exhausted. Queue depth is zero under normal load and non-zero only when the serving pool is saturated, which gives HPA or KEDA a step function to act on immediately.
2. KV-cache saturation: when the KV cache fills (vllm:kv_cache_usage_perc approaching 1.0 [4]), new requests cannot be admitted without evicting existing KV cache pages. Scale before this threshold to avoid the latency spike that follows cache pressure. A threshold of 0.8 is a reasonable starting point.
3. Request rate (for non-LLM online inference): requests per second measured by the serving runtime or the gateway. This leads GPU utilisation by several seconds and is a practical proxy when the runtime does not expose queue depth.
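To see how the first two signals combine, the sketch below applies the proportional rule the Kubernetes HPA documents (desired = ceil(current × metric / target)) to each signal and takes the larger result. It is an illustration of the rule only: the function names, the queue target of one waiting request per replica, and the replica bounds are assumptions, and the real HPA also applies a tolerance band and stabilisation windows that are left out here.

```python
import math


def desired_replicas(current: int, metric_value: float, target: float) -> int:
    """HPA-style proportional rule: ceil(current * metric / target)."""
    return max(1, math.ceil(current * metric_value / target))


def scale_decision(current: int, waiting_per_replica: float, kv_usage: float,
                   waiting_target: float = 1.0, kv_target: float = 0.8,
                   min_replicas: int = 1, max_replicas: int = 125) -> int:
    """Take the larger recommendation of the two signals, then clamp."""
    by_queue = desired_replicas(current, waiting_per_replica, waiting_target)
    by_kv = desired_replicas(current, kv_usage, kv_target)
    return min(max_replicas, max(min_replicas, by_queue, by_kv))


# Saturated pool: 3 requests waiting per replica, KV cache 90% full.
busy = scale_decision(current=4, waiting_per_replica=3.0, kv_usage=0.9)
# Quiet pool: empty queue, KV cache 40% full.
quiet = scale_decision(current=4, waiting_per_replica=0.0, kv_usage=0.4)
print(f"busy pool -> {busy} replicas, quiet pool -> {quiet} replicas")
```

In a real cluster the same targets would be expressed as autoscaler configuration rather than code; the point of the sketch is that queue depth dominates as soon as requests start waiting.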

Do not scale on:

GPU utilisation — arrives too late, causes oscillation.
CPU utilisation — irrelevant for GPU-bound workloads; the CPU idles while the GPU computes.
Memory utilisation — GPU memory is statically allocated on pod start; it does not correlate with request load.

Cold-start cost and the warm-pool pattern

Online inference has a hard cold-start problem. When a pod starts, it must load model weights from storage into GPU memory before it can serve any request. For large models this is measured in tens of seconds to minutes: a 70B-parameter model (roughly 130 GB in FP16) loaded from object storage at 5 GB/s takes at minimum 26 seconds of pure I/O, not counting container startup, CUDA initialisation, and the first CUDA kernel compilation [5]. Deployment timelines above 11 minutes have been reported for full pull-extract-start sequences under typical conditions [5].

Scale-to-zero is therefore unsuitable for latency-SLA online inference at any model size where cold-start exceeds the acceptable TTFT. A warm-pool minimum is strongly recommended: keep at least one replica running at all times, autoscale up from that floor on the queue-depth signal, and scale down with a long cooldown (typically 2–5 minutes) to avoid thrashing on short traffic dips. The cost of one idle GPU pod is small relative to the latency penalty of a cold-start during a traffic spike.
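The cold-start reasoning reduces to a lower-bound calculation. In the sketch below the function names and the 30-second overhead figure for container startup and CUDA initialisation are hypothetical placeholders; only the 130 GB model size and 5 GB/s read rate come from the example above.

```python
def min_load_seconds(model_gb: float, storage_gb_per_s: float) -> float:
    """Pure I/O lower bound for moving the weights into GPU memory."""
    return model_gb / storage_gb_per_s


def scale_to_zero_ok(cold_start_s: float, ttft_budget_s: float) -> bool:
    """Scale-to-zero only works if a cold replica still meets the TTFT bar."""
    return cold_start_s <= ttft_budget_s


io_s = min_load_seconds(model_gb=130, storage_gb_per_s=5)
overhead_s = 30.0  # assumed container start + CUDA init; measure your own
cold_start_s = io_s + overhead_s
print(f"I/O floor {io_s:.0f} s, cold start about {cold_start_s:.0f} s, "
      f"scale-to-zero acceptable: {scale_to_zero_ok(cold_start_s, 1.0)}")
```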

Batch inference is largely immune to the cold-start problem: the job scheduler can absorb model-load time as part of the job startup budget, because the latency SLO is minutes or hours, not seconds.

Where the serving runtime sits

Online inference on Kubernetes is served by a long-running Deployment or an InferenceService abstraction on top of it. The serving runtime handles request routing, batching, and GPU communication. The platform team’s job is to wire the runtime to the right autoscaler, set the resource limits, and expose the endpoint through a gateway.

Three patterns, by workload shape:

LLM online serving: a runtime with PagedAttention and continuous batching (e.g. vLLM, TGI) deployed as a Kubernetes Deployment, one replica per GPU, fronted by a gateway that does auth and rate-limiting. Autoscale with KEDA on request queue depth and KV-cache saturation. Consider the CNCF-incubating KServe project [7] as an optional abstraction layer on top if you want InferenceService-level canary and scale-to-zero semantics for non-SLA endpoints.
Non-LLM online serving (sklearn, XGBoost, ONNX, classical deep learning): a model-serving abstraction (e.g. KServe InferenceService, Seldon Core, BentoML) reduces boilerplate. These runtimes add dynamic batching, health checks, and traffic-splitting without custom code. Autoscale on request rate via HPA or KEDA.
Batch inference: a workflow orchestrator or batch framework (e.g. Argo Workflows, Ray Data) manages the job lifecycle. Workers read from an input queue, run inference, and write results to object storage. Pods are ephemeral: they start, process N shards, and exit. The autoscaler acts on the input queue depth; when the queue empties, the worker pool scales to zero.
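The batch pattern's worker lifecycle (pull a shard, run inference, write the result, exit when the queue is empty) can be sketched with an in-process queue. Everything here is a stand-in: `run_inference` is a hypothetical placeholder for a real model call, the in-memory `queue.Queue` stands in for an external work queue, and the `results` dictionary stands in for object storage.

```python
import queue


def run_inference(shard: list[str]) -> list[int]:
    """Hypothetical stand-in for a model call: returns each text's length."""
    return [len(text) for text in shard]


def batch_worker(shards: queue.Queue, results: dict[int, list[int]]) -> int:
    """Process shards until the queue is drained, then exit."""
    processed = 0
    while True:
        try:
            shard_id, shard = shards.get_nowait()
        except queue.Empty:
            return processed  # nothing left: the ephemeral worker exits
        results[shard_id] = run_inference(shard)  # "write to storage"
        shards.task_done()
        processed += 1


work: queue.Queue = queue.Queue()
for shard_id, shard in enumerate([["a", "bb"], ["ccc"], ["dddd", "e"]]):
    work.put((shard_id, shard))

results: dict[int, list[int]] = {}
count = batch_worker(work, results)
print(f"processed {count} shards, queue empty: {work.empty()}")
```

Because the worker exits when the queue is empty, an autoscaler keyed on queue depth can shrink the pool to zero without any coordination logic inside the worker itself.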

The classifier below walks through the dividing-line questions interactively and routes to the correct serving pattern, with the autoscaling and cold-start guidance from this article.

Serving Path Classifier

Answer the dividing-line questions to route your workload to one of the three serving patterns — and see which autoscaling signals to wire up and which to avoid.

Question 1

Is there a per-request latency budget a human or synchronous system will notice?

A user or upstream service waits for the response — under 100 ms p99 is online, under 1 s is interactive. For streaming LLMs the bar is TTFT under one second.
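For readers without the interactive component, the routing logic can be approximated in code. This is a sketch of the decision rules stated in this article, not the component's implementation; the function name, the returned keys, and the handling of the case where neither question is answered yes are assumptions.

```python
def classify_workload(latency_sensitive: bool, corpus_known_upfront: bool,
                      is_llm: bool) -> dict[str, object]:
    """Route a workload to one of the three serving patterns."""
    avoid = ["GPU utilisation", "CPU utilisation", "memory utilisation"]
    if latency_sensitive:  # question 1 takes precedence
        if is_llm:
            return {"pattern": "LLM online serving",
                    "scale_on": ["request queue depth", "KV-cache saturation"],
                    "avoid": avoid, "warm_pool": True}
        return {"pattern": "non-LLM online serving",
                "scale_on": ["request rate"],
                "avoid": avoid, "warm_pool": True}
    if corpus_known_upfront:  # question 2
        return {"pattern": "batch inference",
                "scale_on": ["input queue depth"],
                "avoid": avoid, "warm_pool": False}
    # The article gives no rule for this case; flag it for manual review.
    return {"pattern": "unclassified", "scale_on": [],
            "avoid": avoid, "warm_pool": False}


chatbot = classify_workload(latency_sensitive=True,
                            corpus_known_upfront=False, is_llm=True)
nightly = classify_workload(latency_sensitive=False,
                            corpus_known_upfront=True, is_llm=True)
print(chatbot["pattern"], "|", nightly["pattern"])
```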

Multi-model serving on one runtime

Running multiple models on a single serving process (multi-model serving) reduces the per-model overhead — one container, one GPU allocation, many models. The pattern makes sense under two conditions: models are small enough to share VRAM without contention, and the traffic pattern to each model is too light to justify a dedicated replica.

The practical constraints are significant. First, if any one model saturates the GPU, all co-located models experience latency degradation — the GPU is shared at the kernel-scheduling level, not at the memory level. Second, models in a multi-model server must share the same serving runtime and framework. A mixed portfolio of ONNX and PyTorch models can coexist in a runtime that supports both (Triton Inference Server handles this with its model repository concept), but LLMs and classical models cannot share a PagedAttention-based runtime without specialised support.

The case against multi-model serving for SLA-bound endpoints: it introduces coupling between independent model lifecycles. A bad model version, a memory leak, or a traffic spike from one model affects all co-tenants on the same runtime. For SLA-bound online inference, single-model deployments with dedicated GPU partitions or dedicated replicas are the safer default.

Common pitfalls

Autoscaling on GPU utilisation. GPU utilisation is a trailing indicator for latency and causes the HPA to react after the queue has already grown. Switch to queue depth or KV-cache saturation.
No warm-pool minimum for SLA-bound online endpoints. Scale-to-zero is economically appealing but operationally costly for large models. The first request after a cold scale-up is served with a latency that exceeds most SLOs.
Training and online inference sharing a GPU node. Training jobs saturate GPU memory and compute; any co-resident inference replica will see latency spikes. Keep them on separate node pools enforced by taints and node selectors.
Hard-coded model weights in the container image. This couples model lifecycle to image lifecycle and defeats model registry lineage. Weights should be pulled from a versioned artifact store at pod startup via a CSI driver or init container.
Treating batch and online as the same scheduling class. Batch jobs are preemptible and queue-managed; online replicas are not. Mixing them in the same Kubernetes namespace without quota boundaries allows a batch surge to evict or starve online inference pods.

References

[1] Kwon W. et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention.” SOSP 2023. arXiv:2309.06180. arxiv.org/abs/2309.06180
[2] Anyscale documentation, “LLM Serving Benchmarking Metrics” (TTFT, TPOT, end-to-end latency definitions). docs.anyscale.com
[3] Anyscale blog, “Achieve 23x LLM Inference Throughput” (continuous batching explainer). anyscale.com/blog/continuous-batching-llm-inference
[4] KServe documentation, “LLM Autoscaling with KEDA.” kserve.github.io — vLLM metrics reference: docs.vllm.ai/en/stable/design/metrics
[5] ScaleOps, “Reducing GPU Cold Start Times in Kubernetes: Patterns and Solutions” (LLaMA-2-70B load time example). scaleops.com
[6] Spheron, “Batch LLM Inference on GPU Cloud” (cost-per-token comparison, online vs batch). spheron.network — Anyscale batch announcement (6× cost reduction figure): anyscale.com/blog/batch-llm-inference-announcement
[7] CNCF, “KServe becomes a CNCF incubating project,” November 11 2025. cncf.io
[8] Inferless, “Exploring LLMs Speed Benchmarks — Part 2” (Llama2-7B A100 throughput, TensorRT-LLM). inferless.com
