AI Platform Engineering & MLOps · Part XIII of 34

Serving patterns for production ML — runtimes, routing, and the autoscaling signals that matter

Four serving runtime families, the predictor/transformer/explainer pipeline pattern, routing-layer trade-offs, and the autoscaling signals that hold up under real inference traffic.

12 min read·2 interactive components·9 references

General-purpose platformLLM-specialised engineFramework-nativePython-first embedded

Shipping a trained model is not deployment complete. A model sitting in a registry is not serving traffic. The runtime that receives requests, dispatches them to GPU or CPU, and returns responses is as much a part of your ML system as the training pipeline that built the model. Most teams underinvest here — they pick a serving stack once and never revisit it as the portfolio grows.

This article maps four distinct serving runtime families, explains the predictor / transformer / explainer pipeline pattern that generalises across them, covers the routing layer (gRPC vs REST vs streaming), and finishes with the autoscaling signals that actually correlate with inference load — not the ones Kubernetes defaults to. Where relevant it cross-references the workload-taxonomy distinctions introduced in Part 3 of this series.

Why serving is a distinct engineering discipline

Training optimises for throughput on a fixed dataset. Serving optimises for latency, concurrency, and cost on an unpredictable request stream. The two have opposite resource profiles: training wants to saturate the GPU for hours; online serving wants to return in milliseconds while leaving headroom for the next request. These different objectives drive different runtime choices.

A second distinction matters: serving a classical ML model (a scikit-learn classifier, an XGBoost ranker, an ONNX-exported neural network) and serving a large language model share a Kubernetes substrate but differ substantially in memory management, batching semantics, and autoscaling signals. Conflating them produces systems that are badly tuned for both.

The four serving runtime families

Four runtime families cover the space of production ML serving on Kubernetes. Each has a clear fit; none is universally correct.

Family 1 — General-purpose serving platforms

These platforms provide a Kubernetes-native abstraction — typically a Custom Resource Definition — that handles the full model endpoint lifecycle: storage loading, runtime selection, traffic splitting, autoscaling, and rollback. The canonical examples in this family are KServe and Seldon Core. KServe was accepted as a CNCF incubating project in November 2025, giving it neutral governance aligned with the broader cloud-native ecosystem.

KServe’s InferenceService CRD encodes the storage source (S3 URI, PVC, OCI image), the serving runtime (any of the families below), the traffic split, and the autoscaling policy as a single Kubernetes resource. Its built-in support for scale-to-zero via KNative makes it cost-effective for portfolios of low-traffic endpoints — a GPU pod that sits idle can return its card to the pool. Seldon Core v2 extends the pattern with composable inference pipelines (route, transform, explain, monitor) wired as directed acyclic graphs.

Fit: mixed model portfolios (sklearn, ONNX, XGBoost, PyTorch), multi-tenant serving, canary/shadow rollout requirements, or any case where the number of endpoints exceeds what a team can manage as individual Deployments.

Family 2 — LLM-specialised inference engines

Large language models expose memory and scheduling constraints that general-purpose runtimes were not designed for. KV-cache fragmentation, the variable-length decode loop, and the need to merge in-flight requests from different users all require specialised solutions. The leading runtimes in this family are vLLM and HuggingFace TGI.

vLLM introduced PagedAttention — KV-cache allocation in fixed-size virtual memory pages rather than contiguous tensors — which reduces GPU memory waste from 60–80% (in naive implementations) to under 4% [Kwon et al., SOSP 2023]. Combined with continuous (iteration-level) batching — where new requests join the batch after each forward pass rather than waiting for a batch boundary — vLLM achieves 2–4× higher throughput than prior state-of-the-art systems such as FasterTransformer and Orca at equivalent latency levels. The improvement is more pronounced under long sequences and complex decoding algorithms.

A 2025 comparative study [arXiv:2511.17593] found that vLLM leads on throughput under high-concurrency workloads while HuggingFace TGI shows an advantage in low-concurrency, single-user latency scenarios. Neither dominates across all workload shapes: choose based on your concurrency profile.

vLLM speaks the OpenAI Chat Completions API by default, so clients written against a commercial LLM API work without modification. It publishes Prometheus metrics (num_requests_running, num_requests_waiting, time_to_first_token_seconds) that are essential inputs to the autoscaling recipes described later.

Fit: self-hosted transformer LLMs (instruction-tuned or fine-tuned) at moderate to high concurrency. Use TGI if you operate primarily within the HuggingFace ecosystem and your workload is latency-dominated at low concurrency; use TensorRT-LLM + Triton for the lowest raw latency on NVIDIA hardware when you can afford the per-model engine-compilation overhead.

Family 3 — Framework-native runtimes

Some runtimes are optimised for a specific framework’s model format and deployment semantics rather than for general portability. TorchServe is the officially supported PyTorch model server (maintained by AWS and Meta in collaboration). It packages models as MAR archives, supports gRPC and REST, dynamic batching, multi-model hosting, and model versioning. NVIDIA Triton Inference Server occupies similar territory but across a wider framework surface: PyTorch, TensorFlow, ONNX, and TensorRT all sit behind one API, with concurrent model execution and model ensemble support.

Fit: TorchServe when the organisation has standardised on PyTorch and wants the reference deployment path from the framework authors. Triton when the portfolio spans multiple frameworks that need to share GPU resources efficiently, or when TensorRT-accelerated INT8/FP16 inference is already part of the pipeline.

Family 4 — Python-first embedded runtimes

When the serving artefact is a Python class with embedded preprocessing, postprocessing, and business logic rather than a pure model file, frameworks such as BentoML provide a better fit than the model-file-centric runtimes above. BentoMLpackages the serving logic, model weights, and dependencies into a self-contained Bento unit that runs identically locally and in a Kubernetes pod — removing the “works on my machine” friction from productionisation.

BentoML supports adaptive micro-batching (grouping concurrent requests without a fixed wait time), multi-framework model loading, and a runner abstraction that separates inference compute from the HTTP serving layer. This means the same code can serve locally at development time and scale horizontally on Kubernetes in production.

Fit: Python-heavy serving logic, teams that prefer co-locating preprocessing/postprocessing with the model, or projects where local-to-production parity matters more than cross-framework model portability.

Decision table — which runtime fits which workload

The following criteria map to the workload taxonomy established in Part 3 (article 11) of this series. Apply them in order: the first matching row wins.

Workload shape	Runtime family	Concrete option(s)
Transformer LLM, moderate-to-high concurrency	LLM-specialised	vLLM (high throughput) or TGI (low-latency single-user)
Multi-framework portfolio, TensorRT acceleration needed	Framework-native	Triton Inference Server
PyTorch-only models, organisation already standardised on PyTorch	Framework-native	TorchServe
Mixed portfolio (sklearn, ONNX, XGBoost, Hugging Face), multi-tenant, canary/shadow needed	General-purpose platform	KServe (CNCF-aligned) or Seldon Core (inference graphs)
Python-heavy serving logic, preprocessing embedded in application code	Python-first embedded	BentoML
Batch inference (throughput-bound, no latency SLO)	Any of the above behind a queue worker	KEDA-driven scaling on queue depth

The composer below applies the decision table interactively: pick a workload profile and it assembles the recommended runtime family, wire protocol, autoscaling signal, and the request pipeline from client to GPU.

Serving Stack Composer

Pick a workload profile. The composer assembles the runtime family, wire protocol, and autoscaling signal the decision table recommends — first matching row wins.

Request pipeline — Transformer LLM endpoint

Client

OpenAI SDK

Gateway

SSE-aware

Transformer

prompt template · guardrails

Predictor

vLLM engine

GPU

warm pool ≥ 1

Runtime family: LLM-specialised inference engine
Concrete runtime: vLLM (high throughput) or TGI (low-latency single-user)
Wire protocol: REST (OpenAI-compatible) + SSE token streaming
Autoscaling signal: KEDA on vllm:num_requests_waiting (queue depth) — a leading indicator
Why this stack: PagedAttention keeps KV-cache waste under 4% and continuous batching merges in-flight requests after every forward pass — 2–4× the throughput of runtimes without iteration-level scheduling. Keep a warm-pool minimum of 1–2 replicas: cold-starting multi-gigabyte weights takes 30–120 s.

The predictor / transformer / explainer pipeline pattern

KServeformalises a three-component pipeline pattern that generalises cleanly across serving scenarios. The same logical decomposition appears (under different names) in Seldon Core’s inference graphs and Triton’s ensemble backends.

The three components, as documented by the KServe project:

Predictor — the core component: a model and the model server that exposes it at a network endpoint. The only required component.
Transformer — handles preprocessing (raw request → model input tensor) and postprocessing (model output tensor → client-friendly response). Deployed as a separate container; can scale independently from the predictor.
Explainer — provides an alternate data plane returning model explanations alongside (or instead of) predictions. Optional; configures with the prediction endpoint as an environment variable so it can call the predictor internally.

The diagram below shows the two request paths — prediction and explanation — and the role of the transformer in both:

The practical value of this decomposition is operational independence: the transformer can be updated (e.g. a tokenisation change) without redeploying the predictor, and the predictor can be updated (e.g. a new model version) without touching the transformer. For LLM endpoints, the transformer pattern is where prompt templating, guardrail pre-checks, and output parsing belong — keeping the predictor as a pure inference engine.

The routing layer: gRPC, REST, and streaming

Every serving runtime supports at least HTTP/REST. Most support gRPC. LLM runtimes also support streaming responses (server-sent events or WebSocket). Choosing between them is a function of the client, the payload shape, and the latency requirements.

gRPC uses binary Protocol Buffer framing and persistent HTTP/2 connections. The structural advantage over JSON/REST is elimination of per-request connection setup and smaller payload size for tensor data — both of which matter when the serving pod is handling thousands of in-flight requests. For models returning small classification scores or embeddings, the difference is detectable; for LLMs where the model compute dominates, the transport becomes secondary.

REST (JSON)is the pragmatic choice for external-facing endpoints and where client diversity makes gRPC stubs impractical. vLLM’s OpenAI-compatible REST API is the de facto standard for LLM endpoints precisely because it maximises client ecosystem compatibility — existing SDKs, proxies, and observability tooling all understand it.

Streaming (server-sent events) is non-negotiable for interactive LLM applications. A user-facing chat interface has a hard dependency on token-by-token streaming — the alternative (waiting for the full completion) produces an unusable UX at typical LLM generation lengths. Both vLLM and TGI support the OpenAI streaming protocol out of the box.

The routing layer that sits in front of serving replicas (Envoy, an Inference Gateway, or the service mesh sidecar) must understand these three protocols to load-balance correctly. gRPC requires HTTP/2-aware load balancing; naive L4 TCP load balancing distributes connections, not requests, and a single long-lived gRPC connection to a slow backend stalls the client.

Autoscaling signals that hold up under real inference traffic

The default Kubernetes HPA target — CPU utilisation — is a poor proxy for inference load. GPU utilisation is worse: it is bimodal under batch inference (0% between batches, 100% during the forward pass), which causes the HPA to oscillate. The signals that correlate with actual user-visible load are application-level.

For classical ML endpoints (KServe, TorchServe, Triton): scale on request rate or concurrent in-flight requests exposed via Prometheus. The Prometheus Adapter bridges these application metrics to the Kubernetes HPA external metrics API. Set a conservative scale-down cooldown (2–5 minutes) to avoid thrash on short traffic dips.

For LLM endpoints (vLLM, TGI): KEDA with a Prometheus scaler is the recommended pattern. vLLM exports vllm:num_requests_running and vllm:num_requests_waiting metrics out of the box. Scaling on num_requests_waiting (queue depth) is a leading indicator: it rises before throughput saturates and before time-to-first-token degrades visibly. The KServe + KEDA integration is documented by the KServe project and tested in production by multiple organisations.

One hard constraint for GPU-backed online inference: maintain a warm-pool minimum. Cold-starting a pod that must load multi-gigabyte model weights from object storage can take 30–120 seconds. On a traffic spike, that cold-start cost falls inside the request path. Setting min-replicas to at least 1 (or 2 for SLA-bound services) keeps a warm pod available at all times; the cost of an idle pre-warmed GPU pod is bounded and predictable in a way that a 60-second cold-start under a user is not.

keda-scaledobject-vllm.yaml

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaler
  namespace: ml-serving
spec:
  scaleTargetRef:
    name: llm-deployment
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        metricName: vllm:num_requests_waiting
        query: sum(vllm:num_requests_waiting{namespace="ml-serving"})
        threshold: "5"

The explorer below plays the four candidate signals against three traffic shapes and shows which ones lead the load, which lag it, and which mislead the autoscaler entirely.

Autoscaling Signal Explorer

Choose a traffic shape and a candidate signal. The chart shows how the signal responds to the traffic; the verdict says whether an autoscaler driven by it would keep up.

Traffic shape

Autoscaling signal

Incoming trafficQueue depth

Leading indicatorvllm:num_requests_waiting (KEDA)

Queue depth rises before throughput saturates and before time-to-first-token degrades visibly — scaling on it adds replicas while the SLO is still intact. The recommended KEDA trigger for LLM endpoints.

On a smooth diurnal curve, set a conservative scale-down cooldown (2–5 minutes) so short dips do not thrash the replica count.

Multi-model serving trade-offs

As the number of deployed models grows, the one-Deployment-per-model pattern becomes operationally expensive. Every endpoint needs its own pod, its own GPU slice, its own health checks, and its own autoscaling configuration. Two alternatives exist:

Shared-GPU multi-model serving (Triton’s model repository pattern) — multiple models load into the same server pod and share GPU memory. Traffic routing happens at the model-name level inside the server. Reduces GPU idle time when models are lightly loaded but introduces a noisy-neighbour risk: a burst to one model can steal memory pages from others.
Scale-to-zero per model (KServe + KNative) — each model gets its own pod, but idle pods scale to zero and release the GPU. The serving platform provisions replicas on demand. Cold-start latency is the primary operational concern, managed by the warm-pool minimum described above.

The decision rule is traffic profile: if most models see irregular, bursty, low-average-rate traffic, scale-to-zero is usually more cost-effective because idle costs dominate. If models see sustained, predictable load, shared-GPU multi-model serving reduces total GPU count because models can share unused capacity.

Common pitfalls

Scaling on GPU utilisation. GPU utilisation is bimodal under batch inference and misleading for HPA. Use request queue depth or in-flight request count instead.
No warm-pool minimum for GPU-backed online serving. Cold-startinga model-loaded pod during a traffic spike puts the load time in the user’s critical path.
Using a general-purpose runtime for LLM serving without continuous batching. A runtime that issues one forward pass per request under LLM decoding workloads cannot approach the throughput of an engine with iteration-level scheduling.
L4 TCP load balancing in front of gRPC endpoints. Connection-level balancing sends all requests from a single client to a single backend. Use an HTTP/2-aware proxy (Envoy, Nginx, a service mesh) that balances at the request level.
Hard-coding model weights in the container image. This couples the model lifecycle to the image lifecycle, eliminates the value of a model registry, and makes rollback a re-build operation. Weights should be loaded from a registry-backed storage URI at pod start.
Treating the LLM serving runtime as a black box. vLLM and TGI both publish rich metrics. Ignoring them means missing early signals of KV-cache pressure, queue growth, and time-to-first-token regression.

References

[1] Kwon, W. et al. “Efficient Memory Management for Large Language Model Serving with PagedAttention.” ACM SIGOPS SOSP 2023. arxiv.org/abs/2309.06180
[2] “Comparative Analysis of Large Language Model Inference Serving Systems: A Performance Study of vLLM and HuggingFace TGI.” arXiv:2511.17593, 2025. arxiv.org/abs/2511.17593
[3] CNCF. “KServe becomes a CNCF incubating project.” 2025. cncf.io
[4] KServe Documentation. “Data Plane — Predictor, Transformer, Explainer.” kserve.github.io
[5] Red Hat Developer. “How to set up KServe autoscaling for vLLM with KEDA.” 2025. developers.redhat.com
[6] PyTorch/Serve. “TorchServe Documentation.” docs.pytorch.org/serve
[7] BentoML Project. “BentoML Documentation.” docs.bentoml.com
[8] vLLM Project. “vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention.” Blog, 2023. vllm.ai
[9] NVIDIA. “Triton Inference Server.” GitHub. github.com/triton-inference-server

Continue the Journey

AI Platform