Picking a GPU-sharing mechanism — a decision tree

9 min read · asleekgeek

Four mechanisms, one decision tree — know which branch you are on before you configure.

Choosing a GPU-sharing mechanism is not a configuration detail — it is an architectural commitment with implications for isolation, observability, and operational overhead that are difficult to reverse once workloads are running. The four main mechanisms available today — time-slicing, CUDA Multi-Process Service (MPS), Multi-Instance GPU (MIG), and HAMi — each optimise for a different point in the space of workload mix, hardware generation, and isolation requirement. Picking the wrong one typically surfaces weeks later as either wasted capacity (over-isolated, under-utilised) or a noisy-neighbour incident (under-isolated, over-shared).

This article walks through a decision tree with eight decision nodes. Each leaf names the mechanism and the one trade-off the reader is explicitly accepting. The "no sharing" leaf is included as a legitimate answer — single-tenant assignment is correct for some workloads, and the decision tree should say so rather than pressuring toward sharing by default.

The four mechanisms at a glance

Before the decision tree, a crisp characterisation of each mechanism is necessary — because the marketing framing around each one conceals a significant caveat.

Time-slicing

Time-slicing exposes N logical nvidia.com/gpu resources per physical GPU; the GPU's own time-slice scheduler, driven by the NVIDIA driver rather than by the Linux kernel's process scheduler, interleaves the CUDA contexts of the pods that hold those replicas. The mechanism is configured via the NVIDIA device plugin's time-slicing ConfigMap or via the GPU Operator's ClusterPolicy. The marketing claim is "multiply your GPU count." The caveat the marketing does not lead with: there is no memory isolation and no fault isolation between replicas. NVIDIA's own documentation states explicitly that "unlike Multi-Instance GPU (MIG), there is no memory or fault-isolation between replicas" and that "if one workload crashes, they all do" [1].
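As a rough illustration of what that configuration contains, the sketch below builds a time-slicing ConfigMap as a Python dictionary and prints it as JSON. The ConfigMap name `time-slicing-config`, the data key `any`, the `gpu-operator` namespace, and the replica count of 4 are assumptions chosen for the example, not values prescribed by this article.

```python
import json

def time_slicing_configmap(replicas: int, name: str = "time-slicing-config",
                           namespace: str = "gpu-operator") -> dict:
    """Build a ConfigMap holding one device-plugin config under the key 'any'."""
    if replicas < 2:
        raise ValueError("sharing a GPU needs at least 2 replicas")
    plugin_config = {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": "nvidia.com/gpu", "replicas": replicas}]
            }
        },
    }
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # The plugin config is stored as a string; JSON is also valid YAML,
        # so json.dumps is used here to avoid a third-party YAML dependency.
        "data": {"any": json.dumps(plugin_config, indent=2)},
    }

cm = time_slicing_configmap(replicas=4)
print(json.dumps(cm, indent=2))
```

With a replica count of 4, each physical GPU would be advertised as four schedulable nvidia.com/gpu resources. GPU memory is not divided between them: every replica can still see, and exhaust, the whole device.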

CUDA Multi-Process Service (MPS)

MPS is a client/server architecture in which a persistent server process (nvidia-cuda-mps-server) holds a single shared CUDA context and multiplexes GPU operations from multiple client processes concurrently rather than via context-switching. The result is lower context-switch overhead and better SM utilisation for workloads with many small kernels — a meaningful gain for batched inference serving.

On Volta-generation hardware and later, MPS adds per-client address space isolation and limited execution resource provisioning for quality of service [2]. The caveat: fault isolation is still incomplete — a fatal error in one client process can affect other clients sharing the same MPS server [2]. MPS is not a substitute for MIG when hard isolation is a requirement. Kubernetes integration is available via the NVIDIA device plugin's sharing.mps configuration, though the operational complexity of the control daemon adds a management surface.
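One way to picture the "limited execution resource provisioning" is as per-client environment variables. The sketch below assembles the environment for a hypothetical MPS client process. `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` and `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT` are variables described in the MPS documentation [2]; the helper name, the 25% share, and the 4 GB limit are assumptions for illustration only.

```python
import os

def mps_client_env(thread_pct: int, mem_limit_gb: int, device: int = 0) -> dict:
    """Return a copy of the current environment with MPS client limits set."""
    if not 0 < thread_pct <= 100:
        raise ValueError("thread_pct must be in the range 1..100")
    env = dict(os.environ)
    # Upper bound on the share of SM threads this client may occupy (Volta+).
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(thread_pct)
    # Per-device memory limit, written as "<device ordinal>=<size>".
    env["CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"] = f"{device}={mem_limit_gb}G"
    return env

env = mps_client_env(thread_pct=25, mem_limit_gb=4)
print(env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"],
      env["CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"])
```

The resulting dictionary could be passed as `env=` to `subprocess.Popen` when launching the client. These settings bound resource use; they do not close the fault-isolation gap described above.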

Multi-Instance GPU (MIG)

MIG partitions supported GPU hardware at the silicon level into up to seven independent instances, each with dedicated memory bandwidth, L2 cache, and compute engines. Memory, fault, and bandwidth isolation are all hardware-enforced — not software-emulated. A CUDA OOM in one MIG instance cannot affect another.

The hard constraint: MIG is hardware-gated. NVIDIA's MIG User Guide lists the supported SKUs as A100 (40 GB and 80 GB), A30, H100, H200 141 GB, B200 180 GB, and GB200 186 GB, all of them Ampere architecture (compute capability ≥ 8.0) or later [3]. Being Ampere or later is necessary but not sufficient: A10G is an Ampere part and L4 is an Ada Lovelace part, and neither supports MIG. T4, V100, and every other Turing-or-older GPU are excluded as well. The marketing claim is "true hardware partitioning." That claim is accurate, but only on the hardware it applies to.
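The hardware gate can be expressed as a small lookup. This is a minimal sketch that checks a GPU product string against the SKU list quoted from [3]; the function name and the example product strings are assumptions, and real label values may be formatted differently.

```python
import re

# SKU tokens taken from the supported-GPU list cited in this article [3].
MIG_SKUS = {"A100", "A30", "H100", "H200", "B200", "GB200"}

def supports_mig(product: str) -> bool:
    """True if any token of the product string is a listed MIG-capable SKU."""
    # Split on anything that is not a letter or digit so that "A10G" is
    # compared as a whole token and is not mistaken for "A100".
    tokens = re.split(r"[^A-Za-z0-9]+", product.upper())
    return any(tok in MIG_SKUS for tok in tokens)

for name in ["NVIDIA-A100-SXM4-80GB", "NVIDIA-A10G", "Tesla-T4", "NVIDIA-L4"]:
    print(name, supports_mig(name))
```

Matching whole tokens rather than substrings is the design choice that matters here: an architecture-based test ("is it Ampere or newer?") would wrongly admit A10G and L4.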

HAMi (Heterogeneous AI Computing Virtualisation Middleware)

HAMi is a CNCF Sandbox project, accepted on 21 August 2024 [4]. It implements a scheduler extender plus a modified device plugin that exposes memory and compute-core allocation as first-class Kubernetes schedulable resources. A pod can request, for example, 4000 MiB of GPU memory and 25% of compute cores; HAMi enforces the memory ceiling at the CUDA-call interception layer and applies a soft rate limit on compute. The result is hard memory isolation without hardware partitioning — meaningful on GPU SKUs that do not support MIG. HAMi's CNCF Sandbox status signals active community governance, but the project is smaller than the GPU Operator's ecosystem; verify maintenance cadence before committing it to a production path.

The caveat: compute isolation is soft. The memory cap is enforced via CUDA API interception; the compute cap is a rate limit that can be saturated by adversarial or misbehaving workloads. HAMi does not provide fault isolation — a process-level crash can still affect co-located tenants.
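To make the request shape concrete, the sketch below builds a pod manifest asking for 4000 MiB of GPU memory and 25% of compute, matching the example above. The resource names `nvidia.com/gpumem` and `nvidia.com/gpucores` are HAMi's defaults as far as I am aware and are configurable per installation, so treat them as an assumption; the pod name and image are hypothetical.

```python
import json

def hami_pod(name: str, image: str, mem_mib: int, core_pct: int) -> dict:
    """Build a pod manifest requesting a slice of one GPU through HAMi."""
    if not 0 < core_pct <= 100:
        raise ValueError("core_pct must be in the range 1..100")
    limits = {
        "nvidia.com/gpu": 1,              # one (virtual) GPU
        "nvidia.com/gpumem": mem_mib,     # hard memory ceiling, in MiB
        "nvidia.com/gpucores": core_pct,  # soft compute share, in percent
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {"name": "main", "image": image, "resources": {"limits": limits}}
            ]
        },
    }

pod = hami_pod("inference-a", "example.com/inference:latest", 4000, 25)
print(json.dumps(pod, indent=2))
```

The two numbers are not equally strong guarantees: the memory figure is enforced as a ceiling, while the core percentage is only a rate-limit target.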

The decision tree

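The tree has eight decision nodes and seven leaves that resolve to five distinct outcomes: MIG, MPS, HAMi, time-slicing (with two delivery variants), and "no sharing" (reachable by two paths). Read it top to bottom: the first question is always the strongest constraint.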

gpu-sharing-decision-tree.mermaid
graph TB
    N1{"1. Is hard isolation\nrequired?\n(multi-org, regulatory,\nor compliance mandate)"}
    N2{"2. Is the GPU\nA100 / A30 / H100\nor newer Ampere+?"}
    N3{"3. Workload\nprofile?"}
    N4{"4. Is memory\nisolation required\nbetween tenants?"}
    N5{"5. Is driver lifecycle\nmanaged by the\nplatform operator?"}
    N6{"6. Is SM utilisation\nthe bottleneck?\n(many small kernels,\nbatched inference)"}
    N7{"7. Can tenants\ntolerate soft compute\nisolation?"}
    N8{"8. Accept reduced\nfault isolation?"}

    LEAF_NOSHAR["No sharing —\nsingle tenant per GPU\nTrade-off: capacity reserved,\nno noisy-neighbour risk"]
    LEAF_MIG["MIG\nTrade-off: fixed profile\nsizes, admin overhead\non re-partitioning"]
    LEAF_NOSHAR2["No sharing —\nsingle tenant per GPU\n(MIG not available;\nhard isolation requires\nsingle tenancy)"]
    LEAF_MPS["MPS\nTrade-off: incomplete\nfault isolation; control\ndaemon overhead"]
    LEAF_HAMI["HAMi\nTrade-off: soft compute\ncap; smaller ecosystem"]
    LEAF_TS_OP["Time-slicing via\nGPU Operator\nTrade-off: no isolation;\ndriver-stack coupling"]
    LEAF_TS_DIR["Time-slicing via\ndirect device plugin\nTrade-off: no isolation;\nDCGM not bundled"]

    N1 -->|"Single tenant —\nno sharing needed"| LEAF_NOSHAR
    N1 -->|"Hard isolation required"| N2
    N2 -->|"Yes — Ampere+"| LEAF_MIG
    N2 -->|"No — Volta/Turing/older"| LEAF_NOSHAR2
    N1 -->|"Sharing OK,\nno hard isolation mandate"| N3
    N3 -->|"Throughput batch\n(training, preprocessing)"| N5
    N3 -->|"Latency-sensitive\nmulti-tenant inference"| N4
    N4 -->|"Memory isolation needed"| N7
    N4 -->|"Soft jitter acceptable"| N5
    N7 -->|"Yes"| LEAF_HAMI
    N7 -->|"No — hard compute\nisolation also required"| N2
    N5 -->|"Yes — full stack\ncontrol (on-prem)"| N6
    N5 -->|"No — cloud-managed\ndriver"| LEAF_TS_DIR
    N6 -->|"Yes"| N8
    N6 -->|"No"| LEAF_TS_OP
    N8 -->|"Yes"| LEAF_MPS
    N8 -->|"No"| LEAF_TS_OP

    classDef decision fill:#FFB627,stroke:#8B5A00,color:#000
    classDef leaf fill:#06A77D,stroke:#024D3B,color:#fff
    classDef noshar fill:#4A5568,stroke:#2D3748,color:#fff
    class N1,N2,N3,N4,N5,N6,N7,N8 decision
    class LEAF_MIG,LEAF_MPS,LEAF_HAMI,LEAF_TS_OP,LEAF_TS_DIR leaf
    class LEAF_NOSHAR,LEAF_NOSHAR2 noshar
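
The routing in the diagram can also be read as a plain function. The following is a sketch of the same tree, not a tool shipped with any of the mechanisms; the `Inputs` field names and the returned strings are hypothetical labels chosen to mirror the node and leaf names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Inputs:
    single_tenant: bool = False               # node 1, "single tenant" edge
    hard_isolation: bool = False              # node 1, "hard isolation" edge
    mig_capable: bool = False                 # node 2
    latency_sensitive: bool = False           # node 3 (False = throughput batch)
    memory_isolation: bool = False            # node 4
    soft_compute_ok: bool = True              # node 7
    operator_owns_driver: bool = True         # node 5
    sm_bound: bool = False                    # node 6
    reduced_fault_isolation_ok: bool = False  # node 8

def _mig_or_single(x: Inputs) -> str:
    # Node 2: without MIG, hard isolation means one tenant per GPU.
    return "MIG" if x.mig_capable else "no sharing"

def choose(x: Inputs) -> str:
    if x.single_tenant:
        return "no sharing"
    if x.hard_isolation:
        return _mig_or_single(x)
    if x.latency_sensitive and x.memory_isolation:
        return "HAMi" if x.soft_compute_ok else _mig_or_single(x)
    if not x.operator_owns_driver:
        return "time-slicing (direct device plugin)"
    if x.sm_bound and x.reduced_fault_isolation_ok:
        return "MPS"
    return "time-slicing (GPU Operator)"

print(choose(Inputs(latency_sensitive=True, memory_isolation=True)))
print(choose(Inputs(hard_isolation=True, mig_capable=False)))
```

The order of the `if` statements carries the same meaning as the top-to-bottom reading of the diagram: isolation questions are settled before any throughput or operational question is asked.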

Walking each branch

Branch 1 — Single tenant, no sharing needed

If the workload will exclusively occupy a GPU for its full duration and no other tenant will share the device, none of the sharing mechanisms add value — they add complexity. Training runs that saturate GPU memory and compute on a single node, long-horizon fine-tuning jobs, or any workload where the team is paying for a dedicated node pool should take this branch. The device plugin in its default mode assigns one physical GPU per container; no ConfigMap is required. The trade-off the reader is accepting is that capacity is reserved exclusively, which reduces peak utilisation on sparsely scheduled nodes. For workloads that genuinely use the full GPU, this is not a trade-off — it is the correct operating point.

Branch 2 — Hard isolation required, Ampere+ hardware → MIG

Regulated multi-organisation environments, where a CUDA OOM in one tenant's workload crashing another tenant's inference is an unacceptable outcome, require hardware isolation. On supported hardware (A100, A30, H100, H200, B200, GB200), MIG provides it. The GPU Operator's MIG manager applies the partition layout selected by each node's nvidia.com/mig.config label, and the mig.strategy field in the ClusterPolicy (single or mixed) controls how the resulting instances are advertised to the scheduler. The trade-off the reader is accepting is that MIG profile sizes are fixed (NVIDIA defines the available profiles, e.g. 1g.10gb, 2g.20gb, 3g.40gb on an A100-80GB) and that re-partitioning requires draining the node: workloads must be evicted before profiles can change. MIG is not elastic; profile selection is a planning exercise.
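Because profile selection is a planning exercise, a quick arithmetic check helps before committing to a layout. The sketch below parses profile names such as 3g.40gb and tests a proposed mix against an assumed budget of 7 compute slices and 80 GB for one A100-80GB. It is a necessary condition only: real MIG placement has additional position rules that this check does not model, so a passing mix still has to be validated against NVIDIA's supported configurations.

```python
import re

# Assumed budget for a single A100-80GB.
COMPUTE_SLICES = 7
MEMORY_GB = 80
PROFILE = re.compile(r"^(\d+)g\.(\d+)gb$")

def within_budget(profiles: list[str]) -> bool:
    """Check that a profile mix does not exceed the slice and memory totals."""
    compute = memory = 0
    for p in profiles:
        m = PROFILE.match(p)
        if m is None:
            raise ValueError(f"not a MIG profile name: {p!r}")
        compute += int(m.group(1))  # "3g" -> 3 compute slices
        memory += int(m.group(2))   # "40gb" -> 40 GB
    return compute <= COMPUTE_SLICES and memory <= MEMORY_GB

print(within_budget(["3g.40gb", "2g.20gb", "1g.10gb", "1g.10gb"]))
print(within_budget(["3g.40gb", "3g.40gb", "1g.10gb"]))
```

The second mix fits the compute budget (7 slices) but asks for 90 GB, which is why it is rejected.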

Branch 3 — Hard isolation required, pre-Ampere hardware → no sharing

If a hard isolation mandate applies and the hardware does not support MIG, there is no software-only mechanism that provides equivalent guarantees: time-slicing and MPS both have documented fault isolation gaps [1][2], and HAMi does not provide fault isolation either. The correct answer in this case is single-tenant assignment: one workload per GPU, with node-level isolation between tenants achieved through Kubernetes node taints and RBAC on node pools. This is a capacity cost, not a configuration problem. The takeaway for the platform team is that if regulated multi-tenancy at scale is a requirement, the hardware procurement case for MIG-capable GPUs becomes straightforward.
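A minimal sketch of the taint-based separation follows. It prints the `kubectl taint` command for a tenant's node and builds a pod that tolerates that taint and requests one whole GPU. The key `example.com/tenant`, the node name, the tenant name, and the image are all hypothetical, and the sketch assumes the nodes also carry a label with the same key and value so that the nodeSelector matches.

```python
import json

TAINT_KEY = "example.com/tenant"  # hypothetical taint and label key

def taint_command(node: str, tenant: str) -> str:
    """Command that reserves a node for one tenant."""
    return f"kubectl taint nodes {node} {TAINT_KEY}={tenant}:NoSchedule"

def dedicated_pod(name: str, tenant: str, image: str) -> dict:
    """Pod pinned to the tenant's pool and holding one whole GPU."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": {TAINT_KEY: tenant},
            "tolerations": [{
                "key": TAINT_KEY, "operator": "Equal",
                "value": tenant, "effect": "NoSchedule",
            }],
            "containers": [{
                "name": "main", "image": image,
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }],
        },
    }

print(taint_command("gpu-node-1", "acme"))
print(json.dumps(dedicated_pod("train-1", "acme", "example.com/train:latest"), indent=2))
```

The taint keeps other tenants' pods off the node, and the nodeSelector keeps this tenant's pods on it; both halves are needed, and RBAC still has to restrict who may add the toleration.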

Branch 4 — Latency-sensitive inference, memory isolation needed → HAMi

Multi-tenant inference serving where multiple models share a single GPU and each model needs a bounded memory allocation — but where the full hardware isolation of MIG is either unavailable (wrong SKU) or unnecessary — is the core HAMi use case. HAMi's CUDA API interception layer enforces the memory ceiling even against misbehaving processes. The trade-off the reader is accepting: compute isolation is a soft rate limit, not a hardware cap. A tenant that generates a high volume of small CUDA kernels can saturate compute even if memory is bounded. If both memory and compute isolation are hard requirements on non-MIG hardware, the correct answer remains single-tenant assignment (branch 3 logic applies).

Branch 5 — Cloud-managed driver → time-slicing via direct device plugin

On managed Kubernetes services where the host GPU driver is pre-installed and managed by the cloud provider, deploying the GPU Operator's driver lifecycle manager creates a version conflict with the pre-installed driver. The safe path is the direct device plugin DaemonSet (nvidia/k8s-device-plugin) with a time-slicing ConfigMap. The trade-off: DCGM telemetry (GPU utilisation metrics, tensor-core activity) is not bundled — it must be deployed as a standalone DaemonSet with its own ServiceMonitor, and the label selectors must be manually aligned with whatever Prometheus deployment is in use. This is a common misconfiguration point. There is no memory or fault isolation in this path [1].
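The label-alignment problem is easy to check mechanically. The sketch below applies matchLabels semantics, as used by the Prometheus Operator's serviceMonitorSelector, to a ServiceMonitor's labels. The label values shown (`release: kube-prometheus-stack` and so on) are assumptions about a typical installation, not something this article specifies.

```python
def selected(match_labels: dict, object_labels: dict) -> bool:
    """matchLabels semantics: every selector pair must be present on the object."""
    return all(object_labels.get(k) == v for k, v in match_labels.items())

# Hypothetical selector from the Prometheus custom resource.
prometheus_selector = {"release": "kube-prometheus-stack"}

# Hypothetical labels on two candidate ServiceMonitors for the DCGM exporter.
aligned = {"app": "dcgm-exporter", "release": "kube-prometheus-stack"}
misaligned = {"app": "dcgm-exporter", "release": "monitoring"}

print(selected(prometheus_selector, aligned))
print(selected(prometheus_selector, misaligned))
```

A ServiceMonitor that fails this check is silently ignored rather than reported as an error, which is why the misconfiguration tends to show up as missing GPU metrics rather than as a failed deployment.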

Branch 6 — Full stack control, SM utilisation is the bottleneck → MPS

When the platform team controls the full driver stack, the workload profile is batched inference with many small concurrent CUDA kernels, and Streaming Multiprocessor utilisation is the bottleneck (not memory bandwidth), MPS can improve throughput by eliminating context-switch overhead between processes. The control daemon adds an operational surface: nvidia-cuda-mps-control must be running on each GPU node, and its lifecycle needs to be managed alongside driver upgrades. On Volta and later architectures, per-client address space isolation reduces (but does not eliminate) blast radius from client faults [2]. The trade-off the reader is explicitly accepting: incomplete fault isolation. If a client process encounters a fatal GPU error, other clients sharing the MPS server may be affected.

Branch 7 — Full stack control, SM not the bottleneck → time-slicing via GPU Operator

For general-purpose multi-tenant sharing on a cluster where the platform team owns the driver stack (developer notebook environments, batch evaluation queues, small fine-tuning runs), time-slicing delivered via the GPU Operator is the lowest-friction option. The Operator bundles driver management, toolkit, device plugin, MIG manager, DCGM exporter, and Node Feature Discovery into a single ClusterPolicy-driven stack. The time-slicing replica count is set per resource name in a ConfigMap; the ConfigMap can hold several named configurations, so different node pools (for example, one per GPU model) can run different replica counts. The trade-off: no isolation of any kind, and the driver-stack coupling means GPU Operator upgrades require coordinated node drains. For single-team or low-contention clusters where fairness rather than hard isolation is the requirement, this trade-off is usually acceptable.
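To show how the ConfigMap is wired into the Operator, the sketch below builds a merge patch that points the ClusterPolicy's device plugin at a named ConfigMap and a default configuration key, then prints the corresponding kubectl command. The ClusterPolicy name `cluster-policy`, the ConfigMap name, and the key `any` are assumptions about a default installation.

```python
import json
import shlex

def cluster_policy_patch(configmap: str, default_key: str) -> dict:
    """Merge patch selecting a device-plugin ConfigMap and its default entry."""
    return {
        "spec": {
            "devicePlugin": {
                "config": {"name": configmap, "default": default_key}
            }
        }
    }

patch = cluster_policy_patch("time-slicing-config", "any")
cmd = [
    "kubectl", "patch", "clusterpolicies.nvidia.com/cluster-policy",
    "--type", "merge", "-p", json.dumps(patch),
]
# shlex.join quotes the JSON payload so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Nodes that need a configuration other than the default can be pointed at another key of the same ConfigMap through a node label, which is how per-pool replica counts are usually expressed.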

The isolation matrix

The following table maps each mechanism against the four isolation axes and surfaces the one misleading marketing claim for each.

isolation-matrix.txt
Mechanism          | Memory isolation | Compute isolation | Fault isolation | Bandwidth isolation | Misleading claim
-------------------|------------------|-------------------|-----------------|---------------------|------------------------------------------
Time-slicing       | None             | None              | None            | None                | "Multiply your GPU count"
MPS                | Partial*         | Partial (Volta+)  | Incomplete**    | None                | "Concurrent kernel execution"
HAMi               | Hard (enforced)  | Soft (rate-limit) | None            | None                | "Memory isolation without MIG hardware"
MIG                | Hardware         | Hardware          | Hardware        | Hardware            | "True partitioning" (hardware-gated only)

* MPS Volta+: per-client address space. Pre-Volta: shared address space.
** A fatal error in one MPS client can affect co-located clients on the same server.

Reading this table: the only row with hardware-enforced isolation across all four axes is MIG — and that row applies only to supported SKUs. Every other mechanism involves at least one software-only or absent isolation guarantee.
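The same table can be queried as data. This sketch encodes the matrix rows and returns the mechanisms whose isolation is enforced (hard or hardware) on every axis a tenant requires; the dictionary layout and the function name are illustrative choices, and the cell values are transcribed from the table.

```python
MATRIX = {
    "time-slicing": {"memory": "none", "compute": "none",
                     "fault": "none", "bandwidth": "none"},
    "MPS":          {"memory": "partial", "compute": "partial",
                     "fault": "incomplete", "bandwidth": "none"},
    "HAMi":         {"memory": "hard", "compute": "soft",
                     "fault": "none", "bandwidth": "none"},
    "MIG":          {"memory": "hardware", "compute": "hardware",
                     "fault": "hardware", "bandwidth": "hardware"},
}

# Only these cell values count as an enforced guarantee.
ENFORCED = {"hard", "hardware"}

def candidates(required_axes: set[str]) -> list[str]:
    """Mechanisms that enforce isolation on every required axis."""
    return [
        name for name, row in MATRIX.items()
        if all(row[axis] in ENFORCED for axis in required_axes)
    ]

print(candidates({"memory"}))
print(candidates({"memory", "fault"}))
```

Treating "partial" and "soft" as not enforced is deliberate: it matches the way the decision tree treats them when a hard requirement is on the table.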

Trap branches — where teams go wrong

Trap 1: using time-slicing for regulated multi-tenant inference

The most common error is to mistake time-slicing's "multiply GPU count" capability for an isolation mechanism and to deploy it where a compliance requirement actually mandates memory or fault isolation. The decision tree blocks this: node 1 (isolation required?) routes before the sharing-mechanism selection. If the answer to node 1 is "yes, hard isolation is required" and the hardware cannot support MIG, the tree exits to single-tenant assignment, not to a software-sharing mechanism with known gaps.

Trap 2: deploying MPS for fault-sensitive inference serving

MPS improves SM utilisation and is a reasonable choice for controlled, low-fault-probability workloads, but it is not a suitable sharing mechanism for production multi-tenant serving where one tenant's fatal GPU fault must be invisible to all others. The fault isolation gap is documented [2]. Teams that choose MPS for latency-sensitive inference without reading the fault-isolation note often discover the limitation during an incident rather than during architecture review.

Trap 3: deploying the GPU Operator on cloud-managed driver nodes

The GPU Operator's driver lifecycle manager installs and manages the NVIDIA driver. Managed Kubernetes services on hyperscalers pre-install a specific driver version as part of the node image. When both are present on the same node, they conflict. The correct path on managed services is the direct device plugin without the Operator's driver component, or — if DCGM telemetry is needed — the GPU Operator deployed with driver.enabled=false and toolkit.enabled=false. The decision tree surfaces this via node 5 ("driver lifecycle managed by the platform operator?").
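The two overrides translate directly into Helm `--set` flags. The sketch below flattens a nested values dictionary into those flags and prints a full command line. The release name, the chart reference `nvidia/gpu-operator`, and the namespace are assumptions about how the chart repository was added locally.

```python
def set_flags(values: dict, prefix: str = "") -> list[str]:
    """Flatten nested values into Helm '--set key.path=value' arguments."""
    flags: list[str] = []
    for key, value in values.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flags += set_flags(value, prefix=f"{path}.")
        else:
            # Helm expects lowercase true/false for booleans.
            rendered = str(value).lower() if isinstance(value, bool) else str(value)
            flags += ["--set", f"{path}={rendered}"]
    return flags

values = {"driver": {"enabled": False}, "toolkit": {"enabled": False}}
cmd = ["helm", "upgrade", "--install", "gpu-operator", "nvidia/gpu-operator",
       "--namespace", "gpu-operator"] + set_flags(values)
print(" ".join(cmd))
```

Disabling only the driver and toolkit components leaves the rest of the Operator, including the DCGM exporter, in place, which is the point of this variant.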

Trap 4: treating HAMi's soft compute cap as equivalent to MIG's hard compute isolation

HAMi's memory cap is hard. HAMi's compute cap is a rate limit applied via CUDA API interception — it is software-enforced, not hardware-enforced. A workload that generates bursts of CUDA operations can transiently exceed the compute allocation until the rate limiter catches up. In adversarial or high-contention scenarios, the soft cap provides predictability for well-behaved workloads but is not a hard guarantee. Teams deploying HAMi for compute-sensitive latency SLAs should benchmark the compute enforcement under their specific workload pattern before committing.

Applying the decision in practice

A useful exercise before reaching for the tree is to characterise the workload cluster along three axes: (1) the strictest isolation requirement any single tenant imposes — this is the binding constraint, not the average; (2) the GPU SKU mix across the cluster, because heterogeneous nodes may require different mechanisms per node pool; (3) who controls the driver lifecycle. These three inputs resolve most of the tree's decision nodes.
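The first axis is a maximum, not a mean, which a few lines make explicit. In this sketch the isolation levels, their ordering, and the tenant names are all hypothetical; the only point is that one strict tenant sets the requirement for any GPU it shares.

```python
from enum import IntEnum

class Isolation(IntEnum):
    NONE = 0    # fairness only
    MEMORY = 1  # bounded memory per tenant
    HARD = 2    # memory and fault isolation mandated

def binding_constraint(requirements: dict[str, Isolation]) -> Isolation:
    """The strictest requirement of any tenant governs the shared pool."""
    return max(requirements.values(), default=Isolation.NONE)

tenants = {
    "notebooks": Isolation.NONE,
    "batch-eval": Isolation.NONE,
    "payments-inference": Isolation.HARD,
}
print(binding_constraint(tenants).name)
```

If the strict tenant can be moved to its own node pool, the remaining tenants can be re-evaluated with a weaker binding constraint, which is often the cheaper outcome.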

Mixed GPU fleets are common. An organisation may have both MIG-capable Ampere-generation nodes and Turing-generation nodes (time-slicing, MPS, or HAMi only). The decision is then made per node pool: a nodeSelector on a node label (e.g. nvidia.com/gpu.product) routes workloads to the correct pool. The GPU Operator's GPU Feature Discovery component, which builds on Node Feature Discovery, can automatically label nodes with GPU product and architecture data, which makes the per-pool routing maintainable.
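Per-pool routing can be sketched as a grouping step followed by a selector. The example below groups nodes by their nvidia.com/gpu.product label and builds the nodeSelector a workload would use to target one pool. The node names and label values are made up for illustration.

```python
from collections import defaultdict

LABEL = "nvidia.com/gpu.product"

def pools(nodes: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Group node names by GPU product label."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for node, labels in nodes.items():
        grouped[labels.get(LABEL, "unlabelled")].append(node)
    return dict(grouped)

def node_selector(product: str) -> dict[str, str]:
    """nodeSelector fragment that targets one GPU product pool."""
    return {LABEL: product}

# Hypothetical cluster inventory.
nodes = {
    "node-a": {LABEL: "NVIDIA-A100-SXM4-80GB"},
    "node-b": {LABEL: "NVIDIA-A100-SXM4-80GB"},
    "node-c": {LABEL: "Tesla-T4"},
    "node-d": {},
}
print(pools(nodes))
print(node_selector("Tesla-T4"))
```

Nodes that end up in the "unlabelled" group are worth investigating first, since a missing label usually means the discovery component is not running there.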

One more operational note: re-partitioning MIG on a live node requires draining the node of workloads. MIG profile sizes are set at node initialisation time (via the GPU Operator MIG manager ConfigMap) and changing them after the fact means an eviction cycle. Build this into cluster expansion planning — it is not a zero-downtime operation for existing workloads on that node.

Where this fits in the series

This decision tree assumes the reader has read Article 26 — GPU sharing mechanisms overview, which introduces the four mechanisms without the routing logic. The two articles that follow go deeper on specific leaves: Article 28 — HAMi fractional GPU covers HAMi's architecture and operational model in full; Article 29 — MIG configuration strategy covers MIG profile selection, the single-vs-mixed strategy modes, and the re-partitioning lifecycle.

References

[1] NVIDIA Corporation. "Time-Slicing GPUs in Kubernetes." NVIDIA GPU Operator documentation, 2024. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html

[2] NVIDIA Corporation. "CUDA Multi-Process Service (MPS)." NVIDIA MPS documentation (including Architecture section). https://docs.nvidia.com/deploy/mps/

[3] NVIDIA Corporation. "Multi-Instance GPU User Guide — Supported GPUs." NVIDIA Data Center documentation, 2024. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/

[4] Cloud Native Computing Foundation. "HAMi project page — accepted to CNCF Sandbox 21 August 2024." https://www.cncf.io/projects/hami/. Formal sandbox proposal: https://github.com/cncf/sandbox/issues/97

Tags

#gpu-sharing #decision-tree #mig #mps #hami #series:ai-platform-mlops #series-order/27
