Why share a GPU? The economics, the mechanics, the four mechanisms

By asleekgeek · 9 min read
[Figure: a GPU chip divided into coloured quadrants representing multiple isolated partitions. Caption: Four mechanisms, one GPU: the economics that make sharing unavoidable.]

A GPU that runs a training job for eight hours and then sits idle for sixteen is not a training asset — it is an expensive space heater. Industry surveys put average GPU utilisation in enterprise environments at roughly 5 percent [1], and a 2025 large-scale HPC study found that 37 percent of jobs never exceeded 15 percent GPU memory utilisation across their entire run [2]. These are not outliers. They are the normal consequence of allocating whole GPUs to single workloads in an environment where workloads rarely fill the hardware they claim.

GPU sharing is the engineering response to that waste. This article makes the economic case for sharing, then introduces the four primary mechanisms — time-slicing, CUDA Multi-Process Service (MPS), Multi-Instance GPU (MIG), and HAMi fractional GPU — and maps each against the four operational axes that determine which mechanism fits which workload.

The utilisation problem and its cost

Cloud GPU compute is priced by the hour whether or not a workload is using the silicon. On-prem, the same logic applies in amortised form: every idle GPU-hour represents depreciation consumed without corresponding output. When teams over-provision, reserving a full GPU for a workload that occupies 10–20 percent of its compute and memory, the remainder of the card is unavailable to other work even though it is sitting unused.
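The arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical price of 2.00 per allocated GPU-hour; the function name and the price are not from any provider and are only there to show how low utilisation inflates the cost of each useful hour.

```python
def effective_cost_per_useful_hour(hourly_price: float, utilisation: float) -> float:
    """Price paid per GPU-hour of work actually done.

    hourly_price: cost of one allocated GPU-hour (hypothetical figure).
    utilisation:  fraction of the allocated hour doing useful work, in (0, 1].
    """
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    return hourly_price / utilisation


# Hypothetical price, chosen only for illustration.
PRICE = 2.00
for u in (0.05, 0.15, 0.50, 1.00):
    cost = effective_cost_per_useful_hour(PRICE, u)
    print(f"utilisation {u:>4.0%}: {cost:6.2f} per useful GPU-hour")
```

At 5 percent utilisation the effective price is twenty times the list price, which is the gap that sharing tries to close.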

The waste compounds in serving workloads. A 2026 analysis of LLM serving traces from three production environments found that execution-idle intervals (periods when a GPU is allocated to a serving replica but processing no requests) accounted for 7 to 65 percent of total energy consumption, depending on load patterns [3]. That ceiling of 65 percent is not a pathological edge case; it reflects a lightly loaded serving deployment where the model is resident in VRAM but the request queue is near-empty.

NVIDIA's own operations team quantified a version of this problem internally: before deploying continuous idle-job monitoring, approximately 5.5 percent of allocated GPU-hours were being consumed by jobs that had stalled or completed but not released their allocation. After deploying automated reaping, that figure dropped to below 1 percent [4]. The point is not the specific percentage; it is that idle detection alone recovered meaningful capacity without procuring additional hardware.
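Idle detection of this kind can be approximated with very little logic. The sketch below is not NVIDIA's tooling; it assumes utilisation samples (in percent) have already been collected per job by some monitoring agent, and the threshold, window, and job names are hypothetical.

```python
from typing import Dict, List


def find_idle_jobs(
    samples: Dict[str, List[float]],
    threshold: float = 1.0,
    window: int = 6,
) -> List[str]:
    """Return job IDs whose last `window` utilisation samples are all below `threshold`.

    Jobs with fewer than `window` samples are never flagged, so a job that has
    only just started is not reaped by mistake.
    """
    idle = []
    for job_id, series in samples.items():
        recent = series[-window:]
        if len(recent) == window and all(u < threshold for u in recent):
            idle.append(job_id)
    return sorted(idle)


# Hypothetical utilisation traces, one sample per monitoring interval.
traces = {
    "train-a": [92, 95, 90, 88, 93, 91],
    "stalled-b": [85, 0, 0, 0, 0, 0, 0],
    "notebook-c": [0, 0, 40, 0, 0, 0],
}
print(find_idle_jobs(traces))
```

Requiring a full window of low samples is a deliberate choice: a single quiet interval between batches should not count as a stalled job.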

Sharing is not the only lever — better autoscaling, scale-to-zero inference, and queue-aware scheduling each address a slice of the problem. But sharing addresses the structural mismatch between GPU granularity (the smallest allocatable unit is typically one whole GPU) and workload demand (which varies continuously and rarely fills a whole GPU). Sharing subdivides the allocation unit, letting two or more workloads claim the capacity that one workload leaves on the table.
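The granularity mismatch is easy to see as a packing problem. The following sketch assumes each workload's demand can be expressed as a single fraction of one GPU (a simplification, since real demand has separate compute and memory dimensions) and uses first-fit-decreasing packing; the demand figures are hypothetical.

```python
from typing import List


def gpus_needed_whole(demands: List[float]) -> int:
    """Whole-GPU allocation: every workload claims one full GPU."""
    return len(demands)


def gpus_needed_shared(demands: List[float]) -> int:
    """Fractional allocation via first-fit-decreasing packing, capacity 1.0 per GPU."""
    free: List[float] = []  # remaining capacity on each GPU opened so far
    for demand in sorted(demands, reverse=True):
        if not 0.0 < demand <= 1.0:
            raise ValueError("each demand must be a fraction in (0, 1]")
        for i, capacity in enumerate(free):
            if demand <= capacity + 1e-9:  # small tolerance for float rounding
                free[i] = capacity - demand
                break
        else:
            free.append(1.0 - demand)  # no existing GPU fits, open a new one
    return len(free)


# Hypothetical per-workload demand, as a fraction of one GPU.
demands = [0.2, 0.15, 0.6, 0.1, 0.5, 0.3, 0.25, 0.4]
print("whole-GPU allocation:", gpus_needed_whole(demands))
print("shared allocation:   ", gpus_needed_shared(demands))
```

With these numbers, eight workloads that would each hold a full GPU fit on three shared GPUs.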

Four mechanisms at a glance

The four mechanisms differ in where they divide the GPU — in time, in software, in hardware — and those differences determine what isolation guarantees each can offer.

Time-slicing

Time-slicing is the simplest form of sharing: the GPU driver serialises access across multiple CUDA contexts, switching between them on a configurable time quantum. Each context sees the GPU as if it owns it exclusively, but only for its time slice. No hardware changes are required; the mechanism is implemented in the NVIDIA kernel driver and exposed through the device plugin configuration in a Kubernetes environment.

The misleading marketing claim: time-slicing is often described as enabling true concurrent GPU use. It does not. Context switching introduces measurable jitter. Red Hat's published analysis notes that time-slicing serialises access and is unsuitable for latency-sensitive serving workloads [5]. The NVIDIA developer forums document additional context-switch overhead at the CUDA level [6]. For batch training or offline inference with relaxed latency budgets, time-slicing is adequate. For an online inference endpoint with a P99 SLA in the tens of milliseconds, it is not.
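A toy round-robin model shows where the jitter comes from. It assumes every neighbouring context always uses its full slice and that the context switch itself is free, so real numbers will be worse; the 2 ms quantum and 5 ms of work are hypothetical values, not driver defaults.

```python
def completion_time_ms(
    work_ms: float,
    quantum_ms: float,
    busy_neighbours: int,
    slices_ahead: int = 0,
) -> float:
    """Wall-clock time for one context to finish `work_ms` of GPU work.

    busy_neighbours: other contexts that each consume a full slice per round.
    slices_ahead:    neighbour slices that run before our first slice starts.
    """
    if work_ms <= 0 or quantum_ms <= 0:
        raise ValueError("work_ms and quantum_ms must be positive")
    elapsed = slices_ahead * quantum_ms
    remaining = work_ms
    while True:
        run = min(remaining, quantum_ms)
        elapsed += run
        remaining -= run
        if remaining <= 0:
            return elapsed
        # Our slice expired: wait for every neighbour to take its turn.
        elapsed += busy_neighbours * quantum_ms


alone = completion_time_ms(5, 2, busy_neighbours=0)
best = completion_time_ms(5, 2, busy_neighbours=3)
worst = completion_time_ms(5, 2, busy_neighbours=3, slices_ahead=3)
print(f"alone: {alone} ms, shared best case: {best} ms, shared worst case: {worst} ms")
```

The same request takes anywhere between the best and worst case depending on where in the rotation it arrives, and that spread is what shows up as P99 degradation.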

CUDA Multi-Process Service (MPS)

MPS routes multiple CUDA client processes through a single server process. The MPS server submits work from all clients to the GPU concurrently, so kernel execution and memory-copy operations from different clients can genuinely overlap. On Volta-class hardware and later, each client gets a separate GPU address space, eliminating the address-space sharing risk of earlier MPS implementations. Clients still share SM scheduling resources, so one misbehaving client can affect the throughput of others. Volta-and-later MPS can cap the share of SMs a client may use (for example through the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE setting), but that is a provisioning limit rather than a hard compute quota [7].

The throughput gains from eliminating serialisation are real. Databricks reported meaningful throughput improvements when deploying MPS for small-LLM inference serving [8]. A University of North Texas study measured 0–147 percent throughput improvement depending on workload mix and concurrency level [9]. The range is wide because the gain is proportional to the idle SM cycles that concurrent kernels can backfill — a workload that already saturates the GPU sees little benefit.
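The backfill argument can be written down as an idealised upper bound. This sketch assumes each client, running alone, keeps a fixed fraction of the SMs busy, and it ignores memory-bandwidth and cache interference entirely; it is a reasoning aid under those assumptions, not a model taken from the cited studies.

```python
def ideal_overlap_speedup(sm_fraction: float, clients: int) -> float:
    """Upper bound on aggregate throughput versus running the clients one at a time.

    sm_fraction: share of SMs a single client keeps busy on its own, in (0, 1].
    clients:     number of identical clients overlapped on the GPU.
    """
    if not 0.0 < sm_fraction <= 1.0:
        raise ValueError("sm_fraction must be in (0, 1]")
    if clients < 1:
        raise ValueError("clients must be at least 1")
    # Overlap helps until the combined demand fills the GPU, then it flattens.
    return min(float(clients), 1.0 / sm_fraction)


for fraction in (1.0, 0.5, 0.25):
    speedup = ideal_overlap_speedup(fraction, clients=4)
    print(f"client uses {fraction:.0%} of SMs -> at most {speedup:.1f}x with 4 clients")
```

A client that already saturates the SMs gets a bound of 1.0x, which matches the observation that saturated workloads see little benefit.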

The misleading claim: MPS is sometimes described as providing isolation equivalent to separate GPU instances. It does not. A fault in one MPS client process can corrupt shared state and terminate all clients sharing that MPS server. MPS is appropriate for trusted, co-owned workloads — multiple replicas of the same inference service, or co-scheduled jobs from a single team — and not for multi-tenant environments where workloads must be fault-isolated from each other [7].

Multi-Instance GPU (MIG)

MIG is a hardware partitioning capability available on NVIDIA A100, H100, and Blackwell-class GPUs. It partitions a single physical GPU into up to seven fully isolated instances. Isolation is spatial and enforced in silicon: each instance's streaming multiprocessors (SMs) have separate and exclusive paths through the on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address buses [10][11]. A workload running in one MIG instance cannot observe the memory or execution of a workload in another, even if they share physical hardware.

This makes MIG the only mechanism in this set that provides hardware-enforced multi-tenant isolation. The others rely on software or driver-level boundaries. The consequence is that MIG is appropriate for regulated environments where workloads from different security domains must run on shared hardware without cross-contamination risk.

The misleading claim: MIG is sometimes presented as flexible, dynamic partitioning that can be resized on demand without disruption. In practice, changing MIG geometry requires destroying existing instances (which terminates any workloads running in them) and recreating the partition layout. This is an operational event, not a live resize. Teams that need to change their MIG configuration between training and serving phases need a planned maintenance window or a node-pool design that keeps the partition geometry fixed per pool.
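Because geometry is fixed until instances are destroyed, it helps to sanity-check a planned layout before touching a node. The sketch below uses the A100 40GB profile names and their slice counts (7 compute slices, 8 memory slices per GPU). The check is a necessary condition only: the driver also enforces placement rules that this simplified function does not model, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# (compute slices, memory slices) consumed by each profile on an A100 40GB.
A100_40GB_PROFILES: Dict[str, Tuple[int, int]] = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
COMPUTE_SLICES = 7
MEMORY_SLICES = 8


def fits_slice_budget(layout: List[str]) -> bool:
    """True if the layout stays within the GPU's compute and memory slice totals.

    This does not model the driver's placement constraints, so a True result
    means "not obviously impossible", not "guaranteed to be creatable".
    """
    compute = sum(A100_40GB_PROFILES[name][0] for name in layout)
    memory = sum(A100_40GB_PROFILES[name][1] for name in layout)
    return compute <= COMPUTE_SLICES and memory <= MEMORY_SLICES


print(fits_slice_budget(["3g.20gb", "4g.20gb"]))  # one larger and one smaller instance
print(fits_slice_budget(["1g.5gb"] * 7))          # seven small serving instances
print(fits_slice_budget(["4g.20gb", "4g.20gb"]))  # exceeds the compute slices
```

Moving a node between two valid layouts still means tearing down the first one, which is why a fixed geometry per node pool is usually simpler to operate.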

MIG is also hardware-gated. It is not available on T4, L4, or V100 GPUs. Clusters that mix MIG-capable and non-MIG GPUs require a mechanism-per-pool strategy: MIG on the A100/H100-class nodes, time-slicing or MPS on the nodes that cannot be partitioned in hardware.

HAMi fractional GPU

HAMi (Heterogeneous AI Computing Virtualization Middleware) is a CNCF Sandbox project, accepted in August 2024 [12]. It implements fractional GPU allocation through a software virtualization layer: a shared library (libvgpu.so) is injected into container processes via LD_PRELOAD and intercepts CUDA driver and NVML API calls before they reach the hardware. The interception layer enforces per-container GPU memory and compute limits set by the scheduler, without requiring any hardware-level partitioning support.

This makes HAMi the most broadly compatible mechanism in this set. Because it operates at the user-space API layer rather than in hardware, it works on any CUDA-capable GPU — including those that do not support MIG. An L4 or T4 node that cannot be partitioned in hardware can still host fractional GPU workloads under HAMi.

The misleading claim: because HAMi intercepts at the API level, it is sometimes described as providing isolation equivalent to MIG. It does not. The isolation is enforced in software, not hardware. A workload that bypasses or subverts the LD_PRELOAD injection — for example, through a statically linked CUDA binary or a privileged container that replaces the library — can escape the limits. HAMi is appropriate for trusted, cooperative workloads where strict hardware-enforced isolation between untrusted tenants is not required.
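The strength and the weakness of interception-based limits can be shown with a pure-Python analogy. Nothing here is HAMi code: RawAllocator stands in for the driver, QuotaInterceptor stands in for an interposed library, and both class names are hypothetical.

```python
class RawAllocator:
    """Stand-in for the driver: hands out memory with no per-tenant limit."""

    def __init__(self) -> None:
        self.allocated_mb = 0

    def alloc(self, size_mb: int) -> int:
        self.allocated_mb += size_mb
        return size_mb


class QuotaInterceptor:
    """Stand-in for an interposed library: same call shape, plus a quota check."""

    def __init__(self, backend: RawAllocator, limit_mb: int) -> None:
        self._backend = backend
        self.limit_mb = limit_mb
        self.used_mb = 0

    def alloc(self, size_mb: int) -> int:
        if self.used_mb + size_mb > self.limit_mb:
            raise MemoryError(f"quota of {self.limit_mb} MB exceeded")
        self.used_mb += size_mb
        return self._backend.alloc(size_mb)


driver = RawAllocator()
tenant = QuotaInterceptor(driver, limit_mb=4096)

tenant.alloc(3000)  # within quota, forwarded to the driver
try:
    tenant.alloc(2000)  # would exceed the quota, rejected by the interceptor
except MemoryError as err:
    print("blocked:", err)

# A caller that reaches the backend directly never sees the quota check.
driver.alloc(2000)
print("driver total:", driver.allocated_mb, "MB; interceptor believes:", tenant.used_mb, "MB")
```

The limit holds only for callers that go through the interceptor, which is the same reason software-enforced limits suit cooperative workloads rather than untrusted tenants.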

The isolation matrix: four mechanisms × four axes

The four axes below are the operational properties that determine which mechanism fits which workload context. The table is a decision aid, not a ranking — there is no universally superior mechanism.

Memory isolation — whether one workload's VRAM is inaccessible to another at the hardware or driver level.

Compute isolation — whether one workload's SM utilisation is bounded and cannot starve another's execution.

Fault isolation — whether a crash or OOM in one workload terminates other workloads sharing the same GPU.

Latency jitter — whether co-located workloads introduce unpredictable latency variability into each other's inference path.

Isolation matrix (text table)
Mechanism      | Memory iso. | Compute iso. | Fault iso. | Latency jitter
---------------|-------------|--------------|------------|---------------
Time-slicing   | None        | None         | Partial    | High
MPS            | Partial*    | None         | None       | Low-Medium
MIG            | Hardware    | Hardware     | Hardware   | Negligible
HAMi           | Software    | Software     | Partial    | Low-Medium

* MPS on Volta+ provides separate address spaces per client,
  but all clients share one MPS server process — a server crash
  terminates all clients.

A few cells deserve explanation:

  • Time-slicing fault isolation is marked Partial because a GPU-reset event triggered by one context will typically clear all contexts on that GPU. Workloads in other time-slices are therefore exposed to each other's crash behaviour, but not to each other's in-flight data (contexts do not share address space).
  • HAMi fault isolation is Partial for the same reason plus an additional caveat: if the intercept library itself faults, all workloads sharing that library instance may be affected.
  • MIG's Negligible latency jitter rating reflects the hardware-isolated L2 cache and memory controller paths. In practice, workloads in different MIG instances do not contend for memory bandwidth, which is the dominant source of jitter under concurrent inference load.
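For use in tooling, the matrix can be captured as plain data and filtered by requirement. The sketch below transcribes the table's ratings as they appear above; the axis keys and the candidates function are hypothetical names introduced for illustration.

```python
from typing import Dict, List, Set

ISOLATION_MATRIX: Dict[str, Dict[str, str]] = {
    "Time-slicing": {"memory": "None", "compute": "None", "fault": "Partial", "jitter": "High"},
    "MPS": {"memory": "Partial", "compute": "None", "fault": "None", "jitter": "Low-Medium"},
    "MIG": {"memory": "Hardware", "compute": "Hardware", "fault": "Hardware", "jitter": "Negligible"},
    "HAMi": {"memory": "Software", "compute": "Software", "fault": "Partial", "jitter": "Low-Medium"},
}


def candidates(**acceptable: Set[str]) -> List[str]:
    """Mechanisms whose rating on every named axis is in the acceptable set.

    Example: candidates(memory={"Hardware"}, jitter={"Negligible"}).
    """
    return [
        name
        for name, row in ISOLATION_MATRIX.items()
        if all(row[axis] in allowed for axis, allowed in acceptable.items())
    ]


# A tenant that needs enforced memory limits but tolerates software enforcement.
print(candidates(memory={"Hardware", "Software"}))
# A latency-sensitive endpoint that cannot accept high jitter.
print(candidates(jitter={"Negligible", "Low-Medium"}))
```

Keeping the ratings as data rather than as branching logic makes it straightforward to revise a cell without touching the selection code.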

Hardware availability gates the choice

Not every mechanism is available on every GPU. MIG requires Ampere (A100), Hopper (H100, H200), or Blackwell-class hardware [11]. Time-slicing and MPS are available on any CUDA-capable GPU. HAMi's LD_PRELOAD architecture places no hardware requirement beyond CUDA support, though its effective memory enforcement depends on the GPU driver's NVML implementation.

In a cluster with mixed GPU models (A100 nodes alongside T4 or L4 nodes, for example), the practical approach is to fix the sharing mechanism per node pool rather than per workload. That means MIG on A100/H100 pools and HAMi or time-slicing on the pools that lack MIG support. This avoids per-node mechanism negotiation and makes scheduling predicates simpler: a workload requiring hardware-isolated VRAM targets the MIG pool; a workload tolerating software isolation targets the HAMi pool.
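The per-pool rule reduces to a small lookup. In this sketch the pool names are hypothetical, the set of MIG-capable models lists only the ones this article names, and HAMi is assumed as the mechanism for every pool without MIG support.

```python
from typing import Dict, List

# Models named in this article as MIG-capable; extend for your own fleet.
MIG_CAPABLE_MODELS = {"A100", "H100", "H200"}


def pool_mechanism(gpu_model: str, fallback: str = "HAMi") -> str:
    """Fix one sharing mechanism per pool, based only on the pool's GPU model."""
    return "MIG" if gpu_model in MIG_CAPABLE_MODELS else fallback


def target_pools(pools: Dict[str, str], needs_hardware_isolation: bool) -> List[str]:
    """Pools a workload may target, given whether it needs hardware-isolated VRAM."""
    wanted_is_mig = needs_hardware_isolation
    return sorted(
        name
        for name, model in pools.items()
        if (pool_mechanism(model) == "MIG") == wanted_is_mig
    )


# Hypothetical node pools: pool name -> GPU model.
pools = {"pool-a100": "A100", "pool-t4": "T4", "pool-l4": "L4"}
print(target_pools(pools, needs_hardware_isolation=True))
print(target_pools(pools, needs_hardware_isolation=False))
```

In a real cluster the same decision would typically be expressed as node labels and selectors; the point here is only that the predicate depends on the pool, not on the individual node.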

Matching mechanism to workload

Three workload archetypes dominate production AI platforms:

  • Large distributed training runs — these typically benefit least from sharing. A well-utilised training job that fills a node's GPUs is already efficient. Sharing here is counterproductive: it adds scheduling complexity without recovering significant idle capacity. Gang scheduling (ensuring all GPUs in a distributed job start together) matters more than partitioning. See the preceding article in this series on queue-aware scheduling for the relevant patterns.
  • Online inference serving — typical VRAM utilisation for a serving replica depends heavily on model size and batch configuration. Many small-to-medium model serving deployments use 20–60 percent of a GPU's VRAM, leaving substantial capacity unused. MPS (for trusted co-tenancy) or MIG (for isolated multi-tenancy) both apply here. Time-slicing is generally inappropriate because context-switch jitter degrades P99 latency.
  • Notebook and experimentation workloads — these are the primary source of idle-GPU waste. A notebook that allocates a GPU on launch and then sits idle for hours while the user reads documentation is a textbook over-provisioning case. Time-slicing or HAMi fractional allocation constrains the impact of any one notebook on shared capacity. Fault isolation matters less here because notebook workloads are interactive and ephemeral.

The general principle: stricter isolation requirements justify higher mechanism cost. MIG is operationally heavier than time-slicing: it requires node-level partition configuration, a compatible GPU operator, and a planned change process for geometry changes. That cost is justified when tenants are from different security domains or when VRAM leakage between workloads is an unacceptable risk. For workloads where those risks are low, a lighter mechanism is preferable.

What this series covers next

This article has introduced the mechanisms and framed the problem. The next article in this series (article 27, gpu-sharing-decision-tree) works through a structured decision tree that routes a specific workload and cluster context to the appropriate mechanism, including the trap branches — the configurations that look correct but fail under production conditions.

References

  1. "5% GPU utilization: The $401 billion AI infrastructure problem enterprises can't keep ignoring." VentureBeat, April 2026. https://venturebeat.com/infrastructure/5-gpu-utilization-the-401-billion-ai-infrastructure-problem-enterprises-cant-keep-ignoring
  2. "Analyzing GPU Utilization in HPC Workloads: Insights from Large-Scale Systems." ACM PEARC 2025. https://dl.acm.org/doi/10.1145/3708035.3736010
  3. "The Energy Cost of Execution-Idle in GPU Clusters." arXiv:2604.04745, April 2026. https://arxiv.org/abs/2604.04745
  4. "Making GPU Clusters More Efficient with NVIDIA Data Center Monitoring Tools." NVIDIA Technical Blog. https://developer.nvidia.com/blog/making-gpu-clusters-more-efficient-with-nvidia-data-center-monitoring/
  5. "Sharing is caring: How to make the most of your GPUs (part 1 — time-slicing)." Red Hat Blog. https://www.redhat.com/en/blog/sharing-caring-how-make-most-your-gpus-part-1-time-slicing
  6. NVIDIA Developer Forums: "CUDA context switching overhead of current GPU." https://forums.developer.nvidia.com/t/cuda-context-switching-overhead-of-current-gpu/65918
  7. NVIDIA CUDA Multi-Process Service (MPS) Documentation (official). https://docs.nvidia.com/deploy/mps/
  8. "Scaling Small LLMs with NVIDIA MPS." Databricks Engineering Blog. https://www.databricks.com/blog/scaling-small-llms-nvidia-mps
  9. "Granularity- and Interference-Aware GPU Sharing with MPS." University of North Texas CSRL. https://engineering.unt.edu/cse/research/labs/csrl/files/Granularity_Alex.pdf
  10. "Getting the Most Out of the NVIDIA A100 GPU with Multi-Instance GPU." NVIDIA Technical Blog. https://developer.nvidia.com/blog/getting-the-most-out-of-the-a100-gpu-with-multi-instance-gpu/
  11. NVIDIA MIG User Guide (official). https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
  12. CNCF HAMi Project Page (accepted August 2024). https://www.cncf.io/projects/hami/
Note on the isolation matrix: HAMi's Low-Medium latency-jitter rating reflects the overhead of its LD_PRELOAD interception layer. Every CUDA memory-allocation call passes through libvgpu.so before reaching the driver, adding a small but non-zero per-call cost. The layer adds no context switch of its own and, unlike time-slicing, does not serialise execution. The jitter it introduces is therefore bounded by API-call overhead rather than by context-switch latency, which keeps it lower than time-slicing but measurably above MIG's hardware-isolated baseline.
