Multi-tenancy on a shared AI platform — quotas, fairness, and the noisy-neighbour problem

Multi-tenancy layers: compute, network, storage, and control-plane isolation on a shared Kubernetes cluster.
A shared AI platform is only useful if the teams sharing it can trust it. That trust has two components: isolation — your workloads are not silently affected by mine — and fairness — I cannot grab all the GPUs while you are waiting in queue. Getting both right on a single Kubernetes cluster is harder than it looks, and the failure modes are specific enough that each deserves a name.
This article maps the four isolation dimensions that matter on an AI platform, explains the Kubernetes primitives available for each, and works through the decision rule for the two primary tenancy models. GPU-specific fairness — fractional scheduling, gang scheduling, and queue depth management — is a sufficiently deep topic that it gets its own treatment in Part 7 (article 28 onward); this article frames the problem and hands off cleanly.
The four isolation dimensions
Multi-tenancy on Kubernetes is not a single knob. It is the intersection of four independent isolation axes, and a gap in any one of them can make the others irrelevant:
- Compute isolation — one tenant's pods cannot exhaust CPU, memory, or GPU capacity at the expense of another. Enforced by ResourceQuota, LimitRange, and priority classes.
- Network isolation — tenant A's pods cannot reach tenant B's pods unless an explicit policy permits it. Enforced at Layer 3/4 by NetworkPolicy objects (CNI-dependent) and at Layer 7 by a service mesh.
- Storage isolation — PersistentVolumeClaims are namespace-scoped and not accessible across tenants; StorageClass selection determines the underlying provisioner and its own access-control model.
- Control-plane isolation — a tenant's service accounts cannot list or mutate objects in other tenants' namespaces. Enforced by RBAC with namespace-scoped Role and RoleBinding objects. The Kubernetes multi-tenancy documentation calls control-plane isolation the most important type.
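As a minimal sketch of the control-plane axis, a namespace-scoped Role and RoleBinding might look like the following. The role name, the group name `team-alpha-developers`, and the resource list are assumptions chosen for illustration; the real set depends on what your tenants are allowed to manage.

```yaml
# Hypothetical namespace-scoped Role: grants access only inside team-alpha.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-alpha-developer
  namespace: team-alpha
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# Binds the Role to an assumed identity-provider group.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-alpha-developer-binding
  namespace: team-alpha
subjects:
  - kind: Group
    name: team-alpha-developers   # assumption: group name from your IdP
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-alpha-developer
  apiGroup: rbac.authorization.k8s.io
```

Because both objects are namespaced, nothing here grants visibility into another tenant's namespace; a ClusterRoleBinding would, which is why tenants should not receive one.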
Tenancy model decision rule: namespace-per-tenant vs cluster-per-tenant
The two dominant tenancy models for an internal AI platform are namespace isolation (multiple teams share one cluster, separated by namespaces) and cluster isolation (each major tenant owns a dedicated cluster, joined by a shared control plane or federation layer). Neither is universally correct; the choice follows from at least four criteria.
Choose namespace-per-tenant when:
- Teams are mutually trusted (same organisation, same threat model). Namespace boundaries rely on Kubernetes RBAC, which is a software gate, not a hardware one. If a tenant can run arbitrary container code and you cannot trust that code, namespace isolation alone is insufficient.
- Operational leverage matters more than hard isolation. A single cluster means a single API server upgrade path, a single NVIDIA GPU Operator DaemonSet, one observability stack, and one node-pool autoscaler. The ops overhead per team scales sub-linearly instead of linearly.
- Burst sharing is a feature. If team A's training run finishes early and team B has a pending job, a shared cluster lets team B consume the slack. Separate clusters make that sharing explicit and expensive — you need a federation layer or manual node migration.
- You can enforce ResourceQuota and priority classes consistently. Namespace-per-tenant works when an ops team owns the quota allocation and can update it without each tenant having cluster-admin. [2]
Choose cluster-per-tenant when:
- Regulatory or contractual requirements demand hard isolation of compute or data residency. Financial services and healthcare contexts may require that training data for tenant A is physically separate from tenant B's compute boundary, not merely namespace-separated.
- A tenant requires privileged access to the cluster (cluster-admin, custom admission controllers, mutating webhooks that could affect other namespaces). Granting that access on a shared cluster undermines isolation for every other tenant.
- Blast-radius risk is unacceptable. A cluster-level misconfiguration — a broken admission webhook, a node kernel panic triggered by a specific workload — affects all tenants on a shared cluster. If tenant independence is worth the ops overhead, separate clusters enforce it structurally rather than by policy.
- The cluster's cost-per-tenant is already high enough that dedicated hardware is the natural unit. GPU clusters with expensive, scarce hardware favour sharing; CPU-only inference clusters may not. The Kubernetes blog's three-tenancy-model survey puts linear ops overhead as the defining cost of cluster-per-tenant. [3]
A hybrid approach, with one shared cluster for trusted internal teams and separate clusters for externally-facing or regulated workloads, is a common practical outcome. Virtual clusters (a Kubernetes-native layer that provisions lightweight API servers inside an existing cluster) are a newer point on the spectrum: they offer stronger isolation than namespaces at lower overhead than fully separate clusters, and are worth evaluating when the namespace model becomes insufficient but the cluster-per-tenant cost is prohibitive.
Compute isolation: ResourceQuota and LimitRange
```yaml
# Illustrative ResourceQuota for one team namespace.
# Tune limits to your capacity model; these values are not universal defaults.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "40"
    limits.cpu: "80"
    requests.memory: 200Gi
    limits.memory: 400Gi
    # Extended resources such as GPUs accept only the requests. prefix in a quota.
    requests.nvidia.com/gpu: "4"
    persistentvolumeclaims: "20"
    requests.storage: 2Ti
```

A common mistake is to treat GPU quota like CPU quota. GPUs are an extended resource and cannot be overcommitted: a pod may specify a GPU limit alone, in which case Kubernetes uses it as the request, and if both are specified they must be equal. The scheduler places pods on GPU requests, so the request is what counts against capacity. For the same reason, ResourceQuota accepts only the requests. prefix for extended resources; requests.nvidia.com/gpu is the single line that caps a namespace's GPU count, and there is no separate limit ceiling that could act as a burst buffer.
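The section heading also names LimitRange, so here is a minimal sketch of one for the same namespace. The values are assumptions for illustration, not recommendations. When a ResourceQuota constrains CPU and memory requests or limits, pods that omit those fields are rejected at admission unless a LimitRange supplies defaults.

```yaml
# Hypothetical per-container defaults and ceilings for team-alpha.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-alpha-defaults
  namespace: team-alpha
spec:
  limits:
    - type: Container
      defaultRequest:      # applied when a container sets no request
        cpu: 500m
        memory: 1Gi
      default:             # applied when a container sets no limit
        cpu: "2"
        memory: 4Gi
      max:                 # upper bound for any single container
        cpu: "16"
        memory: 64Gi
```

The defaults keep quota accounting working for teams that forget to set resources; the max keeps one container from claiming a whole node's CPU or memory.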
Scheduling fairness: QoS classes and priority
```yaml
# Three-tier PriorityClass hierarchy for an AI platform namespace model.
# Adjust values to fit your cluster's existing system priority range.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-inference-production
value: 100000
# Never means these pods do not preempt others. They are protected from the
# lower tiers by their higher value, not by this field; only higher-priority
# pods (for example system-critical ones) can preempt them.
preemptionPolicy: Never
globalDefault: false
description: "Live inference services. Highest tenant tier; does not preempt other pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-training-scheduled
value: 50000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Scheduled training jobs. May preempt interactive experiments."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-interactive
value: 10000
preemptionPolicy: PreemptLowerPriority
globalDefault: true
description: "Notebooks and ad-hoc experiments. Preemptible."
```

The noisy-neighbour failure mode in compute isolation almost always originates from a team that either runs without sensible LimitRange defaults or submits a batch job with artificially high priority. Quota catches the aggregate; priority classes catch the scheduling queue. Both together enforce the boundary; neither alone is sufficient.
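One way to combine the two mechanisms is a ResourceQuota scoped to a PriorityClass, which caps how much a namespace may run at a given priority. The sketch below is an assumption about how a platform team might apply it; the numbers are illustrative. Priority-scoped quotas track only pods, CPU, memory, and ephemeral-storage, so GPU counts still belong in the unscoped namespace quota.

```yaml
# Hypothetical cap on how much team-alpha may run at the top priority tier.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-inference-priority
  namespace: team-alpha
spec:
  hard:
    pods: "10"
    requests.cpu: "20"
  scopeSelector:
    matchExpressions:
      - scopeName: PriorityClass
        operator: In
        values: ["ai-inference-production"]
```

This limits use of the class only in namespaces where such a quota exists. Denying a class outright in namespaces without a matching quota requires additional admission configuration on the API server.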
Network isolation: NetworkPolicy and service mesh
The default-deny pattern is the correct starting posture for a multi-tenant AI platform: apply a NetworkPolicy with an empty podSelector (matching all pods in the namespace) and no ingress or egress rules. This isolates every pod in the namespace from all other pods cluster-wide. You then layer allow-rules on top for the specific communication your workload requires — internal service-to-service, egress to an external model registry, ingress from an API gateway namespace.
```yaml
# Default-deny baseline for a tenant namespace.
# Apply this first; then add targeted allow policies on top.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-alpha
spec:
  podSelector: {}   # matches all pods in namespace
  policyTypes:
    - Ingress
    - Egress
  # No ingress or egress rules => all traffic blocked.
```

The operational cost of running a service mesh should not be underestimated: it adds a sidecar proxy to every pod (or a shared node-level proxy in sidecarless designs such as Istio's ambient mode), adds another layer to monitor and debug, and requires careful certificate rotation. For strictly internal, trusted teams, NetworkPolicy alone may be sufficient. The service mesh becomes worth the overhead when workloads cross trust boundaries, for example when a shared inference gateway in one namespace routes to model servers owned by separate teams with independent security postures.
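Returning to the default-deny baseline, the allow-rules layered on top might look like the sketch below. The gateway namespace name `inference-gateway`, the `app: model-server` label, and port 8080 are assumptions for illustration. The DNS rule relies on the conventional `k8s-app: kube-dns` label in `kube-system`, which you should verify on your cluster.

```yaml
# Allow every pod in team-alpha to resolve DNS (blocked by default-deny egress).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
---
# Allow the shared gateway namespace to reach model servers only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-ingress
  namespace: team-alpha
spec:
  podSelector:
    matchLabels:
      app: model-server          # hypothetical workload label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: inference-gateway   # hypothetical namespace
      ports:
        - protocol: TCP
          port: 8080             # hypothetical serving port
```

NetworkPolicy rules are additive, so these policies widen the baseline without replacing it. If the gateway namespace is itself default-deny, it needs a matching egress rule towards team-alpha.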
Namespace hierarchies and cross-namespace quota propagation
The Hierarchical Namespace Controller (HNC) project has been archived, so the pattern it implemented, hierarchical policy propagation, now has to be reached by other means. The two practical paths are: (1) platform tooling that generates and reconciles namespace-level policies from a higher-level abstraction (a GitOps repo of team manifests, a custom operator, or a policy-management layer like Kyverno or OPA Gatekeeper used to propagate baseline policies); and (2) virtual clusters, which give each team its own Kubernetes API server while sharing the underlying node infrastructure, effectively replacing the propagation problem with a cleaner isolation boundary.
Operational note: if your platform was already using HNC v1.1.0, the project entering archived status means no new features or security patches. Evaluate whether the propagation pattern you relied on can be replicated by your policy layer before upgrading Kubernetes to a version where HNC compatibility breaks.
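As one possible shape for the policy-layer path, a Kyverno generate rule can stamp a baseline NetworkPolicy into every tenant namespace. This is a sketch under assumptions: the namespace label `platform.example.com/tenant` is hypothetical, and field names should be checked against the Kyverno version you run.

```yaml
# Hypothetical Kyverno policy: create default-deny in each labelled tenant namespace.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: propagate-default-deny
spec:
  rules:
    - name: generate-default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
              selector:
                matchLabels:
                  platform.example.com/tenant: "true"   # assumed tenant marker
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny-all
        namespace: "{{request.object.metadata.name}}"
        synchronize: true    # keep the generated policy in sync with this rule
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress
```

The same approach extends to ResourceQuota and LimitRange baselines, which covers most of what hierarchical propagation was used for.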
The GPU fairness problem: a preview
The compute isolation mechanics described above (ResourceQuota, LimitRange, PriorityClass) handle CPU and memory well. GPUs introduce a harder problem, because the Kubernetes default scheduler treats a GPU as an indivisible integer unit. A team with a quota of four GPUs and a four-GPU training job holds all four for the entire run, even if actual utilisation is below 20% for most of it.
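To make the integer-unit point concrete, here is a minimal sketch of a training pod. The pod name, image, and CPU and memory figures are hypothetical. The GPU count must be a whole number, and because GPUs are an extended resource the request defaults to the limit.

```yaml
# Hypothetical four-GPU training pod in team-alpha.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-example
  namespace: team-alpha
spec:
  priorityClassName: ai-training-scheduled
  restartPolicy: Never
  containers:
    - name: trainer
      image: registry.example.com/team-alpha/trainer:latest   # hypothetical image
      resources:
        requests:
          cpu: "8"
          memory: 64Gi
        limits:
          cpu: "16"
          memory: 128Gi
          nvidia.com/gpu: 4   # whole devices only; held until the pod exits
```

Once scheduled, the pod keeps all four devices regardless of how busy they are, which is the idle-hold pattern a GPU-aware scheduling layer is meant to address.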
Two compounding problems make GPU fairness distinct from CPU fairness:
- GPU time-slicing shares a physical GPU between multiple pods with no memory isolation. One pod can exhaust GPU memory and cause CUDA out-of-memory errors in co-resident pods. [10] This makes time-slicing appropriate for inference but not for training jobs that make large, unpredictable memory allocations.
- Distributed training jobs (e.g. multi-node, multi-GPU) require gang scheduling: all workers must be schedulable simultaneously or none should be scheduled. A partial allocation blocks the GPUs it holds without making progress, degrading fairness for every other tenant.
Addressing GPU fairness requires a specialised scheduler layer — one that understands gang semantics, can implement per-team GPU quotas that account for fractional sharing, and can provide queue-depth fairness across teams with heterogeneous job sizes. Part 7 of this series (beginning at article 28) covers that scheduling stack in detail, including fractional GPU mechanisms and gang-scheduling primitives (e.g. Volcano, Kueue, and related schedulers). The patterns described in this article — namespace isolation, ResourceQuota, and PriorityClass — are the necessary prerequisite layer, not the complete solution.
Observability across tenant boundaries
Multi-tenancy without per-tenant observability is hard to operate: when a fairness incident occurs, you need to identify quickly which namespace is consuming above quota, which queue priority class is being abused, and whether a network policy gap is generating unexpected cross-namespace traffic.
The minimal observable set for a multi-tenant AI platform namespace includes: CPU and memory request-versus-usage delta per namespace (detects underprovisioned LimitRange defaults), GPU utilisation per namespace and per pod (exposes idle-hold patterns that ResourceQuota, which only checks requests at admission, cannot see), and NetworkPolicy hit counts per namespace (surfaces misconfigured policies silently dropping traffic). Usage figures come from the kubelet's cAdvisor metrics (metrics-server exposes the same usage data through the Metrics API), requested amounts from kube-state-metrics, GPU counters from the DCGM exporter, and policy hit counts from the CNI plugin where it exposes them. All of these are plumbed into your metrics pipeline (e.g. Prometheus, or a managed equivalent) and visualised with namespace as a label dimension.
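A minimal sketch of Prometheus recording rules for these signals follows. It assumes kube-state-metrics, cAdvisor, and the DCGM exporter are already scraped, and that GPU samples carry a `namespace` label, which depends on how the exporter and your relabelling are configured. The rule names are illustrative.

```yaml
# Hypothetical Prometheus rule file: per-namespace request, usage, and GPU signals.
groups:
  - name: tenant-namespace-usage
    rules:
      # CPU cores requested per namespace (from kube-state-metrics).
      - record: namespace:cpu_requests_cores:sum
        expr: sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
      # CPU cores actually used per namespace (from cAdvisor).
      - record: namespace:cpu_usage_cores:rate5m
        expr: sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
      # Mean GPU utilisation per namespace (from the DCGM exporter).
      - record: namespace:gpu_utilisation_percent:avg
        expr: avg by (namespace) (DCGM_FI_DEV_GPU_UTIL)
```

Plotting the first two series against each other shows the request-versus-usage delta; a namespace holding GPUs with persistently low utilisation on the third is the idle-hold pattern described above.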
Summary: the isolation stack
A working multi-tenancy posture for a shared AI platform combines all four isolation dimensions. Control-plane isolation (RBAC with namespace-scoped roles) prevents tenant A from reading or mutating tenant B's objects. Compute isolation (ResourceQuota + LimitRange + PriorityClass) prevents one team from crowding out another in the scheduler queue or exhausting node resources. Network isolation (default-deny NetworkPolicy, CNI enforcement, optional service mesh for cryptographic identity) prevents lateral movement between tenant namespaces. Storage isolation (namespace-scoped PVCs, StorageClass access control) prevents cross-tenant data access at the storage layer.
None of these mechanisms is optional if you intend the platform to be shared in a meaningful sense. They are also layered: a gap in one dimension — an overly permissive NetworkPolicy, a missing LimitRange — does not invalidate the others, but it does create a surface that an over-allocated or misconfigured workload will eventually exploit unintentionally. The noisy-neighbour problem on an AI platform is almost always accidental rather than adversarial; the defence is comprehensive policy coverage, not paranoia.
References
[1] Kubernetes Project. "Multi-tenancy." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/security/multi-tenancy/
[2] Kubernetes Project. "Resource Quotas." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/policy/resource-quotas/
[3] Kubernetes Blog. "Three Tenancy Models For Kubernetes." April 2021. https://kubernetes.io/blog/2021/04/15/three-tenancy-models-for-kubernetes/
[4] Kubernetes Project. "Limit Ranges." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/policy/limit-range/
[5] Kubernetes Project. "Quality of Service for Pods." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
[6] Kubernetes Project. "Pod Priority and Preemption." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/
[7] Kubernetes Project. "Network Policies." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/services-networking/network-policies/
[8] Istio Project. "Mutual TLS Migration." Istio Documentation, 2024. https://istio.io/latest/docs/tasks/security/authentication/mtls-migration/
[9] kubernetes-sigs. "Hierarchical Namespace Controller (HNC)." GitHub, last release v1.1.0 June 2023, archived April 2025. https://github.com/kubernetes-sigs/hierarchical-namespaces
[10] Loft Labs (vCluster). "GPU Multitenancy in Kubernetes: Strategies and Solutions." 2024. https://www.vcluster.com/blog/gpu-multitenancy-kubernetes-strategies
About the Author

asleekgeek
Senior Developer, Architect, DevOps
Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist