Multi-tenancy on a shared AI platform — quotas, fairness, and the noisy-neighbour problem

9 min read · asleekgeek

Multi-tenancy layers: compute, network, storage, and control-plane isolation on a shared Kubernetes cluster.

A shared AI platform is only useful if the teams sharing it can trust it. That trust has two components: isolation — your workloads are not silently affected by mine — and fairness — I cannot grab all the GPUs while you are waiting in queue. Getting both right on a single Kubernetes cluster is harder than it looks, and the failure modes are specific enough that each deserves a name.

This article maps the four isolation dimensions that matter on an AI platform, explains the Kubernetes primitives available for each, and works through the decision rule for the two primary tenancy models. GPU-specific fairness — fractional scheduling, gang scheduling, and queue depth management — is a sufficiently deep topic that it gets its own treatment in Part 7 (article 28 onward); this article frames the problem and hands off cleanly.

The four isolation dimensions

Multi-tenancy on Kubernetes is not a single knob. It is the intersection of four independent isolation axes, and a gap in any one of them can undermine the protection the others provide:

  • Compute isolation: one tenant's pods cannot exhaust CPU, memory, or GPU capacity at the expense of another. Enforced by ResourceQuota, LimitRange, and priority classes.
  • Network isolation: tenant A's pods cannot reach tenant B's pods unless an explicit policy permits it. Enforced at Layer 3/4 by NetworkPolicy objects (CNI-dependent) and at Layer 7 by a service mesh.
  • Storage isolation: PersistentVolumeClaims are namespace-scoped and not accessible across tenants; StorageClass selection determines the underlying provisioner and its own access-control model.
  • Control-plane isolation: a tenant's service accounts cannot list or mutate objects in other tenants' namespaces. Enforced by RBAC with namespace-scoped Role and RoleBinding objects. The Kubernetes multi-tenancy documentation describes authorisation as the most important type of isolation for the control plane. [1]
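
As a rough illustration of the control-plane axis, a namespace-scoped Role and RoleBinding might look like the sketch below. The namespace team-alpha, the group team-alpha-developers, and the resource list are assumptions chosen for illustration, not a recommended baseline.

```yaml
# Hypothetical tenant role: full access to common workload objects,
# but only inside the team-alpha namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-developer
  namespace: team-alpha
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]
- apiGroups: ["apps", "batch"]
  resources: ["deployments", "jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# Binds the role to an identity-provider group (name is an assumption).
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-developer-binding
  namespace: team-alpha
subjects:
- kind: Group
  name: team-alpha-developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-developer
  apiGroup: rbac.authorization.k8s.io
```

Because both objects are namespaced, nothing here grants access to any other tenant's namespace; cluster-wide objects such as ResourceQuota allocation stay with the platform team.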

The official Kubernetes multi-tenancy documentation distinguishes between "soft" tenancy — trusting teams who share an organisation — and "hard" tenancy — untrusting external customers. [1] An AI platform serving internal ML teams sits squarely in soft-tenancy territory, but that does not mean isolation is optional: a misconfigured training job can saturate cluster networking or exhaust etcd watch budget just as readily as a hostile tenant.

Tenancy model decision rule: namespace-per-tenant vs cluster-per-tenant

The two dominant tenancy models for an internal AI platform are namespace isolation (multiple teams share one cluster, separated by namespaces) and cluster isolation (each major tenant owns a dedicated cluster, joined by a shared control plane or federation layer). Neither is universally correct; the choice follows from at least four criteria.

Choose namespace-per-tenant when:

  1. Teams are mutually trusted (same organisation, same threat model). Namespace boundaries rely on Kubernetes RBAC, which is a software gate, not a hardware one. If a tenant can run arbitrary container code and you cannot trust that code, namespace isolation alone is insufficient.
  2. Operational leverage matters more than hard isolation. A single cluster means a single API server upgrade path, a single NVIDIA GPU Operator DaemonSet, one observability stack, and one node-pool autoscaler. The ops overhead per team scales sub-linearly instead of linearly.
  3. Burst sharing is a feature. If team A's training run finishes early and team B has a pending job, a shared cluster lets team B consume the slack. Separate clusters make that sharing explicit and expensive — you need a federation layer or manual node migration.
  4. You can enforce ResourceQuota and priority classes consistently. Namespace-per-tenant works when an ops team owns the quota allocation and can update it without each tenant having cluster-admin. [2]

Choose cluster-per-tenant when:

  1. Regulatory or contractual requirements demand hard isolation of compute or data residency. Financial services and healthcare contexts may require that training data for tenant A is physically separate from tenant B's compute boundary, not merely namespace-separated.
  2. A tenant requires privileged access to the cluster (cluster-admin, custom admission controllers, mutating webhooks that could affect other namespaces). Granting that access on a shared cluster undermines isolation for every other tenant.
  3. Blast-radius risk is unacceptable. A cluster-level misconfiguration — a broken admission webhook, a node kernel panic triggered by a specific workload — affects all tenants on a shared cluster. If tenant independence is worth the ops overhead, separate clusters enforce it structurally rather than by policy.
  4. The cluster's cost-per-tenant is already high enough that dedicated hardware is the natural unit. GPU clusters with expensive, scarce hardware favour sharing; CPU-only inference clusters may not. The Kubernetes blog's three-tenancy-model survey puts linear ops overhead as the defining cost of cluster-per-tenant. [3]

A hybrid approach (one shared cluster for trusted internal teams, separate clusters for externally-facing or regulated workloads) is a common practical outcome. Virtual clusters, which provision a lightweight API server per tenant inside an existing cluster, are a newer point on the spectrum: they offer stronger control-plane isolation than namespaces at lower overhead than fully separate clusters. They are worth evaluating when the namespace model becomes insufficient but the cluster-per-tenant cost is prohibitive.

Compute isolation: ResourceQuota and LimitRange

ResourceQuota is the primary compute-isolation primitive in a shared Kubernetes cluster. A quota object attached to a namespace sets aggregate ceilings on CPU requests and limits, memory requests and limits, GPU requests (via the extended resource key, e.g. nvidia.com/gpu), PVC storage, and object counts. When a new Pod or PersistentVolumeClaim would exceed the namespace quota, the API server rejects it with HTTP 403. [2] This makes quota violations explicit at submit time rather than at scheduling time.

LimitRange works at the container level rather than the namespace aggregate. It injects default resource requests and limits into pods that do not specify them, and enforces minimum and maximum values per container. [4] Without LimitRange, a developer who forgets to set resource requests effectively opts out of the scheduler's bin-packing logic — their pod may land on a node with insufficient headroom and starve other workloads, or it may be scheduled to a GPU node when it needs only CPU.
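
A LimitRange of the kind described above could be sketched as follows. The namespace name and every value are illustrative assumptions; size them against your own node shapes.

```yaml
# Illustrative LimitRange for one team namespace (values are assumptions).
apiVersion: v1
kind: LimitRange
metadata:
  name: team-alpha-defaults
  namespace: team-alpha
spec:
  limits:
  - type: Container
    defaultRequest:      # injected when a container sets no request
      cpu: 500m
      memory: 1Gi
    default:             # injected when a container sets no limit
      cpu: "1"
      memory: 2Gi
    min:                 # smallest values a container may declare
      cpu: 100m
      memory: 128Mi
    max:                 # largest values a container may declare
      cpu: "16"
      memory: 64Gi
```

The defaults matter most in combination with a quota: once a namespace has a ResourceQuota on CPU or memory, pods that declare no values for those resources are rejected unless a LimitRange supplies them.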

team-alpha-quota.yaml
# Illustrative ResourceQuota for one team namespace.
# Tune limits to your capacity model; these values are not universal defaults.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "40"
    limits.cpu: "80"
    requests.memory: 200Gi
    limits.memory: 400Gi
    requests.nvidia.com/gpu: "4"   # extended resources are quota'd via the requests. prefix only
    persistentvolumeclaims: "20"
    requests.storage: 2Ti

A common point of confusion is how requests and limits interact for GPU resources. GPUs are exposed as extended resources, which Kubernetes does not allow to be overcommitted: a container that sets only a GPU limit gets a request of the same value by default, a container that sets both must set them equal, and a GPU request without a limit is rejected. The scheduler then places the pod using that request. For the same reason, the ResourceQuota documentation allows only the requests. prefix for extended resources, so the quota above carries requests.nvidia.com/gpu and no limits.nvidia.com/gpu entry. [2]

Scheduling fairness: QoS classes and priority

Kubernetes assigns every pod to one of three QoS classes based on how its requests and limits are specified. Guaranteed pods set CPU and memory requests equal to limits for every container; Burstable pods do not meet the Guaranteed criteria but have at least one container with a CPU or memory request or limit; BestEffort pods specify no CPU or memory requests or limits at all. Under node memory pressure, the kubelet evicts BestEffort pods first, then Burstable, then Guaranteed. [5] For an AI platform, inference deployments that serve live traffic should be Guaranteed; batch training runs can tolerate Burstable; exploratory notebook sessions can run BestEffort if the user accepts interruption.
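
To make the Guaranteed case concrete, here is a minimal pod sketch in which every container sets CPU and memory requests equal to limits. The pod name, namespace, and image are placeholders, not real artefacts.

```yaml
# Guaranteed QoS: requests == limits for CPU and memory in every container.
apiVersion: v1
kind: Pod
metadata:
  name: model-server
  namespace: team-alpha
spec:
  containers:
  - name: server
    image: registry.example.com/team-alpha/model-server:1.0   # placeholder image
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
      limits:
        cpu: "2"
        memory: 8Gi
```

The class Kubernetes assigned is recorded in the pod's status.qosClass field, which is a quick way to check that a LimitRange default has not silently turned an intended Guaranteed pod into a Burstable one.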

PriorityClass assigns a numeric priority to pods; the scheduler uses it both to order pending pods in the queue and for preemption, that is, evicting lower-priority pods from nodes to make room for higher-priority ones. Pod priority and preemption have been stable in Kubernetes since v1.14. [6] On a shared AI platform, a sensible three-tier hierarchy maps to workload types rather than teams: production inference services at the top, scheduled training runs in the middle, and interactive experiments at the bottom. In the example below the top tier is non-preempting (preemptionPolicy: Never): its pods move ahead of lower-priority pods in the scheduling queue but never evict running pods. That field does not protect the inference pods themselves; they are safe from preemption only because no tenant workload class has a higher value.

priority-classes.yaml
# Three-tier PriorityClass hierarchy for an AI platform. PriorityClass objects are cluster-scoped.
# Adjust values to fit your cluster's existing system priority range.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-inference-production
value: 100000
preemptionPolicy: Never   # non-preempting: queues ahead of lower tiers but never evicts running pods
globalDefault: false
description: "Live inference services. Highest tenant priority; non-preempting."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-training-scheduled
value: 50000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Scheduled training jobs. May preempt interactive experiments."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-interactive
value: 10000
preemptionPolicy: PreemptLowerPriority
globalDefault: true
description: "Notebooks and ad-hoc experiments. Preemptible."

The noisy-neighbour failure mode in compute isolation almost always originates from a team that either ignores LimitRange defaults or submits a batch job with artificially high priority. Quota catches the aggregate; priority classes catch the scheduling queue. Both together enforce the boundary; neither alone is sufficient.
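
One way to connect the two mechanisms is a ResourceQuota scoped to a PriorityClass, which caps how much of a namespace's footprint may run at a given priority. The sketch below assumes the ai-training-scheduled class defined earlier; the object name and the numbers are illustrative.

```yaml
# Caps what team-alpha may run at the ai-training-scheduled priority.
# Only pods that reference that PriorityClass count against this quota.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-training-priority
  namespace: team-alpha
spec:
  hard:
    pods: "10"
    requests.cpu: "20"
    requests.memory: 100Gi
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["ai-training-scheduled"]
```

A scoped quota like this sits alongside the namespace-wide quota rather than replacing it: the aggregate ceiling still applies, and this object only narrows how much of it can be claimed at the higher priority.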

Network isolation: NetworkPolicy and service mesh

NetworkPolicy objects let you express ingress and egress rules for pods using label selectors and namespace selectors. They are enforced by the CNI plugin — not by kube-proxy or the API server. Not all CNI plugins implement NetworkPolicy; Flannel, for example, does not. Plugins that do include Cilium, Calico, and Weave Net. [7] This matters for platform selection: if your CNI does not enforce NetworkPolicy, the objects are parsed and accepted by the API server but have no runtime effect.

The default-deny pattern is the correct starting posture for a multi-tenant AI platform: apply a NetworkPolicy with an empty podSelector (matching all pods in the namespace), both policy types listed, and no ingress or egress rules. This blocks all traffic to and from every pod in the namespace, including traffic between pods in the same namespace and DNS lookups. You then layer allow-rules on top for the specific communication your workload requires: DNS resolution, internal service-to-service traffic, egress to an external model registry, ingress from an API gateway namespace.

default-deny.yaml
# Default-deny baseline for a tenant namespace.
# Apply this first; then add targeted allow policies on top.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-alpha
spec:
  podSelector: {}     # matches all pods in namespace
  policyTypes:
  - Ingress
  - Egress
  # No ingress or egress rules => all traffic blocked.
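
Because a default-deny policy that covers egress also blocks DNS, the first allow-rules usually restore name resolution and traffic inside the namespace. The sketch below assumes cluster DNS runs in kube-system with the conventional k8s-app: kube-dns pod label; verify both against your distribution before relying on it.

```yaml
# Allow pods in team-alpha to talk to each other.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector: {}     # any pod in this namespace
  egress:
  - to:
    - podSelector: {}     # any pod in this namespace
---
# Allow DNS lookups to the cluster DNS pods (labels are assumptions).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

NetworkPolicy rules are additive, so these objects widen the default-deny baseline without replacing it; anything not matched by an allow-rule stays blocked.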

NetworkPolicy operates at Layer 3/4 (IP and port). It does not give you cryptographic workload identity or per-request authorisation. A service mesh layer (Istio and Linkerd are two implementations) adds mutual TLS between workloads, with each workload's certificate tied to its service account. In Istio, those identities are SPIFFE-format, and AuthorizationPolicy rules can be scoped to a specific service account in a specific namespace. [8] For an AI platform where model servers handle sensitive outputs, the mesh's cryptographic identity is a meaningful control: it prevents a compromised pod in one namespace from silently impersonating a service account in another even if a NetworkPolicy gap exists.
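
As a sketch of identity-scoped authorisation in Istio, the policy below admits requests to model-server pods only from one service account in a gateway namespace. The namespace inference-gateway, the service account gateway, the app label, and the cluster.local trust domain are all assumptions; older Istio releases serve this API as security.istio.io/v1beta1.

```yaml
# Only the gateway's service account may call model-server pods in team-alpha.
# Matching on principals requires mutual TLS between the workloads.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-inference-gateway
  namespace: team-alpha
spec:
  selector:
    matchLabels:
      app: model-server
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/inference-gateway/sa/gateway"]
```

Once an ALLOW policy selects a workload, requests that match none of its rules are denied, so this acts as an identity-based counterpart to the default-deny NetworkPolicy.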

The operational cost of running a service mesh should not be underestimated: it adds a sidecar proxy to every pod (or a per-node proxy in sidecar-less designs such as Istio's ambient mode), adds more components to monitor, and requires careful certificate rotation. For strictly internal, trusted teams, NetworkPolicy alone may be sufficient. The service mesh becomes worth the overhead when workloads cross trust boundaries, for example when a shared inference gateway in one namespace routes to model servers owned by separate teams with independent security postures.

Namespace hierarchies and cross-namespace quota propagation

Flat namespaces present a management problem at scale: a team with five sub-projects needs five separate ResourceQuota objects, five separate RBAC bindings, and five separate NetworkPolicies — all maintained in lock-step. The Hierarchical Namespace Controller (HNC), a kubernetes-sigs project, addressed this by letting namespace trees propagate RBAC, NetworkPolicy, and ResourceQuota downward from a parent namespace to its children. HNC reached v1.1.0 in June 2023, and the project was archived in April 2025. [9]

HNC's archival means the pattern it implemented, hierarchical policy propagation, now has to be achieved by other means. The two practical paths are: (1) platform tooling that generates and reconciles namespace-level policies from a higher-level abstraction (a GitOps repo of team manifests, a custom operator, or a policy-management layer like Kyverno or OPA Gatekeeper used to propagate baseline policies); and (2) virtual clusters, which give each team its own Kubernetes API server while sharing the underlying node infrastructure, effectively replacing the propagation problem with a cleaner isolation boundary.
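
For the first path, a Kyverno generate rule is one possible shape: it creates a baseline object in every matching namespace and keeps it in sync. The sketch below stamps a default-deny NetworkPolicy into namespaces carrying a hypothetical platform.example.com/tenant label; the label and policy names are assumptions.

```yaml
# Generates a default-deny NetworkPolicy in each namespace labelled as a tenant.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: tenant-default-deny
spec:
  rules:
  - name: generate-default-deny
    match:
      any:
      - resources:
          kinds:
          - Namespace
          selector:
            matchLabels:
              platform.example.com/tenant: "true"   # hypothetical label
    generate:
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      name: default-deny-all
      namespace: "{{request.object.metadata.name}}"
      synchronize: true      # re-create or revert the object if it drifts
      data:
        spec:
          podSelector: {}
          policyTypes:
          - Ingress
          - Egress
```

This covers the flat-propagation case only. It does not reproduce HNC's parent-child tree, so sub-project groupings would still need to be expressed through labels or in the source repository.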

Operational note: if your platform was already using HNC v1.1.0, the project entering archived status means no new features or security patches. Evaluate whether the propagation pattern you relied on can be replicated by your policy layer before upgrading Kubernetes to a version where HNC compatibility breaks.

The GPU fairness problem: a preview

The compute isolation mechanics described above (ResourceQuota, LimitRange, PriorityClass) handle CPU and memory well. GPUs introduce a harder problem, because the default Kubernetes scheduler treats a GPU as an indivisible integer unit. A team with a quota of four GPUs that runs a four-GPU training job holds all four for the full duration of the job, even if actual utilisation is below 20% for most of the run.

Two compounding problems make GPU fairness distinct from CPU fairness:

  • GPU time-slicing shares a physical GPU between multiple pods with no memory isolation. One pod can exhaust GPU memory and cause CUDA out-of-memory errors in co-resident pods. [10] This makes time-slicing appropriate for inference but not for training jobs that make large, unpredictable memory allocations.
  • Distributed training jobs (e.g. multi-node, multi-GPU) require gang scheduling: all workers must be schedulable simultaneously or none should be scheduled. A partial allocation blocks the GPUs it holds without making progress, degrading fairness for every other tenant.
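
For orientation, time-slicing with the NVIDIA device plugin is typically switched on through a configuration like the sketch below, here wrapped in a ConfigMap for the GPU Operator. The ConfigMap name, the gpu-operator namespace, the any key, and the replica count are assumptions for illustration.

```yaml
# Advertises each physical GPU as 4 schedulable nvidia.com/gpu units.
# Time-slicing provides no memory isolation between the pods sharing a GPU.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

With a setting like this, a quota of requests.nvidia.com/gpu: "4" counts slices rather than physical devices, which is one reason quota arithmetic for shared GPUs needs the dedicated treatment the later articles give it.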

Addressing GPU fairness requires a specialised scheduling layer: one that understands gang semantics, can implement per-team GPU quotas that account for fractional sharing, and can provide queue-depth fairness across teams with heterogeneous job sizes. Part 7 of this series (beginning at article 28) covers that scheduling stack in detail, including fractional GPU mechanisms and gang-scheduling and queueing primitives (e.g. Volcano, Kueue, and related projects). The patterns described in this article (namespace isolation, ResourceQuota, and PriorityClass) are the necessary prerequisite layer, not the complete solution.

Observability across tenant boundaries

Multi-tenancy without per-tenant observability is hard to operate: when a fairness incident occurs, you need to identify quickly which namespace is consuming above quota, which queue priority class is being abused, and whether a network policy gap is generating unexpected cross-namespace traffic.

The minimal observable set for a multi-tenant AI platform namespace includes: CPU and memory request-versus-usage delta per namespace (detects poorly sized LimitRange defaults), GPU utilisation per namespace and per pod (exposes idle-hold patterns that ResourceQuota cannot see, because quota counts requested GPUs rather than used ones), and NetworkPolicy drop counts per namespace (surfaces misconfigured policies silently dropping traffic). Usage figures come from the kubelet's cAdvisor metrics, request and quota figures from kube-state-metrics, GPU counters from the DCGM exporter, and policy drop counters from the CNI plugin where it exposes them. All of these are plumbed into your metrics pipeline (e.g. Prometheus, or a managed equivalent) and visualised with namespace as a label dimension.
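
A Prometheus rule file along these lines could capture the first two signals plus a quota-pressure alert. It is a sketch that assumes kube-state-metrics, cAdvisor, and DCGM exporter metrics are already scraped, and that DCGM series carry a namespace label (this depends on the exporter's Kubernetes mapping and your relabelling). The rule names and thresholds are illustrative.

```yaml
groups:
- name: tenant-fairness
  rules:
  # CPU cores requested but not used, per namespace.
  - record: namespace:cpu_request_minus_usage:cores
    expr: |
      sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
      -
      sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  # Average GPU utilisation (percent) across the GPUs each namespace holds.
  - record: namespace:gpu_utilisation:avg
    expr: avg by (namespace) (DCGM_FI_DEV_GPU_UTIL)
  # Any quota'd resource above 90% of its hard limit for 15 minutes.
  - alert: TenantQuotaNearlyExhausted
    expr: |
      kube_resourcequota{type="used"}
        / ignoring(type)
      kube_resourcequota{type="hard"}
        > 0.9
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Quota {{ $labels.resource }} in {{ $labels.namespace }} is above 90%"
```

A persistently low namespace:gpu_utilisation:avg next to a fully consumed GPU quota is the idle-hold pattern described above.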

Summary: the isolation stack

A working multi-tenancy posture for a shared AI platform combines all four isolation dimensions. Control-plane isolation (RBAC with namespace-scoped roles) prevents tenant A from reading or mutating tenant B's objects. Compute isolation (ResourceQuota + LimitRange + PriorityClass) prevents one team from crowding out another in the scheduler queue or exhausting node resources. Network isolation (default-deny NetworkPolicy, CNI enforcement, optional service mesh for cryptographic identity) prevents lateral movement between tenant namespaces. Storage isolation (namespace-scoped PVCs, StorageClass access control) prevents cross-tenant data access at the storage layer.

None of these mechanisms is optional if you intend the platform to be shared in a meaningful sense. They are also layered: a gap in one dimension — an overly permissive NetworkPolicy, a missing LimitRange — does not invalidate the others, but it does create a surface that an over-allocated or misconfigured workload will eventually exploit unintentionally. The noisy-neighbour problem on an AI platform is almost always accidental rather than adversarial; the defence is comprehensive policy coverage, not paranoia.

References

  1. Kubernetes Project. "Multi-tenancy." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/security/multi-tenancy/
  2. Kubernetes Project. "Resource Quotas." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/policy/resource-quotas/
  3. Kubernetes Blog. "Three Tenancy Models For Kubernetes." April 2021. https://kubernetes.io/blog/2021/04/15/three-tenancy-models-for-kubernetes/
  4. Kubernetes Project. "Limit Ranges." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/policy/limit-range/
  5. Kubernetes Project. "Quality of Service for Pods." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/
  6. Kubernetes Project. "Pod Priority and Preemption." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/
  7. Kubernetes Project. "Network Policies." Kubernetes Documentation, 2024. https://kubernetes.io/docs/concepts/services-networking/network-policies/
  8. Istio Project. "Mutual TLS Migration." Istio Documentation, 2024. https://istio.io/latest/docs/tasks/security/authentication/mtls-migration/
  9. kubernetes-sigs. "Hierarchical Namespace Controller (HNC)." GitHub, last release v1.1.0 June 2023, archived April 2025. https://github.com/kubernetes-sigs/hierarchical-namespaces
  10. Loft Labs (vCluster). "GPU Multitenancy in Kubernetes: Strategies and Solutions." 2024. https://www.vcluster.com/blog/gpu-multitenancy-kubernetes-strategies

Tags

#multi-tenancy #fairness #kubernetes #series:ai-platform-mlops #series-order/24

About the Author

asleekgeek

Senior Developer, Architect, DevOps

Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist
