AI Platform Engineering & MLOps · Part XXIV of 34

Multi-tenancy on a shared AI platform — quotas, fairness, and the noisy-neighbour problem

How to share a Kubernetes-based AI platform across teams without letting one tenant starve the others — covering isolation models, quota mechanics, priority classes, and network boundaries.

12 min read · 2 interactive components · 10 references

[Figure legend: team namespace (quota bar); quota exceeded (noisy neighbour); active GPU node; idle GPU node]

A shared AI platform is only useful if the teams sharing it can trust it. That trust has two components: isolation — your workloads are not silently affected by mine — and fairness — I cannot grab all the GPUs while you are waiting in queue. Getting both right on a single Kubernetes cluster is harder than it looks, and the failure modes are specific enough that each deserves a name.

This article maps the four isolation dimensions that matter on an AI platform, explains the Kubernetes primitives available for each, and works through the decision rule for the two primary tenancy models. GPU-specific fairness — fractional scheduling, gang scheduling, and queue-depth management — is a sufficiently deep topic that it gets its own treatment in Part 7 (article 28 onward); this article frames the problem and hands off cleanly.

The four isolation dimensions

Multi-tenancy on Kubernetes is not a single knob. It is the intersection of four independent isolation axes, and a gap in any one of them can undermine the others:

Compute isolation: one tenant's pods cannot exhaust CPU, memory, or GPU capacity at the expense of another. Enforced by ResourceQuota, LimitRange, and priority classes.
Network isolation: tenant A's pods cannot reach tenant B's pods unless an explicit policy permits it. Enforced at Layer 3/4 by NetworkPolicy objects (CNI-dependent) and at Layer 7 by a service mesh.
Storage isolation: PersistentVolumeClaims are namespace-scoped and not accessible across tenants; StorageClass selection determines the underlying provisioner and its own access-control model.
Control-plane isolation: a tenant's service accounts cannot list or mutate objects in other tenants' namespaces. Enforced by RBAC with namespace-scoped Role and RoleBinding objects. The Kubernetes multi-tenancy documentation singles out authorization as the most important type of isolation for the control plane.

The official Kubernetes multi-tenancy documentation distinguishes between “soft” tenancy — trusting teams who share an organisation — and “hard” tenancy — untrusting external customers. [1] An AI platform serving internal ML teams sits squarely in soft-tenancy territory, but that does not mean isolation is optional: a misconfigured training job can saturate cluster networking or exhaust etcd watch budget just as readily as a hostile tenant.
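As a minimal sketch of the control-plane dimension, the manifests below bind a namespace-scoped Role to one team. The namespace team-alpha matches the examples used elsewhere in this article; the Role name, the group name team-alpha-engineers, and the resource and verb lists are illustrative assumptions, and how group membership is asserted depends on your identity provider.

```yaml
# Hypothetical Role: lets team-alpha manage its own workloads and nothing cluster-wide.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-workload-editor
  namespace: team-alpha
rules:
- apiGroups: ["", "apps", "batch"]
  resources: ["pods", "pods/log", "services", "configmaps", "deployments", "jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# RoleBinding is namespaced too, so the grant stops at the team-alpha boundary.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-alpha-editors
  namespace: team-alpha
subjects:
- kind: Group
  name: team-alpha-engineers        # assumed group name from your identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-workload-editor
  apiGroup: rbac.authorization.k8s.io
```

Because both objects are namespaced, this binding grants nothing in any other tenant's namespace. The thing to avoid for tenant subjects is a ClusterRoleBinding, which would apply the referenced permissions in every namespace.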

Tenancy model decision rule: namespace-per-tenant vs cluster-per-tenant

The two dominant tenancy models for an internal AI platform are namespace isolation (multiple teams share one cluster, separated by namespaces) and cluster isolation (each major tenant owns a dedicated cluster, joined by a shared control plane or federation layer). Neither is universally correct; the choice follows from at least four criteria.

Choose namespace-per-tenant when:

1. Teams are mutually trusted (same organisation, same threat model). Namespace boundaries rely on Kubernetes RBAC, which is a software gate, not a hardware one. If a tenant can run arbitrary container code and you cannot trust that code, namespace isolation alone is insufficient.
2. Operational leverage matters more than hard isolation. A single cluster means a single API server upgrade path, a single NVIDIA GPU Operator DaemonSet, one observability stack, and one node-pool autoscaler. The ops overhead per team scales sub-linearly instead of linearly.
3. Burst sharing is a feature. If team A's training run finishes early and team B has a pending job, a shared cluster lets team B consume the slack. Separate clusters make that sharing explicit and expensive: you need a federation layer or manual node migration.
4. You can enforce ResourceQuota and priority classes consistently. Namespace-per-tenant works when an ops team owns the quota allocation and can update it without each tenant having cluster-admin. [2]

Choose cluster-per-tenant when:

1. Regulatory or contractual requirements demand hard isolation of compute or data residency. Financial services and healthcare contexts may require that training data for tenant A is physically separate from tenant B's compute boundary, not merely namespace-separated.
2. A tenant requires privileged access to the cluster (cluster-admin, custom admission controllers, mutating webhooks that could affect other namespaces). Granting that access on a shared cluster undermines isolation for every other tenant.
3. Blast-radius risk is unacceptable. A cluster-level misconfiguration (a broken admission webhook, a node kernel panic triggered by a specific workload) affects all tenants on a shared cluster. If tenant independence is worth the ops overhead, separate clusters enforce it structurally rather than by policy.
4. The cluster’s cost-per-tenant is already high enough that dedicated hardware is the natural unit. GPU clusters with expensive, scarce hardware favour sharing; CPU-only inference clusters may not. The Kubernetes blog’s three-tenancy-model survey puts linear ops overhead as the defining cost of cluster-per-tenant. [3]

A hybrid approach (one shared cluster for trusted internal teams, separate clusters for externally-facing or regulated workloads) is a common practical outcome. Virtual clusters, which run a lightweight API server per tenant inside an existing host cluster, are a newer point on the spectrum: they offer stronger control-plane isolation than namespaces at lower overhead than fully separate clusters, while still sharing the host's nodes. They are worth evaluating when the namespace model becomes insufficient but the cluster-per-tenant cost is prohibitive.

Compute isolation: ResourceQuota and LimitRange

ResourceQuota is the primary compute-isolation primitive in a shared Kubernetes cluster. A quota object attached to a namespace sets aggregate ceilings on CPU requests and limits, memory requests and limits, GPU requests (via the extended resource key, e.g. requests.nvidia.com/gpu), PVC storage, and object counts. When a new Pod or PersistentVolumeClaim would exceed the namespace quota, the API server rejects it with HTTP 403. [2] Because the check runs in the admission phase of the create request, quota violations surface at submit time rather than later, at scheduling time or at runtime.

LimitRange works at the container level rather than the namespace aggregate. It injects default resource requests and limits into pods that do not specify them, and enforces minimum and maximum values per container. [4] The two primitives interact: once a quota covers CPU or memory, a pod that omits those values can be rejected outright, so LimitRange defaults are what keep a forgetful submission admissible. In a namespace with neither control, a developer who forgets to set resource requests effectively opts out of the scheduler’s bin-packing logic: the pod may land on a node with insufficient headroom and starve other workloads, or it may be scheduled to a GPU node when it needs only CPU.

team-alpha-quota.yaml

# Illustrative ResourceQuota for one team namespace.
# Tune limits to your capacity model; these values are not universal defaults.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "40"
    limits.cpu: "80"
    requests.memory: 200Gi
    limits.memory: 400Gi
    requests.nvidia.com/gpu: "4"   # extended resources accept only the requests. prefix in a quota
    persistentvolumeclaims: "20"
    requests.storage: 2Ti

A common point of confusion is how requests and limits interact for GPUs. Extended resources such as nvidia.com/gpu cannot be overcommitted: a container may set a GPU limit alone (Kubernetes then uses the limit as the request) or set both, in which case the two values must be equal; a GPU request without a limit is rejected. For the same reason, ResourceQuota accepts only the requests. prefix for extended resources, so the namespace ceiling is expressed as requests.nvidia.com/gpu and a limits.nvidia.com/gpu entry has no place in the quota.
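The LimitRange behaviour described earlier in this section can be sketched as a companion manifest for the same namespace. The object name and every value here are illustrative assumptions, not recommended defaults; size them from observed usage in your own cluster.

```yaml
# Illustrative LimitRange for the team-alpha namespace (values are assumptions).
apiVersion: v1
kind: LimitRange
metadata:
  name: team-alpha-defaults
  namespace: team-alpha
spec:
  limits:
  - type: Container
    defaultRequest:        # injected when a container sets no request
      cpu: 500m
      memory: 1Gi
    default:               # injected when a container sets no limit
      cpu: "2"
      memory: 4Gi
    min:                   # smallest values a container may declare
      cpu: 100m
      memory: 128Mi
    max:                   # largest values a container may declare
      cpu: "16"
      memory: 64Gi
```

With this in place, a pod submitted without CPU or memory values is admitted with the defaults and counted against the namespace quota, while a container asking for more than the max is rejected at submit time.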

The policy builder below lets you enable isolation controls across all four dimensions and immediately see a readiness score. Click any YAML button to preview the corresponding manifest.

Isolation Policy Builder

[Interactive component: toggles isolation controls in four groups and reports an isolation readiness score, starting at 0% (Minimal). Compute and network isolation are the recommended minimum for any shared cluster.]

Compute Isolation: prevents a tenant from exhausting CPU, memory, or GPU capacity.
Scheduling Fairness: controls preemption priority and QoS class assignment.
Network Isolation: prevents lateral movement between tenant namespaces.
Control-Plane Isolation: prevents tenant A from reading or mutating tenant B's objects.

Scheduling fairness: QoS classes and priority

Kubernetes assigns every pod to one of three QoS classes based on how its requests and limits are specified. Guaranteed pods set CPU and memory requests equal to limits for every container; BestEffort pods specify no CPU or memory requests or limits at all; Burstable pods are everything in between, meaning at least one container sets a CPU or memory request or limit but the pod does not meet the Guaranteed criteria. Under node memory pressure, the kubelet generally evicts BestEffort pods first, then Burstable pods whose usage exceeds their requests, and Guaranteed pods last. [5] For an AI platform, inference deployments that serve live traffic should be Guaranteed; batch training runs can tolerate Burstable; exploratory notebook sessions can run BestEffort if the user accepts interruption.
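As a small illustration of the Guaranteed class, the pod below sets requests equal to limits for CPU and memory in its only container. The pod name, image reference, and sizes are hypothetical.

```yaml
# Hypothetical inference pod that lands in the Guaranteed QoS class.
apiVersion: v1
kind: Pod
metadata:
  name: model-server-example
  namespace: team-alpha
spec:
  containers:
  - name: server
    image: registry.example.com/team-alpha/model-server:1.0   # assumed image
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
        nvidia.com/gpu: "1"
      limits:
        cpu: "4"              # equal to the request
        memory: 16Gi          # equal to the request
        nvidia.com/gpu: "1"   # extended resources must always match
```

Lowering either request below its limit would move the pod to Burstable, and removing the CPU and memory entries altogether would make it BestEffort. The assigned class can be read back from the pod's status.qosClass field.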

PriorityClass assigns a numeric priority to pods; the scheduler uses it both for ordering pending pods in the queue and for preemption, that is, evicting lower-priority pods from nodes to make room for higher-priority ones. PriorityClass with preemption has been stable in Kubernetes since v1.14. [6] On a shared AI platform, a sensible three-tier hierarchy maps to workload types rather than teams: production inference services at the top, scheduled training runs in the middle, and interactive experiments at the bottom. Note the direction of preemptionPolicy: it controls whether pods of a class may preempt others, not whether they can be preempted. Inference pods are protected from preemption by holding the highest value; setting preemptionPolicy: Never on that tier additionally means a pending inference pod waits for capacity instead of evicting running training jobs.

priority-classes.yaml

# Three-tier PriorityClass hierarchy for a shared AI platform (PriorityClass is cluster-scoped).
# Adjust values to fit your cluster's existing system priority range.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-inference-production
value: 100000             # highest tier: lower tiers cannot preempt these pods
preemptionPolicy: Never   # pending inference pods do not evict others
globalDefault: false
description: "Live inference services. Highest priority; never preempts other pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-training-scheduled
value: 50000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Scheduled training jobs. May preempt interactive experiments."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ai-interactive
value: 10000
preemptionPolicy: PreemptLowerPriority
globalDefault: true   # applies to any pod without a priorityClassName; only one class may set this
description: "Notebooks and ad-hoc experiments. Preemptible."

The noisy-neighbour failure mode in compute isolation almost always originates from a team whose pods run without sensible requests and limits (typically because the namespace has no LimitRange defaults) or from a batch job submitted with an artificially high priority. Quota caps the aggregate a namespace can claim; priority classes order the scheduling queue and decide who can be preempted. Both together enforce the boundary; neither alone is sufficient.
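One way to keep a team from submitting everything at the top tier is to scope a ResourceQuota to a priority class, which Kubernetes supports through scopeSelector. The sketch below caps how many pods in team-alpha may run at ai-inference-production; the object name and the numbers are assumptions.

```yaml
# Illustrative quota that applies only to pods using the top priority class.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-inference-priority
  namespace: team-alpha
spec:
  hard:
    pods: "10"                      # at most 10 pods at this priority (assumed figure)
    requests.nvidia.com/gpu: "2"    # and at most 2 GPUs requested at this priority
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["ai-inference-production"]
```

Setting pods to "0" in such a quota blocks the class in that namespace entirely. Note that this object only constrains the namespace it lives in; a namespace without a matching quota is not restricted by it.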

Network isolation: NetworkPolicy and service mesh

NetworkPolicy objects let you express ingress and egress rules for pods using label selectors and namespace selectors. They are enforced by the CNI plugin — not by kube-proxy or the API server. Not all CNI plugins implement NetworkPolicy; Flannel, for example, does not. Plugins that do include Cilium, Calico, and Weave Net. [7] This matters for platform selection: if your CNI does not enforce NetworkPolicy, the objects are parsed and accepted by the API server but have no runtime effect.

The default-deny pattern is the correct starting posture for a multi-tenant AI platform: apply a NetworkPolicy with an empty podSelector (matching all pods in the namespace), both Ingress and Egress listed under policyTypes, and no ingress or egress rules. This blocks all traffic to and from every pod in the namespace, including traffic between pods in the same namespace and DNS lookups. You then layer allow-rules on top for the specific communication your workload requires: DNS resolution, internal service-to-service calls, egress to an external model registry, ingress from an API gateway namespace.

default-deny.yaml

# Default-deny baseline for a tenant namespace.
# Apply this first; then add targeted allow policies on top.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-alpha
spec:
  podSelector: {}     # matches all pods in namespace
  policyTypes:
  - Ingress
  - Egress
  # No ingress or egress rules => all traffic blocked.
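
Two allow-rules that are commonly layered on top of this baseline can be sketched as follows. The kubernetes.io/metadata.name label is set automatically on namespaces by Kubernetes; the k8s-app: kube-dns pod label, the api-gateway namespace, the app: model-server label, and port 8080 are assumptions to adapt to your cluster.

```yaml
# Allow every pod in team-alpha to resolve DNS (cluster DNS labels are assumed).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: team-alpha
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:              # same list item: both selectors must match
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
---
# Allow the (hypothetical) api-gateway namespace to reach model servers only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-from-gateway
  namespace: team-alpha
spec:
  podSelector:
    matchLabels:
      app: model-server
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: api-gateway
    ports:
    - protocol: TCP
      port: 8080
```

NetworkPolicies are additive: a connection is permitted if any policy selecting the pod allows it, so these objects open narrow paths without weakening the default-deny for everything else.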

NetworkPolicy operates at Layer 3/4 (IP and port). It does not give you cryptographic workload identity or per-request authorisation. A service mesh layer such as Istio or Linkerd adds mutual TLS between workloads, with each workload's certificate tied to its service account. In Istio, those certificates carry SPIFFE-format identities, so AuthorizationPolicy rules can be scoped to a specific service account in a specific namespace. [8] For an AI platform where model servers handle sensitive outputs, the mesh’s cryptographic identity is a meaningful control: a compromised pod in one namespace cannot silently impersonate a service account in another, even if a NetworkPolicy gap exists.

The operational cost of running a service mesh should not be underestimated: it adds a sidecar proxy to every pod (or, in sidecar-less designs such as Istio's ambient mode, shared per-node proxies), introduces more components to monitor and upgrade, and requires careful certificate rotation. For strictly internal, trusted teams, NetworkPolicy alone may be sufficient. The service mesh becomes worth the overhead when workloads cross trust boundaries, for example when a shared inference gateway in one namespace routes to model servers owned by separate teams with independent security postures.
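For the shared-gateway case, an Istio-specific sketch might look like the following. It assumes Istio is installed with the default cluster.local trust domain, that the gateway runs under a service account named inference-gateway in an api-gateway namespace, and that model servers carry the label app: model-server; all three names are hypothetical. Istio releases before 1.22 serve these resources as security.istio.io/v1beta1.

```yaml
# Require mutual TLS for every workload in the tenant namespace.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: team-alpha
spec:
  mtls:
    mode: STRICT
---
# Only the gateway's service account identity may call the model servers.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-gateway-to-model-server
  namespace: team-alpha
spec:
  selector:
    matchLabels:
      app: model-server
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/api-gateway/sa/inference-gateway"]
```

Once an ALLOW policy selects a workload, requests that match none of its rules are denied, so the model servers reject callers whose certificate identity is anything other than the listed principal.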

Namespace hierarchies and cross-namespace quota propagation

Flat namespaces present a management problem at scale: a team with five sub-projects needs five separate ResourceQuota objects, five separate RBAC bindings, and five separate NetworkPolicies — all maintained in lock-step. The Hierarchical Namespace Controller (HNC), a kubernetes-sigs project, addressed this by letting namespace trees propagate RBAC, NetworkPolicy, and ResourceQuota downward from a parent namespace to its children. HNC reached v1.1.0 in June 2023, and the project was archived in April 2025. [9]

HNC’s archival means the pattern it implemented, hierarchical policy propagation, now needs to be reached via alternatives. The two practical paths are: (1) platform tooling that generates and reconciles namespace-level policies from a higher-level abstraction (a GitOps repo of team manifests, a custom operator, or a policy-management layer like Kyverno or OPA Gatekeeper used to propagate baseline policies); and (2) virtual clusters, which give each team its own Kubernetes API server while sharing the underlying node infrastructure, effectively replacing the propagation problem with a cleaner isolation boundary.
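As one possible shape for the first path, a Kyverno generate rule can stamp a baseline policy into every tenant namespace. This is a sketch under assumptions: the policy name and the tenant label platform.example.com/tenant are hypothetical, and Kyverno's controllers need RBAC permission to create NetworkPolicy objects for it to take effect.

```yaml
# Sketch: generate a default-deny NetworkPolicy in each labelled tenant namespace.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: tenant-default-deny
spec:
  rules:
  - name: generate-default-deny
    match:
      any:
      - resources:
          kinds:
          - Namespace
          selector:
            matchLabels:
              platform.example.com/tenant: "true"   # assumed tenant marker label
    generate:
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      name: default-deny-all
      namespace: "{{request.object.metadata.name}}"
      synchronize: true          # keep the generated object in line with this policy
      data:
        spec:
          podSelector: {}
          policyTypes:
          - Ingress
          - Egress
```

The same pattern extends to LimitRange and ResourceQuota baselines, which covers most of what a parent namespace used to propagate to its children.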

Operational note: if your platform was already using HNC v1.1.0, the project entering archived status means no new features or security patches. Evaluate whether the propagation pattern you relied on can be replicated by your policy layer before upgrading Kubernetes to a version where HNC compatibility breaks.

The GPU fairness problem: a preview

The compute isolation mechanics described above (ResourceQuota, LimitRange, PriorityClass) handle CPU and memory well. GPUs introduce a harder problem, because the default Kubernetes scheduler treats a GPU as an indivisible integer unit. A team with a quota of four GPUs that runs a four-GPU training job holds all four for the lifetime of the job, even if actual utilisation is below 20% for most of the run.

Two compounding problems make GPU fairness distinct from CPU fairness:

GPU time-slicing shares a physical GPU between multiple pods with no memory isolation. One pod can exhaust GPU memory and cause CUDA out-of-memory errors in co-resident pods. [10] This makes time-slicing appropriate for inference but not for training jobs that make large, unpredictable memory allocations.
Distributed training jobs (e.g. multi-node, multi-GPU) require gang scheduling: all workers must be schedulable simultaneously or none should be scheduled. A partial allocation blocks the GPUs it holds without making progress, degrading fairness for every other tenant.
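To make the time-slicing point concrete, the fragment below follows the sharing configuration format used by the NVIDIA device plugin, wrapped in a ConfigMap as the GPU Operator expects. The ConfigMap name, the gpu-operator namespace, the data key, and the replica count are assumptions, and the operator must separately be pointed at this ConfigMap.

```yaml
# Sketch: advertise each physical GPU as four schedulable nvidia.com/gpu units.
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config     # assumed name
  namespace: gpu-operator       # assumed install namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4           # four pods share one GPU, with no memory isolation
```

A side effect worth noting for quota design: once replicas are advertised, a requests.nvidia.com/gpu quota of "4" counts four slices rather than four physical GPUs, so namespace ceilings need to be recalculated when time-slicing is enabled.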

Addressing GPU fairness requires a specialised scheduler layer — one that understands gang semantics, can implement per-team GPU quotas that account for fractional sharing, and can provide queue-depth fairness across teams with heterogeneous job sizes. Part 7 of this series (beginning at article 28) covers that scheduling stack in detail, including fractional GPU mechanisms and gang-scheduling primitives (e.g. Volcano, Kueue, and related schedulers). The patterns described in this article — namespace isolation, ResourceQuota, and PriorityClass — are the necessary prerequisite layer, not the complete solution.

The simulator below lets you observe what happens when four teams share a GPU pool with no policies in place, then progressively enable ResourceQuota, PriorityClass, and Kueue fair-share to see how average queue wait times change.

Noisy-Neighbour Simulator

[Interactive component: four teams share a pool of 8 GPUs, and toggling fairness policies changes their queue wait times. Team α (training, medium priority) submits large 4-GPU jobs; Team β (fine-tune, medium priority), Team γ (inference, high priority), and Team δ (notebook, low priority) submit smaller jobs. For each team the simulator reports GPUs in use, jobs running, jobs queued, jobs done, and average wait in ticks.]

Tip: run without policies first, then enable ResourceQuota to cap Team α, then enable Fair-share to observe queue-wait equalisation.

Observability across tenant boundaries

Multi-tenancy without per-tenant observability is hard to operate: when a fairness incident occurs, you need to identify quickly which namespace is running at or near its quota, which priority class is being abused, and whether a network policy gap is allowing unexpected cross-namespace traffic.

The minimal observable set for a multi-tenant AI platform namespace includes: CPU and memory request-versus-usage delta per namespace (detects mis-sized LimitRange defaults), GPU utilisation per namespace and per pod (exposes idle-hold patterns that ResourceQuota cannot see, because quota counts requested GPUs rather than utilised ones), and NetworkPolicy drop counts per namespace (surfaces misconfigured policies silently dropping traffic). Usage figures typically come from the kubelet's cAdvisor metrics, request and quota figures from kube-state-metrics, GPU counters from the DCGM exporter, and policy drop counts from the CNI plugin where it exposes them; metrics-server serves kubectl top and autoscaling rather than long-term monitoring. All of these are plumbed into your metrics pipeline (e.g. Prometheus, or a managed equivalent) and visualised with namespace as a label dimension.
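A minimal sketch of these signals as Prometheus recording rules is shown below. It assumes kube-state-metrics, the kubelet's cAdvisor endpoint, and the DCGM exporter are all scraped; the group and rule names are hypothetical, and depending on your scrape configuration the DCGM namespace label may appear as exported_namespace instead of namespace.

```yaml
# Sketch of per-namespace recording rules (metric availability is an assumption).
groups:
- name: tenant-isolation
  rules:
  # CPU cores requested but not used, per namespace.
  - record: namespace:cpu_request_minus_usage:cores
    expr: |
      sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
      -
      sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  # Mean GPU utilisation (percent) across the GPUs attributed to each namespace.
  - record: namespace:gpu_utilisation:avg
    expr: avg by (namespace) (DCGM_FI_DEV_GPU_UTIL)
  # Fraction of each quota entry currently consumed.
  - record: namespace:resourcequota_used:ratio
    expr: |
      kube_resourcequota{type="used"}
      / ignoring (type)
      kube_resourcequota{type="hard"}
```

A namespace whose GPU quota ratio sits near 1 while its average GPU utilisation stays low is the idle-hold pattern described above, and is a natural first alert to build on these series.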

Summary: the isolation stack

A working multi-tenancy posture for a shared AI platform combines all four isolation dimensions. Control-plane isolation (RBAC with namespace-scoped roles) prevents tenant A from reading or mutating tenant B’s objects. Compute isolation (ResourceQuota + LimitRange + PriorityClass) prevents one team from crowding out another in the scheduler queue or exhausting node resources. Network isolation (default-deny NetworkPolicy, CNI enforcement, optional service mesh for cryptographic identity) prevents lateral movement between tenant namespaces. Storage isolation (namespace-scoped PVCs, StorageClass access control) prevents cross-tenant data access at the storage layer.

None of these mechanisms is optional if you intend the platform to be shared in a meaningful sense. They are also layered: a gap in one dimension — an overly permissive NetworkPolicy, a missing LimitRange — does not invalidate the others, but it does create a surface that an over-allocated or misconfigured workload will eventually exploit unintentionally. The noisy-neighbour problem on an AI platform is almost always accidental rather than adversarial; the defence is comprehensive policy coverage, not paranoia.

Dimension	Primitive	Scope	Gap failure mode
Compute	ResourceQuota + LimitRange + PriorityClass	Namespace	One team monopolises the GPU pool; training OOM evicts inference pods
Network	NetworkPolicy (L3/4) + service mesh (L7 mTLS)	Namespace / pod	Lateral movement; compromised pod reaches other namespaces
Storage	Namespace-scoped PVCs + StorageClass ACLs	Namespace / PVC	Cross-tenant training data access; data leakage between model registries
Control-plane	RBAC (Role + RoleBinding, namespace-scoped)	Namespace	Tenant A deletes tenant B's jobs; service account token escalation

References

[1] Kubernetes Project. "Multi-tenancy." Kubernetes Documentation, 2024. kubernetes.io/docs/concepts/security/multi-tenancy
[2] Kubernetes Project. "Resource Quotas." Kubernetes Documentation, 2024. kubernetes.io/docs/concepts/policy/resource-quotas
[3] Kubernetes Blog. "Three Tenancy Models For Kubernetes." April 2021. kubernetes.io/blog/2021/04/15/three-tenancy-models-for-kubernetes
[4] Kubernetes Project. "Limit Ranges." Kubernetes Documentation, 2024. kubernetes.io/docs/concepts/policy/limit-range
[5] Kubernetes Project. "Quality of Service for Pods." Kubernetes Documentation, 2024. kubernetes.io/docs/concepts/workloads/pods/pod-qos
[6] Kubernetes Project. "Pod Priority and Preemption." Kubernetes Documentation, 2024. kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption
[7] Kubernetes Project. "Network Policies." Kubernetes Documentation, 2024. kubernetes.io/docs/concepts/services-networking/network-policies
[8] Istio Project. "Mutual TLS Migration." Istio Documentation, 2024. istio.io/latest/docs/tasks/security/authentication/mtls-migration
[9] kubernetes-sigs. "Hierarchical Namespace Controller (HNC)." GitHub, last release v1.1.0 June 2023, archived April 2025. github.com/kubernetes-sigs/hierarchical-namespaces
[10] Loft Labs (vCluster). "GPU Multitenancy in Kubernetes: Strategies and Solutions." 2024. vcluster.com/blog/gpu-multitenancy-kubernetes-strategies
