The AI gateway — what it is, when you need one, and where it sits

An AI gateway is a reverse proxy that sits between every application and every upstream LLM endpoint. It is the single control point for API-key management, spend enforcement, semantic caching, routing and failover, guardrails, and the OpenTelemetry GenAI observability tap. Without one, each team manages its own keys, enforces its own rate limits, and implements its own fallback logic — producing key sprawl, invisible spend until the monthly invoice, and fragmented traces that stop at the application boundary.
This article covers the platform-level decision: which gateway pattern fits your deployment posture, how the six core gateway capabilities map across four representative OSS and managed options, and where the gateway boundary ends and the MCP tool registry begins.
Why the gateway is a platform primitive, not an application concern
When application code calls AI APIs directly, key management and policy enforcement fall to each team individually. The compounding costs are predictable:
- Key sprawl: provider credentials stored in multiple namespaces, secret stores, and CI pipelines with no central revocation path.
- Spend invisibility: no cross-service token accounting until the cloud bill arrives at month end.
- Duplicated guardrail logic: each team ships its own PII filter or jailbreak detector, with no shared policy and divergent coverage.
- Fragmented observability: traces stop at the application boundary, so retry events, cache hits, and routing decisions are invisible to the platform team.
Centralising these concerns at the gateway lets application teams focus on prompt engineering and evaluation rather than infrastructure plumbing. The gateway is the LLM-traffic equivalent of a service mesh control plane: invisible to the happy path, essential on the failure path.
Decision tree: when you need a gateway versus a thin proxy
Not every deployment needs a full gateway. Work through these questions in order.
- Do you have more than one team or service calling LLMs? If yes, you need at minimum a shared key-management layer — and a gateway provides that with spend isolation per virtual key.
- Do you need spend enforcement before the invoice? A thin reverse proxy (e.g. a simple nginx rule) cannot track token counts. You need a gateway that understands the LLM API response body.
- Do you have data-residency or egress constraints? On-prem workloads with air-gap requirements point directly to a self-hosted, open-source gateway — managed SaaS options are off the table regardless of their feature richness.
- Do you already run an API gateway for REST/gRPC traffic? If yes, an AI plugin layer on the existing gateway avoids a second control plane. If no, a purpose-built LLM proxy is typically lower friction to start.
- Is your call volume highly repetitive (e.g. a help-desk bot with a bounded question set)? Semantic caching becomes cost-justified at moderate repetition rates — published benchmarks report hit rates up to 68.8% at an optimal cosine-similarity threshold of 0.8 [1]. A thin proxy cannot do this.
If all five answers are "no," a thin reverse proxy or direct SDK calls may be adequate for an early prototype. For anything in production with more than one team, a dedicated gateway component is the right platform investment.
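The questions above can be sketched as a small function. This is only an illustration of the decision order described in this section: the `Deployment` fields and the returned labels are names invented for the example, and a real decision would weigh more factors than five booleans.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Answers to the five questions above (field names are illustrative)."""
    multiple_callers: bool         # more than one team or service calls LLMs
    needs_spend_enforcement: bool  # budgets must hold before the invoice arrives
    egress_constrained: bool       # data-residency or air-gap constraints
    has_api_gateway: bool          # an API gateway already fronts REST/gRPC traffic
    repetitive_traffic: bool       # bounded, repetitive prompt set


def recommend(d: Deployment) -> str:
    """Return a coarse recommendation following the question order above."""
    answers = (
        d.multiple_callers,
        d.needs_spend_enforcement,
        d.egress_constrained,
        d.has_api_gateway,
        d.repetitive_traffic,
    )
    if not any(answers):
        # All five answers are "no": adequate for an early prototype.
        return "thin proxy or direct SDK calls"
    if d.egress_constrained:
        # Egress constraints rule out managed SaaS before anything else.
        return "self-hosted open-source gateway"
    if d.has_api_gateway:
        return "AI plugin layer on the existing API gateway"
    return "purpose-built LLM proxy"


prototype = Deployment(False, False, False, False, False)
help_desk = Deployment(True, True, False, False, True)
print(recommend(prototype))
print(recommend(help_desk))
```

Checking the egress constraint first reflects the point made above: it removes options from the table regardless of their feature set, so it should short-circuit the other answers.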
Gateway comparison: four options across six capabilities
The table below covers four representative gateways across the six capabilities a platform team needs. All four expose an OpenAI-compatible REST surface; the differences are operational and organisational.
Capabilities rated: Key mgmt = API key virtualisation and rotation; Spend = per-key budget cap and alerting; Caching = semantic cache with configurable similarity threshold; Routing = priority-ordered provider list with automatic failover; Guardrails = built-in input/output policy pipeline; OTel = native OpenTelemetry GenAI span export.
| Gateway | Key mgmt | Spend | Caching | Routing | Guardrails | OTel | Self-host? | Licence |
|-------------------|----------|-------|---------|---------|------------|-------|------------|----------------------|
| Kong AI Gateway | ✓ | ✓ | ✓ plugin| ✓ | ✓ plugin | ✓ | Yes | Apache 2 / Enterprise|
| LiteLLM Proxy | ✓ | ✓ | ✓ | ✓ | pluggable | ✓ | Yes | MIT |
| Portkey | ✓ | ✓ | ✓ | ✓ | ✓ built-in | ✓ | Agent mode | Proprietary |
| Apigee AI Gateway | ✓ | ✓ | ✓ policy| ✓ | ✓ policy | ✓ | Hybrid | GCP pay-per-use |

Kong AI Gateway
Kong AI Gateway is the AI plugin suite layered on top of Kong Gateway. If a platform team already runs Kong for REST or gRPC traffic, the AI plugins (ai-proxy, ai-prompt-guard, ai-semantic-cache, ai-rate-limiting-advanced) can be added without a separate control plane [2]. Key management integrates with external secret stores (e.g. HashiCorp Vault), so upstream provider credentials never reside in the plugin configuration.
Best fit: teams with an existing Kong deployment who want to avoid a second API gateway control plane. The semantic caching plugin requires a vector-capable backend such as Redis Stack [2]. Full token-level OTel span attributes require either Kong Enterprise or manual plugin composition.
LiteLLM Proxy
Best fit: air-gapped or hybrid environments where open-weights models (served by an inference runtime such as vLLM or llama.cpp) are the upstream, and where the MIT licence and absence of egress requirements matter. The guardrail pipeline is pluggable but not batteries-included; teams compose validators such as Llama Guard themselves.
Portkey
Portkey is a developer-experience-focused gateway with a managed SaaS control plane and an optional self-hosted proxy agent. Its headline differentiator is a pre-built guardrail library (PII detectors, topic filters, toxicity classifiers) that attaches to any route without custom code. Semantic caching uses a configurable cosine-similarity threshold and is documented as a gateway-level feature [5].
Best fit: multi-tenant SaaS platforms where the built-in guardrail library substantially reduces time to a production-safe route. In agent mode the data path stays on-cluster; only metadata reaches the SaaS control plane. Fully air-gapped deployments require the enterprise self-hosted tier. The proprietary licence introduces vendor-dependency risk that should be weighed against the guardrail convenience.
Apigee AI Gateway
Apigee AI Gateway adds LLM-aware policies on top of the Apigee X/Hybrid API management platform. Token rate-limiting is handled by dedicated PromptTokenLimit and LLMTokenQuota policies [6]. The Model Armor integration adds SanitizeUserPrompt and SanitizeModelResponse policies [7], and semantic caching is handled by a separate SemanticCacheLookup policy. Apigee Hybrid can run the data plane on a Kubernetes-hosted runtime, extending policy enforcement to on-premises nodes.
Best fit: GCP-centric stacks where Apigee is already the API governance layer and teams want a single policy store for AI and non-AI traffic. The licensing cost is significant; it is not the right default for teams without existing Apigee spend.
Cross-cutting gateway concerns
These concerns apply regardless of which gateway you choose.
Key and budget management
Every gateway virtualises upstream provider keys into short-lived, scope-limited tokens issued to application workloads. The canonical pattern: one upstream key per provider per environment, stored in a secrets manager; the gateway mints virtual keys with budget caps and revokes them independently of the upstream credential. A runaway agent loop or a compromised virtual key can be cut off without rotating the provider secret or redeploying sibling services.
Budget enforcement at the gateway, not at the application, is the only way to get cross-service spend visibility before the invoice. LiteLLM's virtual-key model is a well-documented concrete implementation of this pattern: per-key budget caps, per-key spend tracked in a Postgres database, duration-based budget resets, and model restrictions are all configurable at key-issuance time [4].
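The virtual-key pattern can be sketched in a few lines. The following is an in-memory illustration only, not LiteLLM's API or any gateway's actual implementation: `KeyStore`, `VirtualKey`, the `vk-` prefix, and the per-1k-token prices are all assumptions made for the example.

```python
import secrets
from dataclasses import dataclass


@dataclass
class VirtualKey:
    owner: str
    max_budget_usd: float
    allowed_models: frozenset
    spend_usd: float = 0.0
    revoked: bool = False


class KeyStore:
    """Hypothetical store; a real gateway would persist this in a database."""

    def __init__(self):
        self._keys = {}

    def issue(self, owner, max_budget_usd, allowed_models):
        """Mint a virtual key; the upstream provider secret is never exposed."""
        token = "vk-" + secrets.token_hex(8)
        self._keys[token] = VirtualKey(owner, max_budget_usd, frozenset(allowed_models))
        return token

    def authorize(self, token, model):
        """Allow the call only for a live key, a permitted model, and remaining budget."""
        key = self._keys.get(token)
        if key is None or key.revoked:
            return False
        if model not in key.allowed_models:
            return False
        return key.spend_usd < key.max_budget_usd

    def record_usage(self, token, input_tokens, output_tokens,
                     usd_per_1k_input, usd_per_1k_output):
        """Add the cost of one response, using token counts from the response body."""
        key = self._keys[token]
        key.spend_usd += (input_tokens / 1000) * usd_per_1k_input
        key.spend_usd += (output_tokens / 1000) * usd_per_1k_output

    def revoke(self, token):
        """Cut off one workload without rotating the upstream credential."""
        self._keys[token].revoked = True


store = KeyStore()
team_key = store.issue("search-team", max_budget_usd=0.01, allowed_models={"gpt-4o"})
print(store.authorize(team_key, "gpt-4o"))   # within budget
store.record_usage(team_key, 1000, 0, usd_per_1k_input=0.01, usd_per_1k_output=0.03)
print(store.authorize(team_key, "gpt-4o"))   # budget exhausted
```

The design point is that `authorize` runs before the upstream call and `record_usage` runs after it, so the cap is enforced on the next request rather than discovered on the invoice.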
Semantic caching
A semantic cache stores the embedding of a prompt and returns a cached response when a new prompt falls within a configurable cosine-similarity threshold. Research on GPT-based semantic caching found that a threshold of 0.8 is empirically effective, with cache hit rates up to 68.8% at the optimal threshold and a positive-hit rate exceeding 97% [1]. That translates directly to cost reduction on use cases with repetitive traffic — help-desk bots, FAQ assistants, and code-completion pipelines are common examples.
Three operating dials matter most:
- Similarity threshold: too loose and you serve a cached answer to a meaningfully different question; too tight and hit rate collapses. Start conservative (0.9+) and loosen while watching evaluation pass rates, not just hit rate.
- What to cache: cache deterministic, low-stakes lookups aggressively. Do not cache responses for prompts carrying per-user data or time-sensitive information — a stale cache hit there is a correctness and privacy bug.
- Guardrail placement: output guardrails must sit on the response path after the cache. A cache hit bypasses the model entirely, so any guardrail wired only inside the model call is skipped on cached responses.
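A minimal sketch of the lookup described above, assuming prompt embeddings are already available as plain lists of floats. The `SemanticCache` class is invented for the example; a production cache would use a vector index rather than a linear scan, and the embedding model is out of scope here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory cache keyed by prompt embedding."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # start conservative, loosen with evaluation data
        self._entries = []          # list of (embedding, response) pairs

    def store(self, embedding, response):
        self._entries.append((list(embedding), response))

    def lookup(self, embedding):
        """Return the best cached response at or above the threshold, else None."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.9)
cache.store([1.0, 0.0], "Reset your password from the account page.")
print(cache.lookup([1.0, 0.1]))  # near-identical prompt: served from cache
print(cache.lookup([1.0, 1.0]))  # similarity about 0.71: below threshold, a miss
```

Whatever `lookup` returns should still pass through the output guardrails before it reaches the application, since a hit never touches the model.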
Routing and failover
A gateway maintains a provider list with priority and fallback rules. When the primary endpoint returns a 429 or 5xx, the gateway retries the next provider transparently to the application. This is the mechanism that makes multi-model strategies operationally safe: a model deprecation or a provider outage is a configuration change at the gateway, not a code change across every service.
A typical priority ordering for a hybrid posture (cloud-managed Kubernetes + on-premises Kubernetes) might be: (1) a managed cloud LLM endpoint for low latency and data-residency compliance; (2) the provider's direct API for rate-limit headroom; (3) a self-hosted open-weights model as the cost floor and air-gapped fallback. The gateway makes this transparent without the application being aware of the topology.
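The failover rule can be sketched as a loop over a priority-ordered provider list. The provider names and stub functions below are placeholders for real HTTP clients, and the choice not to fail over on other 4xx responses is an assumption of this sketch rather than something every gateway does.

```python
from typing import Callable, List, Sequence, Tuple

Provider = Tuple[str, Callable[[str], Tuple[int, str]]]


def is_retryable(status: int) -> bool:
    """429 (rate limited) and 5xx (server error) trigger failover."""
    return status == 429 or 500 <= status <= 599


def route_with_failover(prompt: str, providers: Sequence[Provider]):
    """Try providers in priority order; return (provider_name, body, attempts)."""
    attempts: List[Tuple[str, int]] = []
    for name, call in providers:
        status, body = call(prompt)
        attempts.append((name, status))
        if 200 <= status <= 299:
            return name, body, attempts
        if not is_retryable(status):
            # e.g. 400 or 401: another provider is unlikely to fix a bad request.
            raise RuntimeError(f"{name} returned non-retryable status {status}")
    raise RuntimeError(f"all providers failed: {attempts}")


# Stub upstreams standing in for real HTTP clients (hypothetical behaviour).
def managed_cloud(prompt):
    return 429, "rate limited"


def direct_api(prompt):
    return 503, "unavailable"


def self_hosted(prompt):
    return 200, f"echo: {prompt}"


chain = [
    ("managed-cloud", managed_cloud),
    ("direct-api", direct_api),
    ("self-hosted", self_hosted),
]
served_by, body, attempts = route_with_failover("hello", chain)
print(served_by, attempts)
```

The `attempts` list is the record a gateway would attach to its trace, which is what makes the routing decision visible to the platform team.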
Guardrails
Guardrails intercept the request before it reaches the model (input guardrails) and the response before it reaches the application (output guardrails). Common categories:
- PII detection: redact or reject prompts containing personal data before the token leaves the cluster.
- Topic filters: restrict requests to an allowed topic set — useful for scoped assistant products.
- Jailbreak and prompt-injection detection: classifier or rule-based detection of attempts to override model behaviour, including indirect injection via tool output.
- Schema enforcement: validate that model output conforms to a declared JSON schema when a downstream tool call depends on structured output.
Guardrail pipelines add latency; profile with realistic traffic before enabling in the critical path. The operating rule: fail closed on destructive tool paths, fail open on read-only paths. Encode that per route rather than globally.
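One way to encode the per-route rule is sketched below. It assumes "fail closed" means blocking the request when a guardrail itself errors (for example a classifier timeout), while a genuine policy hit always blocks. The route paths, the regex-based email detector, and the string-match injection check are simplified stand-ins, not a production policy.

```python
import re


class GuardrailViolation(Exception):
    """Raised when a request or response must be blocked."""


EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_email(text: str) -> str:
    """PII guardrail: redact rather than reject."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def block_injection(text: str) -> str:
    """Rule-based stand-in for a jailbreak classifier."""
    if "ignore previous instructions" in text.lower():
        raise GuardrailViolation("possible prompt injection")
    return text


def flaky_classifier(text: str) -> str:
    """Simulates a guardrail whose backing service is unavailable."""
    raise TimeoutError("classifier timed out")


def apply_guardrails(text, guardrails, fail_closed):
    """Run guardrails in order; handle guardrail outages per the route policy."""
    for guard in guardrails:
        try:
            text = guard(text)
        except GuardrailViolation:
            raise  # a genuine policy hit always blocks
        except Exception as exc:
            if fail_closed:
                raise GuardrailViolation(f"{guard.__name__} failed: {exc}") from exc
            # Fail open: skip the broken guardrail and continue.
    return text


def is_blocked(text, guardrails, fail_closed) -> bool:
    try:
        apply_guardrails(text, guardrails, fail_closed)
    except GuardrailViolation:
        return True
    return False


# Hypothetical per-route policy: destructive routes fail closed.
ROUTE_FAIL_CLOSED = {"/agents/delete-records": True, "/assistants/search": False}

print(apply_guardrails("mail bob@example.com", [redact_email, block_injection], True))
print(is_blocked("hi", [flaky_classifier], ROUTE_FAIL_CLOSED["/agents/delete-records"]))
print(is_blocked("hi", [flaky_classifier], ROUTE_FAIL_CLOSED["/assistants/search"]))
```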
Observability: OTel GenAI semantic conventions
The OpenTelemetry project defines standard span attributes for generative AI client calls under the GenAI semantic conventions [8][9]. Emitting these spans at the gateway — rather than per application — is the most practical attachment point: the gateway sees every request including retries, cache hits, and routing decisions that application-level instrumentation misses.
A conformant span carries at minimum:
```yaml
# Minimum OTel GenAI span attributes (per semconv gen-ai-spans spec)
gen_ai.system: "openai"           # provider discriminator (gen_ai.provider.name in newer versions)
gen_ai.request.model: "gpt-4o"    # model requested by the application
gen_ai.response.model: "gpt-4o"   # model actually used (may differ after gateway routing)
gen_ai.usage.input_tokens: 412    # prompt token count
gen_ai.usage.output_tokens: 87    # completion token count
gen_ai.operation.name: "chat"     # "chat", "embeddings", "text_completion", etc.
```

The distinction between gen_ai.request.model and gen_ai.response.model matters in a gateway context: if the gateway reroutes from the requested model to a fallback, the two fields diverge and the mismatch is the signal that a routing event occurred. This is data you cannot reconstruct from application-level traces alone.
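As a rough illustration of using that mismatch, the sketch below flags routing events in a batch of exported span attribute dictionaries. The span data and the fallback model name are made up for the example, and the helper names are not part of any OpenTelemetry API.

```python
from typing import Iterable, Mapping


def routing_event(attrs: Mapping[str, object]) -> bool:
    """True when the served model differs from the requested model."""
    requested = attrs.get("gen_ai.request.model")
    served = attrs.get("gen_ai.response.model")
    if requested is None or served is None:
        return False  # incomplete span: no conclusion possible
    return requested != served


def fallback_rate(spans: Iterable[Mapping[str, object]]) -> float:
    """Share of spans in which the gateway served a different model."""
    spans = list(spans)
    if not spans:
        return 0.0
    return sum(routing_event(s) for s in spans) / len(spans)


# Illustrative spans; "llama-3-70b" is a hypothetical fallback model name.
spans = [
    {"gen_ai.request.model": "gpt-4o", "gen_ai.response.model": "gpt-4o"},
    {"gen_ai.request.model": "gpt-4o", "gen_ai.response.model": "llama-3-70b"},
]
print(fallback_rate(spans))
```

One caveat: some providers report a dated or versioned model name in the response, so a strict inequality can over-count. A real check would probably normalise model names before comparing them.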
The seam to the MCP tool registry
The Model Context Protocol (MCP) is an open standard — based on JSON-RPC 2.0 — that lets an agent runtime discover and invoke tools through a uniform interface [10][11]. An MCP tool registry is a catalogue of MCP servers your agents are permitted to shop from. The gateway and the registry are complementary, not competing:
- The gateway governs LLM traffic: which service can call which model, at what token cost, through which guardrail policy.
- The MCP registry governs tool availability: which tools an agent can discover, under which permissions.
The seam is the agent runtime. An agent loop calls the MCP registry to discover tools, calls the gateway to route LLM inference, and calls the discovered tool endpoints directly. The gateway does not proxy MCP tool calls; it proxies the LLM inference calls that happen between tool calls in the agent loop. Keeping that boundary clear prevents tool payloads from accidentally flowing through token-budget accounting.
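The boundary can be made concrete with a small sketch. The JSON-RPC 2.0 envelope and the `tools/list` and `tools/call` method names come from the MCP specification; the `destination` helper, the step `kind` labels, and the `lookup_ticket` tool are hypothetical names used only to illustrate which surface handles which step.

```python
import json
from itertools import count

_request_ids = count(1)


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request envelope, as used by MCP."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def destination(step):
    """Decide which control surface handles one step of the agent loop."""
    if step["kind"] == "inference":
        return "gateway"  # LLM call: subject to token budgets and guardrails
    if step["kind"] in ("tool_discovery", "tool_call"):
        return "mcp"      # tool traffic: kept out of token-budget accounting
    raise ValueError(f"unknown step kind: {step['kind']}")


discover = jsonrpc_request("tools/list")
invoke = jsonrpc_request(
    "tools/call",
    {"name": "lookup_ticket", "arguments": {"ticket_id": "T-123"}},  # hypothetical tool
)
wire = json.dumps(invoke)  # what the agent runtime would send to the MCP server

loop = [{"kind": "tool_discovery"}, {"kind": "inference"}, {"kind": "tool_call"}]
print([destination(step) for step in loop])
```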
Platform teams own both access-control surfaces: the gateway policy (who can spend on which model) and the registry policy (which tools are available to which service account). Aligning these two models — so that a service account's tool permissions and its token budget are consistent — is the day-two concern that the next article in this series covers under the MCP tool registry pattern.
Deployment posture guidance
The gateway comparison table covers organisational fit; the deployment posture determines which options are even available:
- Pure cloud (managed Kubernetes only): all four options are viable. If an API gateway already exists in the stack, extend it with AI plugins before adding a second control plane.
- Hybrid (on-premises Kubernetes + managed cloud): prefer a gateway that can span both clusters from a single configuration surface. Options with Helm-deployable data planes and cloud-hosted control planes (or fully self-hosted) handle this natively. Managed SaaS-only options require a self-hosted agent for the on-premises leg.
- Air-gapped (no public egress): only fully self-hosted OSS options qualify. A gateway with an MIT or Apache-2 licence and no mandatory call-home is the only viable category. This eliminates managed SaaS control planes entirely.
In practice, most organisations start with a self-hosted OSS proxy (lowest friction, no licence cost) and migrate to a richer gateway once caching, multi-cluster routing, or guardrail requirements justify the operational overhead. The virtual-key pattern and the OTel span contract remain stable across the migration because they are standard patterns, not gateway-specific APIs.
References
[1] Regmi and Pun, "GPT Semantic Cache: Reducing LLM Costs and Latency via Semantic Embedding Caching," arXiv:2411.05276v2, 2024. https://arxiv.org/html/2411.05276v2
[2] Kong Inc., "AI Gateway — Plugin Hub (AI category)," Kong Docs, 2024. https://docs.konghq.com/hub/?category=ai
[3] LiteLLM, "LiteLLM AI Gateway (LLM Proxy) — Getting Started," LiteLLM Docs. https://docs.litellm.ai/docs/simple_proxy
[4] LiteLLM, "Virtual Keys — Budget Caps and Spend Tracking," LiteLLM Docs. https://docs.litellm.ai/docs/proxy/virtual_keys
[5] Portkey, "Cache (Simple and Semantic)," Portkey Docs. https://portkey.ai/docs/product/ai-gateway/cache-simple-and-semantic
[6] Google Cloud, "Get started with LLM token policies | Apigee," Google Cloud Docs, 2025. https://docs.cloud.google.com/apigee/docs/api-platform/tutorials/using-ai-token-policies
[7] Google Cloud, "Get started with Apigee Model Armor policies," Google Cloud Docs, 2025. https://docs.cloud.google.com/apigee/docs/api-platform/tutorials/using-model-armor-policies
[8] OpenTelemetry, "Semantic conventions for generative client AI spans," OpenTelemetry Specification. https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
[9] OpenTelemetry, "Semantic conventions for generative AI systems," OpenTelemetry Specification. https://opentelemetry.io/docs/specs/semconv/gen-ai/
[10] Anthropic, "Introducing the Model Context Protocol," Anthropic Blog, November 2024. https://www.anthropic.com/news/model-context-protocol
[11] Model Context Protocol, "MCP Specification 2025-11-25," modelcontextprotocol.io. https://modelcontextprotocol.io/specification/2025-11-25
About the Author

asleekgeek
Senior Developer, Architect, DevOps
Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist