AI Platform Engineering & MLOps · Part XIX of 34

Prompts and tools are code — versioning, registries, and the rollback story

Prompt templates and tool definitions determine LLM system behaviour as surely as model weights. Here is the versioning, registry, and rollback architecture that keeps them under control.

12 min read·2 interactive components·6 references

Current production versionStaged versionDraft / deprecatedRollback path

In a classical ML system, the model artefact encodes almost all of the learned behaviour. In an LLM system, a substantial fraction of the system’s behaviour lives somewhere else: in the prompt template. A typical production prompt contains a system instruction, few-shot examples, an output-format specification, tool definitions, and safety guardrails. Change any of those components and the system behaves differently — even if no model weight has moved. Yet in many early LLM deployments, prompts live in environment variables, shared config files, or comment threads, with no version history, no review gate, and no rollback path.

The same logic applies to tool definitions — the JSON schemas that tell an LLM what external functions it can call, what parameters they accept, and what they do. The tool-use pattern, formalised as the ReAct framework by Yao et al. (ICLR 2023), has moved from research curiosity to production infrastructure in a short time. A tool definition is an interface contract between an LLM and the wider system. Changing it without versioning the change is equivalent to silently modifying an API contract at runtime. This article covers the versioning architecture that treats prompts and tool definitions as first-class engineering artefacts.

Three failure modes of unversioned prompts

Each of these failure modes has appeared in production deployments. They are not theoretical.

Silent regression

An engineer updates a system prompt to improve output formatting. The change looks cosmetic. In production, the model now refuses a class of requests it previously handled. The HTTP response code is still 200. There is no exception. Automated monitoring never fires. At a volume of roughly 10,000 queries per day, a 5% regression rate produces approximately 500 silent failures per day — a figure that only surfaces when a downstream metric (user satisfaction, downstream data quality) degrades. By then, correlation to the prompt change is weeks stale [1].

Aliasing failure

Downstream services reference the “current production” prompt by a logical name — often a hard-coded string — without pinning a version. The prompt is updated, the downstream service breaks, and incident response cannot identify which prompt version was live at the time of the incident because no version was ever recorded.

Rollback impossibility

A prompt change is made directly to a configuration value in a production secret store. When the change causes a regression, there is no clean rollback path because the previous value was never committed anywhere. Recovering the prior prompt requires reconstructing it from memory or chat logs.

Four requirements for a prompt registry

Any system that claims to version prompts must satisfy all four of the following properties. A system that satisfies only some of them is not a registry — it is version history, which is a weaker property.

1
Immutable versions
A committed prompt version must not change. Create a new version and deprecate the old one. This is the same requirement as for model artefacts in a model registry.
2
Staged promotion
Versions move through a lifecycle: draft → staging → production. No prompt version reaches production without passing an evaluation gate. This mirrors the model registry lifecycle from earlier in this series.
3
Alias resolution
Downstream systems should reference a logical alias such as @production rather than a pinned version identifier such as v47. When a new version is promoted, the alias resolves to the new version automatically. Services do not need code changes to pick up the updated prompt. MLflow's Prompt Registry ships alias management with a 60-second TTL cache, surfaced via the prompts:/<name>@<alias> URI format [2].
4
Lineage
Each prompt version must link to the evaluation run that validated it and the model version it was validated against. Evaluating prompt v47 against model v3 does not validate it against model v4. Lineage makes this explicit and queryable.

Registry pattern options: three approaches × four dimensions

There is no single correct answer. The right pattern depends on the team’s existing infrastructure, deployment context, and whether a SaaS dependency is acceptable. The table below maps the three main approaches against the four registry requirements.

Approach	Immutable versions	Staged promotion	Alias resolution	Lineage
Git as registry	✓ (Git SHA)	✓ (CI-enforced PR gate)	⚠ (custom service needed)	⚠ (commit messages + PR links)
MLflow / Comet Opik (self-hosted)	✓	✓	✓ (60 s TTL cache)	✓ (linked to model run)
LangSmith / W&B Weave (SaaS)	✓	✓	✓	✓ (trace-linked)

Option 1: Git as the prompt registry

Prompt templates are stored as files in a version-controlled repository. Version tags (e.g. v47) or annotated branches mark promotion points. A thin wrapper service resolves logical aliases to the current tagged commit.

Immutable versions: Git SHA provides this natively. Staged promotion: PR review gates can enforce this if promotion rules are encoded in CI. Alias resolution: Requires a custom query layer (a thin service under 200 lines is sufficient) — not provided out of the box. Lineage: Approximated via commit messages and PR links to evaluation run IDs; not queryable through an API.

This is the minimum viable path and the right starting point for teams without existing registry infrastructure. It is fully self-hostable and compatible with regulated or air-gapped environments. The alias resolution gap is the primary limitation: implement a thin lookup service in front of the repository and treat it as a platform-owned component.

Option 2: A prompt registry integrated with model tracking

Several ML tracking tools now offer first-class prompt registry features. MLflow’s Prompt Registry (available in MLflow 3.x) ships explicit alias management: mlflow.genai.set_prompt_alias("name", alias="production", version=N) promotes a version without redeploying anything. The prompts:/<name>@<alias> URI format integrates with the existing model registry client [2]. LangSmith provides commit-based versioning with @production and @staging as reserved environment labels; alias-tagged prompts resolve without code changes [3].

All four requirements are first-class. The integration trade-off is deployment context. Self-hosted options (MLflow, Comet Opik with self-hosted mode) are compatible with regulated deployments. SaaS-only tools introduce a dependency that may be prohibited in air-gapped or regulated environments.

Option 3: LLM observability tooling with built-in prompt management

A third tier of tooling treats prompt versioning as a feature of the observability layer rather than a separate registry. Tools in this category — including LangSmith, Weights & Biases Weave, and Comet Opik — store prompt versions alongside execution traces, which makes it easy to correlate a prompt version with the production behaviour it produced. The lineage story is particularly strong here: the prompt version, the model, the inputs, the outputs, and any evaluation scores are all recorded in the same trace.

The limitation is portability. These tools are primarily SaaS products (with partial self-hosted options). For regulated deployments, the export and data-residency posture of the chosen tool must be validated before adoption. When it is acceptable, this tier offers the richest developer experience at the cost of an external dependency.

The simulator below lets you commit a draft prompt, run the eval gate, promote to @production, and roll back — exactly as the registry lifecycle described above.

Prompt Registry Simulator

Edit the prompt, commit it as a new draft, run the eval gate, promote to production, or roll back — exactly as described in the article.

Prompt Editor (draft)

Registry Versions

@production → v46

v46production

You are a helpful, accurate assistant. Answer questions concisely. Do not fabricate facts.

Eval: 89% ✓ passed

2025-12-04

v45deprecated

You are a helpful assistant. Answer questions concisely.

Eval: 72% ✓ passed

2025-11-10

Activity Log

Registry initialised. v45 deprecated, v46 in production.

Tool versioning: the same problem, a harder interface

Tool definitions — the JSON Schema-typed schemas that describe what external functions an LLM can call — are a parallel versioning problem with a harder interface contract. The ReAct pattern (Yao et al., ICLR 2023) formalised the observation that interleaving reasoning traces with discrete tool invocations is a reproducible and effective approach to complex task completion [4]. As this pattern has moved into production systems at scale, the operational properties of tool definitions have come into focus.

A tool definition specifies three things: (1) the tool’s name and description, which influence when the model decides to call it; (2) the parameter schema, which determines what arguments the model passes; and (3) the implementation reference, which maps the schema to an actual function or API endpoint. Changing any of these constitutes a behavioural change even if no code was deployed.

Name/description changes change model behaviour by altering when the tool is called. A tool named search_documents that is renamed to retrieve_context will be called in different situations by the same model, because the model uses the description to decide whether to invoke it. Parameter schema changes that add required fields or rename existing ones cause argument-parsing failures in any caller that has not been updated. These failures may surface as model-generated JSON that no longer validates against the schema, or as runtime exceptions from the tool implementation receiving unexpected fields.

The minimum viable approach is co-versioning: tool definitions are versioned alongside the prompt templates that reference them. A single version identifier covers the complete prompt + tool set that was evaluated together. A tool definition change requires a new prompt version, which in turn requires a new evaluation run before promotion to production.

The diff viewer below illustrates the blast radius of four common tool definition changes — from tool renames to adding required fields — and the safe upgrade path for each.

Tool Schema Diff Viewer

Select a change scenario to see the schema diff, blast radius, and safe upgrade path.

Renaming search_documents → retrieve_context changes when the model calls this tool.

Before

{
  "name": "search_documents",
  "description": "Search internal documents by keyword.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string"
      },
      "limit": {
        "type": "number",
        "default": 10
      }
    },
    "required": [
      "query"
    ]
  }
}

After

{
  "name": "retrieve_context",
  "description": "Retrieve context passages by semantic similarity.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string"
      },
      "limit": {
        "type": "number",
        "default": 10
      }
    },
    "required": [
      "query"
    ]
  }
}

Diff

- "name": "search_documents",

+ "name": "retrieve_context",

- "description": "Search internal documents by keyword.",

+ "description": "Retrieve context passages by semantic similarity.",

"parameters": { ... }

Blast Radius

highCall frequency change

The model uses name + description to decide when to invoke the tool. A semantic rename will alter invocation patterns — the tool may be called less or more frequently.

highCallers referencing old name

Any code that pattern-matches on the tool name in the response (e.g. tool_use.name === "search_documents") will silently miss calls.

mediumEval suite breakage

Eval cases expecting tool_use calls with the old name will fail, correctly signalling a regression before production.

Safe Upgrade Path

1Expose both names simultaneously in the tool list for one release cycle.
2Increment the prompt version to include the renamed tool definition.
3Run eval suite — verify invocation patterns remain equivalent.
4Deprecate old name in the next release after monitoring confirms parity.

The Model Context Protocol as a versioned tool interface

The Model Context Protocol (MCP), published by Anthropic in November 2024 and donated to the Linux Foundation’s AI & Data Foundation in December 2025, standardises the tool interface contract between an LLM client and an MCP server [5]. Under MCP, tool definitions are JSON Schema-typed contracts exposed by the server. An LLM client discovers available tools via a tools/list call and invokes them via tools/call. The protocol version is negotiated at connection time via the initialize handshake.

From a versioning perspective, MCP promotes tool definitionsto a first-class, machine-readable interface. Changing a tool’s name, description, or parameter schema in an MCP server constitutes a breaking change to the interface. The MCP community is actively developing proposals for semantic versioning of tool definitions (tracked in community proposals SEP-1575 for per-tool versioning and SEP-1400 for protocol-level semantic versioning) [6]. In the interim, the operational recommendation is to treat MCP server tool definitions as an interface specification, version them in source control alongside the server implementation, and apply the same PR-and-review gate that applies to any breaking API change.

Practically, this means each MCP server should expose a stable version identifier that clients can log alongside traces. When an MCP server is updated, the version changes, and any downstream systems that depend on specific tool names or schemas should pin to the version they tested against until they are ready to migrate.

The analogy is exact: MCP servertool definitions are public APIs. Treat them accordingly — semantic versioning, changelog, migration guides. The only difference from a REST API is that the “client” is an LLM that learned the interface from the description, not from documentation it can re-read at runtime.

The evaluation gate: the mechanism that makes promotion safe

Versioning alone is not sufficient. A system that creates immutable versions but promotes them without an automated gate is still a system where a regression can reach production. The evaluation gate is what turns a version history into a safe deployment pipeline.

A minimum viable evaluation gate for prompt promotion includes four checks:

A regression suite: a fixed set of inputs with expected outputs or expected behaviour patterns. The new prompt version must match or exceed the baseline pass rate.
A capability benchmark: the new version must perform at least as well as the current production version on a defined task set. Regression-only testing catches breakage but not degradation.
A safety check: for user-facing prompts, an adversarial probe set and refusal behaviour validation. Safety properties that held for the previous version must hold for the new version.
A cost/latency budget check: the new prompt must not exceed the production SLA on token count, response latency, or cost per call. Prompt engineering changes that improve quality by adding tokens can silently break latency contracts.

The LLM-as-judge evaluation pattern — using a separate, capable LLM to score the outputs of the system under test — is the most practical mechanism for automating quality checks at scale. It is covered in the companion article in this series (article 18, eval-as-test-suite). The evaluation gate described here is the consumer of that infrastructure: it runs the suite, compares the result to a threshold, and either promotes the alias or blocks the deploy.

A rollback flow that does not require redeploying the model

This is the key operational advantage of alias-based prompt registries over prompt-in-code approaches. The rollback procedure is:

1Identify the last-known-good prompt version from the registry. In a registry with lineage, this is the version whose evaluation run passed before the regression was introduced.
2Re-promote that version’s alias to @production. In MLflow, this is a single API call: mlflow.genai.set_prompt_alias("my-prompt", alias="production", version=N-1). No application code changes. No model redeploy.
3Because downstream services reference @production rather than a pinned version, they pick up the rolled-back alias within the TTL of their alias cache (60 seconds in MLflow, near-instant in Git-tag-based resolvers with no caching).
4Record the rollback as a versioned event in the registry and open a post-incident review to understand why the regression escaped the evaluation gate.

Compare this to the alternative: a prompt baked into application code requires a code push, a CI pipeline run, a container rebuild, and a deploy — a process that commonly takes 20–40 minutes even with fast pipelines. For a production regression, a 30-minute rollback window is long. The alias-promotion model reduces this to seconds.

The key insight: The TTL cache is a feature, not a bug. A 60-second propagation window means a rollback takes effect within a minute across all services — orders of magnitude faster than any code deployment path. The cost is a 60-second window where a promoted version may already be live; in practice, a staged rollout (promote @staging first, observe, then promote @production) eliminates this risk before it reaches the full traffic volume.

The seam to the AI gateway

Prompt and tool versioning exists inside the application boundary. The AI gateway sits at the boundary between the application and the upstream model provider (whether that is a self-hosted inference server or an external API). There is a seam between the two that the prompt registry and the gateway share: the gateway enforces rate limits, cost budgets, and request routing; the prompt registry enforces what prompts can reach production and under what conditions.

The operational handshake between the two looks like this: the application pulls a prompt version from the registry using the @production alias, injects any runtime variables, and sends the assembled request through the gateway. The gateway sees the full request and can log or hash the prompt for observability — but it cannot and should not override the prompt version. Version authority belongs to the registry; enforcement authority belongs to the gateway. This separation becomes important when the gateway is a multi-tenant shared service (covered in article 21 of this series).

Choosing between the three patterns

The decision depends on four variables:

Deployment context: Air-gapped or regulated environments must use self-hostable options. Git-as-registry or self-hosted MLflow are the viable paths. SaaS-only tools are excluded.
Existing infrastructure: Teams already running a model registry based on MLflow can extend it to prompts at low incremental cost. Teams with no existing tracking infrastructure should start with Git-as-registry and add alias resolution as a thin service.
Team size and collaboration needs: Larger teams with multiple engineers editing prompts concurrently benefit from a registry UI with comparison views and commenting. Smaller teams can operate well with Git-native workflows.
MCP adoption: If the system uses MCP-server-based tool definitions, versioning the MCP server interface specification becomes a first-class concern. Treat the MCP server repository as the authoritative source for tool definitions and apply the same alias-promotion and gate pattern to MCP server version updates.

In all three cases, the underlying principle is the same: prompts and tool definitions determine system behaviour, they change frequently, and they must be subject to the same version control, review, and automated gate that any other behaviour-determining artefact carries. The tooling is secondary to establishing that principle operationally.

The starting gun: The minimum credible position is: every promptin production has a Git SHA, every change goes through a PR, and rolling back means reverting a commit. From that baseline, add alias resolution, then staged promotion, then eval-gated advancement. The gap between “prompts in environment variables” and “prompts with a Git SHA” is the most impactful single move.

References

[1] Statsig. “Prompt regression testing: Preventing quality decay.” 2025. Supporting cross-reference: “Test Before You Deploy: Governing Updates in the LLM Supply Chain.” arXiv:2604.27789 (2026). arxiv.org/abs/2604.27789
[2] MLflow. “Prompt Registry — MLflow AI Platform.” 2025. mlflow.org/docs/latest/genai/prompt-registry/. Alias lifecycle: manage-prompt-lifecycles-with-aliases
[3] LangChain. “Manage prompts — LangSmith documentation.” 2025. docs.langchain.com/langsmith/manage-prompts
[4] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR 2023. arXiv:2210.03629. arxiv.org/abs/2210.03629
[5] Anthropic / MCP community. “Model Context Protocol Specification 2025-11-25.” November 2024; donated to Linux Foundation / AAIF December 2025. Spec: modelcontextprotocol.io/specification/2025-11-25. Announcement: anthropic.com/news/model-context-protocol
[6] ToolRegistry paper: arxiv.org/pdf/2507.10593. Kumaran, I. “Evolvable MCP: A Guide to MCP Tool Versioning.” Medium, February 2026. medium.com/@kumaran.isk/evolvable-mcp

Continue the Journey

AI Platform