AI Platform Engineering & MLOps · Part XIV of 34

Model Registry as the Spine — Repository Patterns, Lifecycle States, and Curation Policy

Why a model registry is non-negotiable at platform scale, the three registry patterns, four lifecycle states with explicit gates, and curation policy as code.


Producers register versions → Lifecycle states gate promotion → Consumers pull pinned references

Every AI platform team eventually faces the same question from an auditor, a postmortem, or a new teammate: “Which model is running in production, where did it come from, and who approved it?” Without a registry, that question has no clean answer. With a registry that is only a file store, it still has no clean answer. A model registry, done properly, is governance infrastructure — it enforces lifecycle states, captures lineage, gates promotion, and gives every consumer a stable, versioned reference to pull from.

This article is Part 14 of the AI Platform Engineering & MLOps series. It builds on the serving patterns in article 13 by establishing what the registry provides to those runtimes — stable references, signed artifacts, and promotion semantics — and it sets the stage for the governance and lineage deep-dive in article 15. If you are coming from article 13, you already know how a serving runtime loads a model; this article explains how a platform team decides which model is worth loading at all.

Why a registry is non-negotiable

A folder on object storage is not a registry. It stores bytes; it does not enforce anything. The gap between “we have model files” and “we operate a registry” is entirely about governance: lifecycle states that gate promotion, lineage that records what produced the artifact, access control that limits who can promote, and a stable consumer API that lets serving runtimes and CI pipelines pin to a specific version.

Three practical gaps appear quickly when teams skip the registry:

Version drift. Two teams pull “the latest” model and get different binaries because the S3 path was overwritten. There is no version history, no rollback point.
Approval opacity. A model enters production because a data scientist copied files and an engineer updated an environment variable. No one has a record of who approved the change or what eval evidence existed.
Supply-chain blindness. No one knows whether the weights were downloaded from a verified source or whether they match what was scanned. A tampered model artifact looks identical to a clean one without a signature check.

Regulations such as the EU AI Act impose traceability, documentation, and human-oversight requirements on high-risk AI systems. A registry that enforces lifecycle gates is the technical substrate for meeting those requirements — without it, compliance reduces to paperwork that does not reflect what is actually running.

The three registry patterns

There is no single registry shape. Three patterns cover the field, each with a different center of gravity:

Pattern 1 — Artifact-store-only

Object storage with a naming convention as the “registry.” Model artifacts are blobs at paths like models/risk-classifier/v4/model.safetensors. There is no server, no lifecycle API, and no lineage record beyond what the team manually maintains in a document. This pattern is defensible only during early exploration when the model count is in single digits and there are no production consumers. The moment a second team begins consuming a model, or a model has more than one version, the artifact-store-only pattern creates operational debt that compounds quickly.

Pattern 2 — Lifecycle-aware registry

A purpose-built model registry server that tracks named, versioned models with lifecycle states, metadata, and lineage references. The artifact store (object storage) remains the backing store for weights; the registry is the metadata and governance layer on top. MLflow Model Registry is the canonical open-source implementation — a Linux Foundation project (Apache-2.0) backed by a relational metadata store and object storage for artifacts.

A lifecycle-aware registry exposes stable model URIs (models:/risk-classifier/4), records the training run that produced each version, and provides an API for moving models through states. Modern MLflow implements lifecycle semantics via aliases and tags rather than the earlier fixed stages (None/Staging/Production/Archived), which were deprecated in MLflow 2.9.0 due to their inflexibility for real MLOps workflows. An alias like @champion or @staging can be programmatically moved by a promotion workflow — the lifecycle state is the alias, not a fixed enum field.
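The alias mechanics can be sketched as follows. This is a minimal illustration, assuming a client that exposes MLflow's `set_registered_model_alias` and `get_model_version_by_alias` methods; the `InMemoryRegistry` stand-in, the `promote` helper, and the version numbers are hypothetical and exist only so the example runs without a tracking server.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    name: str
    version: str


@dataclass
class InMemoryRegistry:
    """Hypothetical stand-in exposing the two alias methods used below."""
    aliases: dict = field(default_factory=dict)

    def set_registered_model_alias(self, name: str, alias: str, version: str) -> None:
        self.aliases[(name, alias)] = version

    def get_model_version_by_alias(self, name: str, alias: str) -> ModelVersion:
        return ModelVersion(name, self.aliases[(name, alias)])


def promote(client, name: str, version: str, alias: str) -> str:
    """Move `alias` to `version` and return the pinned URI to record downstream."""
    client.set_registered_model_alias(name, alias, version)
    resolved = client.get_model_version_by_alias(name, alias)
    return f"models:/{resolved.name}/{resolved.version}"


registry = InMemoryRegistry()
promote(registry, "risk-classifier", "3", "champion")
pinned_uri = promote(registry, "risk-classifier", "4", "champion")
print(pinned_uri)
```

Passing a real `mlflow.MlflowClient()` in place of the stand-in should behave the same way, since `promote` only relies on those two method names. Returning the pinned URI rather than the alias keeps the promotion step and the consumer reference separate.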

Pattern 3 — Full-governance OCI-artifact registry

Treat model artifacts the same as container images — push them to an OCI registry (e.g. Harbor, a cloud container registry) as OCI artifacts. The OCI Image and Distribution Specifications reached version 1.1.0 in February 2024, introducing an artifactType field and a Referrers API that enable any compliant registry to store and relate arbitrary content — not just container images — with full digest-based immutability and access control. The ORAS (OCI Registry As Storage) project provides the de facto CLI and client libraries for pushing and pulling these artifacts.

The OCI pattern reuses the container supply-chain tooling an organization already operates: image scanning, artifact signing, admission-policy enforcement (e.g. via Kyverno or OPA Gatekeeper), and cross-cluster replication. Lifecycle semantics are encoded in OCI tags — a tag like v4-staging or v4-production. Projects like KAITO (which became a CNCF sandbox project) use ORAS to separate model weights from inference-runtime images, significantly reducing build times for large models. The KAITO project documentation notes that the workflow for onboarding large AI models is reduced from hours to minutes.
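To make the content-addressable shape concrete, here is a rough sketch of the manifest a client such as ORAS would build for a model artifact. It follows the OCI 1.1 guidance of setting `artifactType` and pointing `config` at the empty descriptor; the artifact type string, the layer media type, the registry hostname, and the placeholder weights are all assumptions for illustration.

```python
import hashlib
import json


def oci_digest(blob: bytes) -> str:
    """Content address in the form the OCI specs use: algorithm:hex."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def descriptor(media_type: str, blob: bytes) -> dict:
    return {"mediaType": media_type, "digest": oci_digest(blob), "size": len(blob)}


def model_manifest(weights: bytes, artifact_type: str) -> dict:
    """Build a minimal image manifest that carries model weights as one layer."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "artifactType": artifact_type,
        # Artifacts without a real config point at the empty JSON blob "{}".
        "config": descriptor("application/vnd.oci.empty.v1+json", b"{}"),
        "layers": [descriptor("application/octet-stream", weights)],
    }


weights = b"placeholder bytes standing in for model.safetensors"
manifest = model_manifest(weights, "application/vnd.example.model.v1")

# A registry addresses the manifest by the digest of the exact bytes pushed.
manifest_bytes = json.dumps(manifest, separators=(",", ":"), sort_keys=True).encode()
pinned_ref = f"registry.example.com/models/risk-classifier@{oci_digest(manifest_bytes)}"
print(pinned_ref)
```

Because the reference ends in a digest rather than a tag, any change to the weights produces a different layer digest, a different manifest, and therefore a different reference. That property is what makes digest pinning tamper-evident.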

Comparing the three patterns

Five dimensions across the three patterns (P1 = artifact-store-only, P2 = lifecycle-aware registry, P3 = OCI-artifact registry):

Dimension	P1 — Artifact-store-only	P2 — Lifecycle-aware	P3 — OCI-artifact
Artifact storage	Object blobs at a path	Object storage referenced via registry metadata	OCI manifest + layers in a content-addressable registry
Lifecycle support	None — naming convention only	Aliases and tags with API-driven promotion	Tag-encoded, enforced via CI or admission policy
Lineage	Manual or absent	Training-run linkage via tracking server	Referrers API links signatures and SBOMs; training lineage brought externally
Signing	Not native	Brought alongside (e.g. Cosign against artifact store)	Native — same Cosign + Sigstore pipeline as container images
GitOps integration	Manual path updates	Registry webhook triggers manifest bump; GitOps controller reconciles	First-class — image update automation watches the OCI tag and commits the manifest update automatically


The four lifecycle states and their gates

Independent of which registry pattern you choose, models move through four lifecycle states before ending in a terminal archived state. The implementation differs per pattern (aliases vs. OCI tags), but the state machine is the same:

Experimental

Any team member with registry write access can register a model version here. There is no criteria gate on entry — the purpose is to capture in-flight training runs, ablations, and sandbox experiments in a versioned, auditable location. Models in experimental are not served to production consumers. The gate on entry to staging is where curation begins.

Staging

Promotion from experimental to staging requires three conditions recorded in the registry before a platform engineer can authorize the transition: a named owning team, a linked training run with lineage intact, and a passing eval-suite result. Models in staging are reachable for shadow or canary traffic. Shadow traffic mirrors real requests to the candidate and compares its outputs against the production model without returning them to users; a canary routes a small, bounded slice of live traffic to the candidate.

Production

Promotion from staging to production adds a further gate: the shadow/canary traffic eval passed, a rollback plan is documented, and an on-call runbook for the new version exists. After promotion, the serving manifest in the GitOps repository is updated — either by a human PR or by an automated registry-update controller. Models in production serve live user traffic.

Retired

Transition to retired starts a sunset clock. The model remains servable for consumers that have not yet migrated. When the clock expires, the model moves to archived — the artifact is retained for audit and reproducibility but the serving infrastructure is not required to load it. Without a sunset clock and an enforced migration path, production registries accumulate deprecated models indefinitely.
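The sunset clock reduces to a small predicate. The sketch below assumes a 90-day window and a consumer count supplied by some external check (for example, serving-manifest scans); both are illustrative choices, not values from any particular registry.

```python
from datetime import datetime, timedelta, timezone


def can_archive(retired_at: datetime, sunset_days: int,
                active_consumers: int, now: datetime) -> bool:
    """True only when the sunset window has elapsed and no consumer remains."""
    sunset_expired = now >= retired_at + timedelta(days=sunset_days)
    return sunset_expired and active_consumers == 0


retired_at = datetime(2025, 1, 1, tzinfo=timezone.utc)

# Window still open: not archivable even with zero consumers.
too_early = can_archive(retired_at, 90, 0, datetime(2025, 2, 1, tzinfo=timezone.utc))
# Window elapsed, but two consumers have not migrated yet.
still_used = can_archive(retired_at, 90, 2, datetime(2025, 6, 1, tzinfo=timezone.utc))
# Window elapsed and nobody depends on the model.
ready = can_archive(retired_at, 90, 0, datetime(2025, 6, 1, tzinfo=timezone.utc))

print(too_early, still_used, ready)
```

Requiring both conditions keeps the clock from archiving a model that a lagging consumer still loads, which is where the enforced migration path comes in.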

The gate summary

experimental → staging: named owner + lineage complete + eval pass. Authorized by platform engineer.
staging → production: shadow/canary eval pass + rollback plan + on-call runbook. Authorized by platform team lead.
production → retired: replacement in production + sunset clock set + migration path documented. Authorized by model owner.
retired → archived: sunset clock expired + zero production consumers confirmed. Automatic or platform-authorized.
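The gate summary above can be expressed as data plus one transition function. This is a sketch under stated assumptions: the condition keys, role names, and the `promote` signature are hypothetical, and the automatic path for retired → archived is simplified to a single `platform` role.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    target: str
    conditions: frozenset
    authorizer: str


GATES = {
    "experimental": Gate("staging",
                         frozenset({"named_owner", "lineage_complete", "eval_pass"}),
                         "platform_engineer"),
    "staging": Gate("production",
                    frozenset({"canary_eval_pass", "rollback_plan", "oncall_runbook"}),
                    "platform_team_lead"),
    "production": Gate("retired",
                       frozenset({"replacement_in_production", "sunset_clock_set",
                                  "migration_path_documented"}),
                       "model_owner"),
    "retired": Gate("archived",
                    frozenset({"sunset_clock_expired", "zero_production_consumers"}),
                    "platform"),
}


class PromotionBlocked(Exception):
    """Raised when gate evidence or the authorizing role is missing."""


def promote(state: str, evidence: set, actor_role: str, audit_log: list) -> str:
    gate = GATES.get(state)
    if gate is None:
        raise PromotionBlocked(f"{state!r} has no outgoing transition")
    missing = gate.conditions - set(evidence)
    if missing:
        raise PromotionBlocked(f"missing evidence: {sorted(missing)}")
    if actor_role != gate.authorizer:
        raise PromotionBlocked(f"{actor_role!r} cannot authorize {state} -> {gate.target}")
    audit_log.append({"from": state, "to": gate.target,
                      "authorized_by": actor_role, "evidence": sorted(gate.conditions)})
    return gate.target


audit_log = []
state = promote("experimental", {"named_owner", "lineage_complete", "eval_pass"},
                "platform_engineer", audit_log)

try:
    # The on-call runbook is missing, so this promotion is refused.
    promote(state, {"canary_eval_pass", "rollback_plan"}, "platform_team_lead", audit_log)
    blocked_reason = None
except PromotionBlocked as exc:
    blocked_reason = str(exc)

print(state, blocked_reason)
```

Keeping the gates in a table means a policy change is a reviewed data edit, and every successful transition leaves an audit entry naming the role that authorized it.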


Curation policy as code

A lifecycle state machine is only as strong as the gates that enforce it. “Policy as code” means the gates are executable checks in CI, not checklists in a wiki.

A concrete example for the experimental → staging gate:

promotion-gate.yaml (conceptual)

# Promotion gate: experimental -> staging
# Runs as a step in the training pipeline CI job

steps:
  - name: Check lineage completeness
    # Fails if training_run_id, dataset_version, or code_commit
    # is absent from the candidate model version metadata
    run: |
      python check_lineage.py \
        --model-name $MODEL_NAME \
        --model-version $MODEL_VERSION

  - name: Run eval suite against candidate
    run: |
      python run_eval.py \
        --model-name $MODEL_NAME \
        --model-version $MODEL_VERSION \
        --baseline-alias champion \
        --min-improvement 0.0

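  - name: Verify artifact signature
    # cosign verify expects a full OCI reference pinned by digest
    # (registry/repo@sha256:...), not a bare digest
    run: |
      cosign verify --key cosign.pub \
        "$ARTIFACT_DIGEST"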

  - name: Promote to staging
    # Only executes if all prior steps pass
    run: |
      python promote_model.py \
        --model-name $MODEL_NAME \
        --model-version $MODEL_VERSION \
        --target-alias staging

The invariant: no model can enter staging without complete lineage, a signed artifact, and a passing eval result. Each check writes evidence back to the registry as a tag, so the audit trail records not just that the model is in staging, but which CI run produced the evidence that justified the promotion.
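What the lineage and eval steps might compute can be sketched in a few lines. The pipeline's `check_lineage.py` and `run_eval.py` are not shown in this article, so the functions, tag keys, and metric values below are assumptions; the eval comparison also assumes a metric where higher is better.

```python
REQUIRED_LINEAGE = ("training_run_id", "dataset_version", "code_commit")


def missing_lineage(metadata: dict) -> list:
    """Return the lineage fields that are absent or empty."""
    return [key for key in REQUIRED_LINEAGE if not metadata.get(key)]


def eval_gate(candidate: float, baseline: float, min_improvement: float = 0.0) -> bool:
    """Pass when the candidate beats the baseline by at least min_improvement."""
    return candidate - baseline >= min_improvement


def evidence_tags(ci_run_id: str, metadata: dict,
                  candidate: float, baseline: float) -> dict:
    """Build the tags a CI job would write back to the model version."""
    missing = missing_lineage(metadata)
    return {
        "gate.lineage": "pass" if not missing else "fail:" + ",".join(missing),
        "gate.eval": "pass" if eval_gate(candidate, baseline) else "fail",
        "gate.ci_run": ci_run_id,
    }


complete = {"training_run_id": "run-123", "dataset_version": "2025-01", "code_commit": "abc1234"}
incomplete = {"training_run_id": "run-124", "dataset_version": ""}

good = evidence_tags("ci-981", complete, candidate=0.91, baseline=0.89)
bad = evidence_tags("ci-982", incomplete, candidate=0.85, baseline=0.89)
print(good)
print(bad)
```

In a lifecycle-aware registry, each key and value could be written with a call such as MLflow's `set_model_version_tag`, so the promotion step reads its evidence from the registry rather than trusting the CI job's exit code alone.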

The role split

Curation requires explicit role accountability. The most common failure mode is “no one is accountable” — a model enters production because everyone assumed someone else’s check ran. The registry’s access control encodes the role split:

Any data scientist or ML engineer can register to experimental.
A platform engineer (or an automated CI agent acting on their behalf) can promote to staging when gates pass.
A platform team lead can promote to production after the additional rollout gate passes.
A model owner (with platform-team co-authorization) can retire a model.

This role split is enforced by the registry’s permission model — not by convention. An engineer who attempts to promote a model to production without the required role gets a permission denial, not a polite reminder.

Registry-promotion as a GitOps trigger

The registry and the GitOps repository are two systems with a defined bridge. The registry is the authority on which model version is approved; the Git repository is the authority on what is deployed. The bridge is a controller that watches the registry for state transitions and commits manifest updates to Git when a promotion occurs.

For the OCI-artifact pattern, this bridge is a solved problem. Argo CD Image Updater monitors an OCI registry for new image digests or tags matching a version constraint, and writes the updated reference back to the Git manifest. Flux’s image-automation-controller provides the equivalent for Flux-managed clusters: the image-reflector-controller scans the registry, and the automation controller patches the YAML and commits to Git.

For the lifecycle-aware registry pattern, the same bridge is built with a webhook. When a model is promoted to the production alias, the registry fires a webhook that triggers a CI job to update the serving manifest in Git. The GitOps controller then reconciles the change. A promotion event in the registry deterministically drives a deployment.

The serving manifest references the model by a fully-pinned version, not by a lifecycle alias. Every manifest change is a Git commit with a message that records the source event, providing a clean audit trail that links the deployed version to its registry promotion.
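The webhook-driven bridge can be sketched as a pure function that turns a promotion event into a manifest patch and a commit message. The event payload fields, the `storageUri` manifest key, and the commit-message format are all hypothetical; a real job would also clone the repository and push the commit.

```python
import re

# Matches pinned registry URIs such as models:/risk-classifier/3
MODEL_URI = re.compile(r"models:/(?P<name>[\w.-]+)/(?P<version>\d+)")


def apply_promotion(manifest: str, event: dict) -> tuple:
    """Rewrite the pinned version for the promoted model; leave other models alone."""
    name, version = event["model_name"], str(event["version"])
    pinned = f"models:/{name}/{version}"

    def swap(match: re.Match) -> str:
        return pinned if match.group("name") == name else match.group(0)

    updated = MODEL_URI.sub(swap, manifest)
    message = (f"chore(models): pin {name} to v{version} "
               f"(registry promotion {event['event_id']}, alias {event['alias']})")
    return updated, message


manifest = """\
spec:
  model:
    storageUri: models:/risk-classifier/3
  shadow:
    storageUri: models:/fraud-scorer/12
"""

event = {"event_id": "evt-42", "model_name": "risk-classifier",
         "version": 4, "alias": "production"}

updated, message = apply_promotion(manifest, event)
print(updated)
print(message)
```

The alias appears only in the commit message. The manifest itself carries the resolved version, so a later alias move cannot change what is deployed without a new commit.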

Consuming curated models

Three consumer roles interact with the registry in distinct ways:

Data scientist or ML engineer in local development

Pin to a specific version, not a lifecycle alias — an alias shifts when someone promotes a new model mid-notebook, breaking reproducibility. Cache the artifact locally; configure the cache root once and document it in the team runbook. Authenticate via SSO or a scoped personal token rather than a shared secret.

CI pipeline

Pull a pinned version. Cache between runs. Verify the signature before using the artifact — CI is the right place to catch a tampered artifact, not the admission webhook at deploy time. A signature failure in CI surfaces hours earlier than one at rollout. The eval result written back to the registry as a tag by the CI job is the evidence the promotion gate requires.

Serving runtime at deployment

The serving manifest carries a fully-pinned model reference. After a rollout, assert the runtime’s /info or /version endpoint in a smoke test to confirm the served version matches the manifest. Silent deploys — where the GitOps controller appeared to succeed but the node is serving a stale cached model — are real failure modes that this check catches before users do.
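A post-rollout smoke check might look like the sketch below. The `/version` path, the `model_version` response field, and the base URL are assumptions, since serving runtimes differ in what they expose; the fetch function is injectable so the check can be exercised without a live endpoint.

```python
import json
from urllib.request import urlopen


class RolloutMismatch(Exception):
    """Raised when the runtime reports a different model than the manifest pins."""


def fetch_served_version(base_url: str, timeout: float = 5.0) -> str:
    """Read the served model reference from the runtime's version endpoint."""
    with urlopen(f"{base_url}/version", timeout=timeout) as resp:
        return json.load(resp)["model_version"]


def check_rollout(expected: str, base_url: str = "http://localhost:8080",
                  fetch=fetch_served_version) -> str:
    served = fetch(base_url)
    if served != expected:
        raise RolloutMismatch(f"manifest pins {expected!r} but runtime serves {served!r}")
    return served


# Stand-ins for a live runtime: one healthy, one serving a stale cached model.
healthy = lambda base_url: "models:/risk-classifier/4"
stale = lambda base_url: "models:/risk-classifier/3"

confirmed = check_rollout("models:/risk-classifier/4", fetch=healthy)

try:
    check_rollout("models:/risk-classifier/4", fetch=stale)
    stale_detected = False
except RolloutMismatch:
    stale_detected = True

print(confirmed, stale_detected)
```

Running this as the last step of the rollout job turns a silent stale-cache deploy into a failed pipeline.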

Common pitfalls

Running two registries in parallel. Experiment tracking in one tool, production artifacts in another. Sources of truth diverge and teams lose track of which version is serving.
Baking the model into the container image. The runtime image and the model artifact have different lifecycles. Coupling them means every model quality update is a container rebuild and every runtime security patch triggers a full model re-deploy cycle.
Lifecycle aliases in production manifests. A manifest referencing @champion or the tag latest silently picks up the next promoted version without a manifest change. GitOps history becomes unreliable. Use fully-pinned version references; let automation write the pin.
Eval-suite gaming. The eval suite that gates promotion becomes the eval suite teams optimize for. Rotate the eval benchmark periodically and keep a held-out portion of the eval set that model teams cannot see or tune against.
No deprecation discipline. Without a sunset clock and enforced migration path, the registry becomes a museum and the serving infrastructure carries dead weight.
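The pinned-reference rule from the list above is easy to enforce with a small lint in CI. This is a sketch under assumptions: it treats `models:/name/<number>` and `oci://...@sha256:<digest>` as the only acceptable forms, and the `oci://` prefix and sample manifest are illustrative conventions rather than a standard.

```python
import re

REF = re.compile(r"(models:/\S+|oci://\S+)")
PINNED_REGISTRY_URI = re.compile(r"models:/[\w.-]+/\d+")
PINNED_OCI_REF = re.compile(r"oci://\S+@sha256:[0-9a-f]{64}")


def unpinned_refs(manifest: str) -> list:
    """Return every model reference that floats on an alias or a tag."""
    bad = []
    for ref in REF.findall(manifest):
        if not (PINNED_REGISTRY_URI.fullmatch(ref) or PINNED_OCI_REF.fullmatch(ref)):
            bad.append(ref)
    return bad


digest = "a" * 64  # placeholder digest for the example
manifest = f"""\
primary: models:/risk-classifier/4
fallback: models:/risk-classifier@champion
sidecar: oci://registry.example.com/models/embedder:latest
ranker: oci://registry.example.com/models/ranker@sha256:{digest}
"""

violations = unpinned_refs(manifest)
print(violations)
```

Failing the pull request when `violations` is non-empty keeps aliases and floating tags out of the GitOps history while still letting automation write the pin.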

References

[1] Linux Foundation. “The MLflow Project Joins Linux Foundation.” Press release, June 2020. linuxfoundation.org
[2] MLflow Project. “ML Model Registry.” MLflow documentation. mlflow.org
[3] MLflow GitHub. “RFC: deprecating model registry stages.” Issue #10336, mlflow/mlflow. github.com/mlflow/mlflow
[4] Open Container Initiative. “OCI Image and Distribution Specs v1.1 Releases.” OCI blog, March 2024. opencontainers.org
[5] ORAS Project. “OCI Registry As Storage.” oras.land
[6] KAITO Project. “Model As OCI Artifacts.” kaito-project.github.io
[7] Kubernetes Blog. “Kubernetes 1.31: Read Only Volumes Based On OCI Artifacts (alpha).” August 2024. kubernetes.io
[8] Argo CD Image Updater. Official documentation. argocd-image-updater.readthedocs.io
[9] Flux CD. “Automate image updates to Git.” fluxcd.io
