Model Registry as the Spine — Repository Patterns, Lifecycle States, and Curation Policy

10 min read · asleekgeek
[Figure: model lifecycle flow from experimental through staging to production, with registry gates]

A model registry is governance infrastructure, not a file store.

Every AI platform team eventually faces the same question from an auditor, a postmortem, or a new teammate: "Which model is running in production, where did it come from, and who approved it?" Without a registry, that question has no clean answer. With a registry that is only a file store, it still has no clean answer. A model registry, done properly, is governance infrastructure — it enforces lifecycle states, captures lineage, gates promotion, and gives every consumer a stable, versioned reference to pull from.

This article is Part 4 of the AI Platform Engineering & MLOps series. It builds on the serving patterns in article 13 by establishing what the registry provides to those runtimes — stable references, signed artifacts, and promotion semantics — and it sets the stage for the governance and lineage deep-dive in article 15. If you are coming from article 13, you already know how a serving runtime loads a model; this article explains how a platform team decides which model is worth loading at all.

Why a registry is non-negotiable

A folder on object storage is not a registry. It stores bytes; it does not enforce anything. The gap between "we have model files" and "we operate a registry" is entirely about governance: lifecycle states that gate promotion, lineage that records what produced the artifact, access control that limits who can promote, and a stable consumer API that lets serving runtimes and CI pipelines pin to a specific version.

Three practical gaps appear quickly when teams skip the registry:

  • Version drift. Two teams pull "the latest" model and get different binaries because the S3 path was overwritten. There is no version history, no rollback point.
  • Approval opacity. A model enters production because a data scientist copied files and an engineer updated an environment variable. No one has a record of who approved the change or what eval evidence existed.
  • Supply-chain blindness. No one knows whether the weights were downloaded from a verified source or whether they match what was scanned. A tampered model artifact looks identical to a clean one without a signature check.

Regulations such as the EU AI Act impose traceability, documentation, and human-oversight requirements on high-risk AI systems. A registry that enforces lifecycle gates is the technical substrate for meeting those requirements — without it, compliance reduces to paperwork that does not reflect what is actually running.

The three registry patterns

There is no single registry shape. Three patterns cover the field, each with a different center of gravity:

Pattern 1 — Artifact-store-only

Object storage with a naming convention as the "registry." Model artifacts are blobs at paths like models/risk-classifier/v4/model.safetensors. There is no server, no lifecycle API, and no lineage record beyond what the team manually maintains in a document. This pattern is defensible only during early exploration when the model count is in single digits and there are no production consumers. The moment a second team begins consuming a model, or a model has more than one version, the artifact-store-only pattern creates operational debt that compounds quickly.

Pattern 2 — Lifecycle-aware registry

A purpose-built model registry server that tracks named, versioned models with lifecycle states, metadata, and lineage references. The artifact store (object storage) remains the backing store for weights; the registry is the metadata and governance layer on top. MLflow Model Registry is the canonical open-source implementation — a Linux Foundation project (Apache-2.0) backed by a relational metadata store and object storage for artifacts.

A lifecycle-aware registry exposes stable model URIs (models:/risk-classifier/4), records the training run that produced each version, and provides an API for moving models through states. Modern MLflow implements lifecycle semantics via aliases and tags rather than the earlier fixed stages (None/Staging/Production/Archived), which were deprecated in MLflow 2.9.0 due to their inflexibility for real MLOps workflows. An alias like @champion or @staging can be programmatically moved by a promotion workflow — the lifecycle state is the alias, not a fixed enum field.
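The alias mechanics can be sketched without a running server. The class below is a hypothetical in-memory stand-in, not MLflow's API; in MLflow itself the corresponding client calls are MlflowClient.set_registered_model_alias and MlflowClient.get_model_version_by_alias. It illustrates that a pinned version URI stays fixed while an alias moves when a promotion happens.

```python
from dataclasses import dataclass, field


@dataclass
class AliasRegistry:
    """Hypothetical in-memory stand-in for a lifecycle-aware registry."""

    versions: dict[str, set[int]] = field(default_factory=dict)
    aliases: dict[tuple[str, str], int] = field(default_factory=dict)

    def register(self, name: str, version: int) -> str:
        """Record a version and return its pinned URI."""
        self.versions.setdefault(name, set()).add(version)
        return f"models:/{name}/{version}"

    def set_alias(self, name: str, alias: str, version: int) -> None:
        """Point an alias at an existing version; re-pointing it is the promotion."""
        if version not in self.versions.get(name, set()):
            raise KeyError(f"{name} has no version {version}")
        self.aliases[(name, alias)] = version

    def resolve(self, name: str, alias: str) -> str:
        """Resolve an alias to the pinned URI it currently points at."""
        return f"models:/{name}/{self.aliases[(name, alias)]}"


registry = AliasRegistry()
registry.register("risk-classifier", 3)
pinned_v4 = registry.register("risk-classifier", 4)

registry.set_alias("risk-classifier", "champion", 3)
before = registry.resolve("risk-classifier", "champion")
registry.set_alias("risk-classifier", "champion", 4)  # promotion moves the alias
after = registry.resolve("risk-classifier", "champion")
```

The same alias resolves to two different versions across the promotion, which is why consumers that need reproducibility hold the pinned URI rather than the alias.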

Pattern 3 — Full-governance OCI-artifact registry

Treat model artifacts the same as container images — push them to an OCI registry (e.g. Harbor, a cloud container registry) as OCI artifacts. The OCI Image and Distribution Specifications reached version 1.1.0 in February 2024, introducing an artifactType field and a Referrers API that enable any compliant registry to store and relate arbitrary content — not just container images — with full digest-based immutability and access control. The ORAS (OCI Registry As Storage) project provides the de facto CLI and client libraries for pushing and pulling these artifacts.
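Digest-based immutability is easy to illustrate with the standard library. The sketch below computes a digest in the sha256:<hex> form that OCI registries use for content addressing and checks a blob against a pinned digest. The byte strings are placeholders, and the function names are illustrative rather than part of ORAS or any registry client.

```python
import hashlib


def oci_digest(blob: bytes) -> str:
    """Return the content digest in the `sha256:<hex>` form used by OCI registries."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def verify_blob(blob: bytes, expected_digest: str) -> bool:
    """True only if the bytes hash to the digest the consumer pinned."""
    return oci_digest(blob) == expected_digest


# Placeholder bytes standing in for a real weights file.
weights_v4 = b"placeholder bytes standing in for model.safetensors"
pinned = oci_digest(weights_v4)

# A single changed byte yields a different digest, so the pin no longer matches.
tampered = weights_v4 + b"\x00"
clean_ok = verify_blob(weights_v4, pinned)
tampered_ok = verify_blob(tampered, pinned)
```

Because the reference is derived from the content, overwriting an artifact under the same digest is not possible; a changed artifact is by construction a different reference.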

The OCI pattern reuses the container supply-chain tooling an organization already operates: image scanning, artifact signing, admission-policy enforcement (e.g. via Kyverno or OPA Gatekeeper), and cross-cluster replication. Lifecycle semantics are encoded in OCI tags: a tag like v4-staging or v4-production marks the lifecycle state of that version. Projects such as KAITO (now a CNCF sandbox project) use ORAS to separate model weights from inference-runtime images, significantly reducing build times for large models. The KAITO project documentation notes that the workflow for onboarding large AI models is reduced from hours to minutes.

Comparing the three patterns

Five dimensions across the three patterns (P1 = artifact-store-only, P2 = lifecycle-aware registry, P3 = OCI-artifact registry):

  • Artifact storage: Object blobs at a path (P1) │ Object storage referenced via registry metadata (P2) │ OCI manifest + layers in a content-addressable registry (P3)
  • Lifecycle support: None — naming convention only (P1) │ Aliases and tags with API-driven promotion (P2) │ Tag-encoded, enforced via CI or admission policy (P3)
  • Lineage: Manual or absent (P1) │ Training-run linkage via tracking server (P2) │ Referrers API links signatures and SBOMs; training lineage brought externally (P3)
  • Signing: Not native (P1) │ Brought alongside (e.g. Cosign against artifact store) (P2) │ Native — same Cosign + Sigstore pipeline as container images (P3)
  • GitOps integration: Manual path updates (P1) │ Registry webhook triggers manifest bump; GitOps controller reconciles (P2) │ First-class — image update automation watches the OCI tag and commits the manifest update automatically (P3)

The four lifecycle states and their gates

Independent of which registry pattern you choose, models move through four lifecycle states, followed by a terminal archived state once a retired model's sunset clock expires. The implementation differs per pattern (aliases vs. OCI tags), but the state machine is the same:

Experimental

Any team member with registry write access can register a model version here. There is no criteria gate on entry — the purpose is to capture in-flight training runs, ablations, and sandbox experiments in a versioned, auditable location. Models in experimental are not served to production consumers. The gate on entry to staging is where curation begins.

Staging

Promotion from experimental to staging requires three conditions recorded in the registry before a platform engineer can authorize the transition: a named owning team, a linked training run with lineage intact, and a passing eval-suite result. Models in staging are reachable for shadow or canary traffic. Shadow traffic mirrors real requests to the candidate and compares its output against the production model without returning the candidate's responses to users; canary traffic routes a small, controlled share of live requests to the candidate.
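One simple way to score shadow traffic in staging is an agreement rate between production and candidate outputs on the same mirrored requests. The sketch below assumes discrete outputs that can be compared for equality; the function name and the sample labels are illustrative, and a real eval would likely use task-specific metrics instead.

```python
def shadow_agreement(production_outputs: list[str], candidate_outputs: list[str]) -> float:
    """Fraction of mirrored requests on which the candidate matches production."""
    if len(production_outputs) != len(candidate_outputs):
        raise ValueError("outputs must cover the same mirrored requests")
    if not production_outputs:
        raise ValueError("no mirrored requests to compare")
    matches = sum(p == c for p, c in zip(production_outputs, candidate_outputs))
    return matches / len(production_outputs)


# Hypothetical decisions from a risk classifier on four mirrored requests.
production = ["approve", "deny", "approve", "deny"]
candidate = ["approve", "deny", "deny", "deny"]
rate = shadow_agreement(production, candidate)
```

Disagreement is not automatically a failure, since the candidate may be correcting production errors; the rate is a signal for which requests deserve a closer look before promotion.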

Production

Promotion from staging to production adds a further gate: the shadow/canary traffic eval passed, a rollback plan is documented, and an on-call runbook for the new version exists. After promotion, the serving manifest in the GitOps repository is updated — either by a human PR or by an automated registry-update controller. Models in production serve live user traffic.

Retired

Transition to retired starts a sunset clock. The model remains servable for consumers that have not yet migrated. When the clock expires, the model moves to archived — the artifact is retained for audit and reproducibility but the serving infrastructure is not required to load it. Without a sunset clock and an enforced migration path, production registries accumulate deprecated models indefinitely.
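The sunset clock reduces to a date comparison plus a consumer check. The sketch below is a minimal version under assumed inputs: the 90-day window, the function names, and the consumer count are hypothetical, and in practice the count would come from serving telemetry.

```python
from datetime import date, timedelta


def sunset_date(retired_on: date, sunset_days: int) -> date:
    """Date on which a retired model becomes eligible for archival."""
    return retired_on + timedelta(days=sunset_days)


def may_archive(retired_on: date, sunset_days: int, active_consumers: int, today: date) -> bool:
    """Archive only when the clock has expired and no consumer still loads the model."""
    return today >= sunset_date(retired_on, sunset_days) and active_consumers == 0


retired_on = date(2025, 1, 1)
deadline = sunset_date(retired_on, 90)

too_early = may_archive(retired_on, 90, active_consumers=0, today=date(2025, 3, 1))
still_used = may_archive(retired_on, 90, active_consumers=2, today=date(2025, 5, 1))
ready = may_archive(retired_on, 90, active_consumers=0, today=date(2025, 5, 1))
```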

The gate summary

  • experimental → staging: named owner + lineage complete + eval pass. Authorized by platform engineer.
  • staging → production: shadow/canary eval pass + rollback plan + on-call runbook. Authorized by platform team lead.
  • production → retired: replacement in production + sunset clock set + migration path documented. Authorized by model owner.
  • retired → archived: sunset clock expired + zero production consumers confirmed. Automatic or platform-authorized.
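The gate summary above maps naturally onto a small transition table. The sketch below encodes it as data; the evidence key names are hypothetical labels for the conditions listed above, and a real registry would read the evidence from model-version tags rather than from a Python set.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    EXPERIMENTAL = "experimental"
    STAGING = "staging"
    PRODUCTION = "production"
    RETIRED = "retired"
    ARCHIVED = "archived"


@dataclass(frozen=True)
class Gate:
    required: frozenset  # evidence keys that must be recorded (hypothetical names)
    authorizer: str      # role that signs off on the transition


GATES = {
    (State.EXPERIMENTAL, State.STAGING): Gate(
        frozenset({"named_owner", "lineage_complete", "eval_pass"}),
        "platform engineer",
    ),
    (State.STAGING, State.PRODUCTION): Gate(
        frozenset({"canary_eval_pass", "rollback_plan", "oncall_runbook"}),
        "platform team lead",
    ),
    (State.PRODUCTION, State.RETIRED): Gate(
        frozenset({"replacement_in_production", "sunset_clock_set", "migration_path"}),
        "model owner",
    ),
    (State.RETIRED, State.ARCHIVED): Gate(
        frozenset({"sunset_expired", "zero_consumers"}),
        "automatic or platform",
    ),
}


def missing_evidence(current: State, target: State, evidence: set) -> set:
    """Return the evidence still missing; raise if the transition is not defined."""
    gate = GATES.get((current, target))
    if gate is None:
        raise ValueError(f"no transition {current.value} -> {target.value}")
    return set(gate.required) - set(evidence)


gaps = missing_evidence(State.EXPERIMENTAL, State.STAGING, {"named_owner", "eval_pass"})
```

Keeping the table as data means an undefined transition, such as experimental straight to production, fails loudly instead of being silently allowed.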

Curation policy as code

A lifecycle state machine is only as strong as the gates that enforce it. "Policy as code" means the gates are executable checks in CI, not checklists in a wiki.

A concrete example for the experimental → staging gate:

promotion-gate.yaml (conceptual)
# Promotion gate: experimental -> staging
# Runs as a step in the training pipeline CI job

steps:
  - name: Check lineage completeness
    # Fails if training_run_id, dataset_version, or code_commit
    # is absent from the candidate model version metadata
    run: |
      python check_lineage.py \
        --model-name "$MODEL_NAME" \
        --model-version "$MODEL_VERSION"

  - name: Run eval suite against candidate
    # Baseline is the model currently behind the champion alias
    run: |
      python run_eval.py \
        --model-name "$MODEL_NAME" \
        --model-version "$MODEL_VERSION" \
        --baseline-alias champion \
        --min-improvement 0.0

  - name: Verify artifact signature
    # cosign verify expects a full OCI reference, so ARTIFACT_DIGEST must hold
    # <registry>/<repository>@sha256:<digest>, not a bare digest
    run: |
      cosign verify --key cosign.pub \
        "$ARTIFACT_DIGEST"

  - name: Promote to staging
    # Only executes if all prior steps pass
    run: |
      python promote_model.py \
        --model-name "$MODEL_NAME" \
        --model-version "$MODEL_VERSION" \
        --target-alias staging

The invariant: no model can enter staging without complete lineage, a passing eval result, and a verified artifact signature. Each check writes evidence back to the registry as a tag, so the audit trail records not just that the model is in staging, but which CI run produced the evidence that justified the promotion.
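The lineage check and the evidence tags could look roughly like the sketch below. The three required keys come from the gate comments above; the tag naming scheme (gate.<check>.result, gate.<check>.ci_run) and the function names are assumptions for illustration, not an existing check_lineage.py.

```python
REQUIRED_LINEAGE_KEYS = ("training_run_id", "dataset_version", "code_commit")


def missing_lineage(metadata: dict[str, str]) -> list[str]:
    """Return the required lineage keys that are absent or blank."""
    return [key for key in REQUIRED_LINEAGE_KEYS if not metadata.get(key, "").strip()]


def gate_evidence(check: str, passed: bool, ci_run_id: str) -> dict[str, str]:
    """Build the tags a CI step would write back to the registry as evidence."""
    return {
        f"gate.{check}.result": "pass" if passed else "fail",
        f"gate.{check}.ci_run": ci_run_id,
    }


# Hypothetical metadata for a candidate version with one lineage field missing.
candidate = {"training_run_id": "run-8841", "dataset_version": "2025-03", "code_commit": ""}
gaps = missing_lineage(candidate)
tags = gate_evidence("lineage", passed=not gaps, ci_run_id="ci-20417")
```

In a CI step, a non-empty result from missing_lineage would translate to a non-zero exit code so that the promotion step never runs.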

The role split

Curation requires explicit role accountability. The most common failure mode is "no one is accountable" — a model enters production because everyone assumed someone else's check ran. The registry's access control encodes the role split:

  • Any data scientist or ML engineer can register to experimental.
  • An MLOps engineer (or an automated CI agent acting on their behalf) can promote to staging when gates pass.
  • A platform team lead can promote to production after the additional rollout gate passes.
  • A model owner (with platform-team co-authorization) can retire a model.

This role split is enforced by the registry's permission model — not by convention. An engineer who attempts to promote a model to production without the required role gets a permission denial, not a polite reminder.
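A minimal sketch of that permission check follows. The role identifiers and the mapping are hypothetical encodings of the list above; a real registry would enforce this in its own access-control layer rather than in client code.

```python
# Roles allowed to move a model INTO each state (hypothetical identifiers).
PROMOTE_ROLES = {
    "experimental": {"data_scientist", "ml_engineer"},
    "staging": {"mlops_engineer", "ci_agent"},
    "production": {"platform_team_lead"},
}

# Retirement needs both parties to sign off.
RETIRE_APPROVERS = {"model_owner", "platform_team"}


def authorize_promotion(role: str, target_state: str) -> None:
    """Raise PermissionError unless the role may promote into the target state."""
    allowed = PROMOTE_ROLES.get(target_state)
    if allowed is None:
        raise ValueError(f"unknown target state: {target_state}")
    if role not in allowed:
        raise PermissionError(f"{role} may not promote to {target_state}")


def authorize_retirement(approver_roles: set[str]) -> None:
    """Raise PermissionError unless owner and platform team have both approved."""
    missing = RETIRE_APPROVERS - approver_roles
    if missing:
        raise PermissionError(f"missing approvals: {sorted(missing)}")


def is_denied(role: str, target_state: str) -> bool:
    """Convenience wrapper that reports a denial instead of raising."""
    try:
        authorize_promotion(role, target_state)
    except PermissionError:
        return True
    return False


lead_denied = is_denied("platform_team_lead", "production")
engineer_denied = is_denied("mlops_engineer", "production")
```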

Registry-promotion as a GitOps trigger

The registry and the GitOps repository are two systems with a defined bridge. The registry is the authority on which model version is approved; the Git repository is the authority on what is deployed. The bridge is a controller that watches the registry for state transitions and commits manifest updates to Git when a promotion occurs.

For the OCI-artifact pattern, this bridge is a solved problem. Argo CD Image Updater monitors a container registry for new image tags or digests matching a version constraint and, when configured with its Git write-back method, commits the updated reference to the Git repository. Flux's image-automation-controller provides the equivalent for Flux-managed clusters: the image-reflector-controller scans the registry, and the automation controller patches the YAML and commits to Git.

For the lifecycle-aware registry pattern, the same bridge is built with a webhook. When a model is promoted to the production alias, the registry fires a webhook that triggers a CI job to update the serving manifest in Git. The GitOps controller then reconciles the change. A promotion event in the registry deterministically drives a deployment.

The serving manifest references the model by a fully-pinned version, not by a lifecycle alias. Every manifest change is a Git commit with a message that records the source event, providing a clean audit trail that links the deployed version to its registry promotion.
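For the webhook-driven bridge, the CI job's core work is a text rewrite plus a commit message. The sketch below assumes the manifest stores the reference as a models:/<name>/<version> URI; the manifest fields, the event identifier, and the message format are hypothetical.

```python
import re


def bump_model_reference(manifest: str, model_name: str, new_version: int) -> str:
    """Rewrite the pinned version for one model, leaving all other text untouched."""
    pattern = re.compile(rf"(models:/{re.escape(model_name)}/)\d+")
    # \g<1> keeps the prefix; the explicit group syntax avoids ambiguity with the digits.
    updated, count = pattern.subn(rf"\g<1>{new_version}", manifest)
    if count == 0:
        raise ValueError(f"no pinned reference to {model_name} found")
    return updated


def commit_message(model_name: str, new_version: int, event_id: str) -> str:
    """Record the source event so the Git history links back to the promotion."""
    return (
        f"Pin {model_name} to version {new_version}\n\n"
        f"Source: registry promotion event {event_id}"
    )


# Hypothetical serving manifest fragment.
manifest = """\
predictor:
  model:
    storageUri: models:/risk-classifier/4
"""

updated_manifest = bump_model_reference(manifest, "risk-classifier", 5)
message = commit_message("risk-classifier", 5, "evt-123")
```

Raising when no reference is found keeps the job from committing a no-op and hiding a manifest that was never pinned in the first place.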

Consuming curated models

Three consumer roles interact with the registry in distinct ways:

Data scientist or ML engineer in local development

Pin to a specific version, not a lifecycle alias — an alias shifts when someone promotes a new model mid-notebook, breaking reproducibility. Cache the artifact locally; configure the cache root once and document it in the team runbook. Authenticate via SSO or a scoped personal token rather than a shared secret.

CI pipeline

Pull a pinned version. Cache between runs. Verify the signature before using the artifact — CI is the right place to catch a tampered artifact, not the admission webhook at deploy time. A signature failure in CI surfaces hours earlier than one at rollout. The eval result written back to the registry as a tag by the CI job is the evidence the promotion gate requires.

Serving runtime at deployment

The serving manifest carries a fully-pinned model reference. After a rollout, assert the runtime's /info or /version endpoint in a smoke test to confirm the served version matches the manifest. Silent deploys — where the GitOps controller appeared to succeed but the node is serving a stale cached model — are real failure modes that this check catches before users do.
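A post-rollout smoke test along these lines could be sketched as follows. The model_version field in the response is an assumption about what the runtime's /info endpoint returns, and the fetch functions are stand-ins for an HTTP GET so the example runs offline.

```python
import json
from typing import Callable


def served_version_matches(manifest_ref: str, fetch_info: Callable[[], str]) -> bool:
    """Compare the version pinned in the manifest with what the runtime reports."""
    expected = manifest_ref.rstrip("/").rsplit("/", 1)[-1]
    reported = json.loads(fetch_info()).get("model_version")
    return str(reported) == expected


# Stand-ins for HTTP calls; a real check would GET the runtime's /info endpoint.
def fresh_runtime() -> str:
    return json.dumps({"model_version": "4"})


def stale_runtime() -> str:
    return json.dumps({"model_version": "3"})


ok = served_version_matches("models:/risk-classifier/4", fresh_runtime)
stale = served_version_matches("models:/risk-classifier/4", stale_runtime)
```

Passing the fetcher in as a callable keeps the comparison logic testable without a live endpoint; the rollout pipeline would supply a function that performs the real request.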

Common pitfalls

  • Running two registries in parallel. Experiment tracking in one tool, production artifacts in another. Sources of truth diverge and teams lose track of which version is serving.
  • Baking the model into the container image. The runtime image and the model artifact have different lifecycles. Coupling them means every model quality update is a container rebuild and every runtime security patch triggers a full model re-deploy cycle.
  • Lifecycle aliases in production manifests. A manifest referencing @champion or the tag latest silently picks up the next promoted version without a manifest change. GitOps history becomes unreliable. Use fully-pinned version references; let automation write the pin.
  • Eval-suite gaming. The eval suite that gates promotion becomes the eval suite teams optimize for. Rotate the eval benchmark periodically and hold back a portion of the test set from training.
  • No deprecation discipline. Without a sunset clock and enforced migration path, the registry becomes a museum and the serving infrastructure carries dead weight.
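The alias pitfall in particular lends itself to a CI lint. The sketch below flags manifest lines that use an MLflow-style alias URI (models:/<name>@<alias>) or a latest tag; the manifest snippets and field names are hypothetical, and the patterns would need extending for other floating tag conventions.

```python
import re

FLOATING_PATTERNS = (
    re.compile(r"models:/[\w.-]+@[\w.-]+"),  # alias URI instead of a pinned version
    re.compile(r":latest\b"),                # floating OCI tag
)


def floating_references(manifest: str) -> list[str]:
    """Return a description of every line that uses a floating model reference."""
    hits = []
    for line_no, line in enumerate(manifest.splitlines(), start=1):
        if any(pattern.search(line) for pattern in FLOATING_PATTERNS):
            hits.append(f"line {line_no}: {line.strip()}")
    return hits


pinned_manifest = "storageUri: models:/risk-classifier/4"
floating_manifest = (
    "storageUri: models:/risk-classifier@champion\n"
    "image: registry.example.com/models/risk-classifier:latest"
)

clean = floating_references(pinned_manifest)
flagged = floating_references(floating_manifest)
```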

References

  1. Linux Foundation. "The MLflow Project Joins Linux Foundation." Press release, June 2020.
  2. MLflow Project. "ML Model Registry." MLflow documentation, mlflow.org.
  3. MLflow GitHub. "RFC: deprecating model registry stages." Issue #10336, mlflow/mlflow.
  4. Open Container Initiative. "OCI Image and Distribution Specs v1.1 Releases." OCI blog, March 2024.
  5. ORAS Project. "OCI Registry As Storage." oras.land.
  6. KAITO Project. "Model As OCI Artifacts." kaito-project.github.io.
  7. Kubernetes Blog. "Kubernetes 1.31: Read Only Volumes Based On OCI Artifacts (alpha)." August 2024.
  8. Argo CD Image Updater. Official documentation. argocd-image-updater.readthedocs.io.
  9. Flux CD. "Automate image updates to Git." fluxcd.io.

About the Author

asleekgeek

Senior Developer, Architect, DevOps

Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist
