Golden paths for ML — paved-road templates that survive contact with users

9 min read · asleekgeek
A golden path is paved, not mandated — teams can leave it, but they own what they build instead.

Platform teams are in the business of removing decisions. Every time a data scientist has to figure out which container registry to push to, which experiment-tracking endpoint to configure, or which serving framework to use, they are spending cognitive budget on infrastructure rather than on the model. The golden path pattern — first articulated publicly by Spotify's engineering team and formalised in Skelton and Pais's team interaction model — addresses this directly: define a paved, opinionated workflow for the common cases, pre-wire the integrations, and let product teams walk the path without understanding what is under it.

Spotify's 2020 engineering blog post "How We Use Golden Paths to Solve Fragmentation in Our Software Ecosystem" describes the paved-road metaphor precisely: a path is not a mandate. A team that needs to diverge can do so, but they leave the path and take on the maintenance burden of whatever they build instead. The CNCF TAG App Delivery Platforms Whitepaper (2023) formalises this at the industry level, describing the platform's job as offering "a bundle often described as a golden path" accompanied by an initial project template and documentation. Both framings share a key discipline: a golden path is only golden if it is kept up to date. A stale path is worse than no path — it channels teams into known-bad configurations.

This article defines the three canonical golden paths for an ML platform, the mechanism used to stamp them out as templates, the governance gates wired into each path, and — critically — the deprecation contract that keeps a path trustworthy over time.

What makes a path "golden"

Skelton and Pais's Team Topologies (IT Revolution Press, 2019) frames the platform team as a stream-aligned team's internal supplier. The platform team's primary output is not running services — it is reducing the cognitive load on product teams. The paved road is the primary mechanism: an opinionated, tested, integrated path that a product team can follow without needing to understand the platform in depth.

Three properties distinguish a golden path from a mere tutorial:

  • It is scaffolded, not described. A team starting the path runs one command (or clicks one button in an internal developer portal) and receives a working repository skeleton, pre-wired CI, and pre-configured integrations. Documentation exists, but the path does not require the team to read it before getting started.
  • It enforces a gate. At some point in the path — typically at promotion time or deployment time — an automated gate runs checks (eval scores, model card completeness, latency regressions, security scans). A path without a gate is a convenience; a path with a gate is a quality mechanism.
  • It is versioned and deprecatable. Platform teams evolve their stack. A path is a contract with its consumers. Consumers deserve a defined notice window — typically measured in weeks to months — and a migration script when a path is deprecated. Without this contract, teams fear using the path at all.

The three canonical ML golden paths

Three paths address the bulk of ML workloads on a modern AI platform. Each is described by its trigger (what causes a team to walk this path), its key inputs and outputs, and the gate it enforces.

Path 1 — Batch inference pipeline

Trigger: an ML team has a trained model that produces predictions on a schedule — nightly fraud scores, weekly recommendations, monthly risk ratings — rather than in real time.

Input: a trained model artefact promoted to the model registry's staging stage, and a data source reference (a feature store view, a data-lake partition, or a streaming-snapshot export).

Output: predictions written to an output store (object storage, a database table, or a downstream event stream), with row-count and schema assertions confirming the run succeeded.

Gate: an output-validation step inside the pipeline — schema check, row-count assertion, and a lightweight quality metric check — that blocks promotion of the batch job to the model registry's production stage if any assertion fails. A GitOps controller (e.g. Argo CD or Flux) then detects the production-stage promotion and syncs the scheduled-job manifest to the cluster. Downstream systems see only production-stage outputs.
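
The output-validation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual validator: the column names, the 1% row-count tolerance, and the score range are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical expected schema: column name -> Python type.
EXPECTED_SCHEMA = {"account_id": str, "fraud_score": float}


@dataclass
class ValidationResult:
    passed: bool
    failures: list


def validate_batch_output(rows, expected_rows, tolerance=0.01,
                          score_range=(0.0, 1.0)):
    """Run schema, row-count, and a simple range check on prediction rows."""
    failures = []

    # Schema check: report the first row whose columns or types are wrong.
    for i, row in enumerate(rows):
        if set(row) != set(EXPECTED_SCHEMA):
            failures.append(f"row {i}: columns {sorted(row)} do not match schema")
            break
        bad = [c for c, t in EXPECTED_SCHEMA.items() if not isinstance(row[c], t)]
        if bad:
            failures.append(f"row {i}: wrong type for {bad}")
            break

    # Row-count assertion: output size within tolerance of the expected size.
    if abs(len(rows) - expected_rows) > tolerance * expected_rows:
        failures.append(f"row count {len(rows)} outside tolerance of {expected_rows}")

    # Lightweight quality check: scores must fall inside the allowed range.
    lo, hi = score_range
    out_of_range = sum(
        1 for r in rows
        if isinstance(r.get("fraud_score"), float) and not lo <= r["fraud_score"] <= hi
    )
    if out_of_range:
        failures.append(f"{out_of_range} scores outside [{lo}, {hi}]")

    return ValidationResult(passed=not failures, failures=failures)
```

A pipeline step would call `validate_batch_output` on the written predictions and refuse to request the production-stage promotion whenever `passed` is false.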

The pipeline definition lives in a scaffolded Git repository. The scaffold (produced by an internal developer portal template or a Backstage Software Template, as described in the Backstage documentation) pre-wires the experiment tracker, the model registry credential, the output-store path convention, and the CI pipeline that validates the pipeline definition itself before it runs in production.

Path 2 — Model serving: real-time inference

Trigger: an ML team has a model that must produce predictions at request time — fraud detection on a payment, ranking on a search query, content moderation on a submitted post.

Input: a model artefact in the registry, plus a serving manifest (an InferenceService definition for a serving runtime such as KServe, BentoML, or Seldon Core) authored by the ML engineer and committed to a deployment Git repository.

Output: a stable, versioned prediction endpoint consumed by application engineers. The endpoint URI does not change across model revisions — only the model revision behind it changes.

Gate: a CI gate on the deployment repository PR that runs three checks: (1) model-card completeness — the model card must document intended use, training data provenance, and known limitations; (2) eval-score threshold — the model's offline evaluation score must exceed the team's configured minimum; (3) latency regression test — shadow inference against a canary endpoint must show P95 latency within the configured tolerance of the current production model.
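
As a rough sketch, the three checks could be combined into one gate function like the one below. The model-card field names, the 10% latency tolerance, and the nearest-rank percentile method are assumptions made for the example; a real gate would read these from the team's configuration.

```python
# Hypothetical model-card fields, mirroring the three required sections.
REQUIRED_CARD_SECTIONS = (
    "intended_use",
    "training_data_provenance",
    "known_limitations",
)


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def deployment_gate(model_card, eval_score, min_score,
                    candidate_ms, production_ms, tolerance=0.10):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []

    # (1) Model-card completeness: every required section must be non-empty.
    missing = [s for s in REQUIRED_CARD_SECTIONS if not model_card.get(s, "").strip()]
    if missing:
        failures.append(f"model card missing sections: {missing}")

    # (2) Eval-score threshold: fail when the score is below the minimum.
    if eval_score < min_score:
        failures.append(f"eval score {eval_score} below minimum {min_score}")

    # (3) Latency regression: candidate P95 must stay within tolerance of production.
    cand, prod = p95(candidate_ms), p95(production_ms)
    if cand > prod * (1 + tolerance):
        failures.append(f"P95 latency {cand} ms exceeds {prod} ms by more than {tolerance:.0%}")

    return failures
```

Returning every failure rather than stopping at the first one lets the PR comment show the ML engineer everything that needs fixing in a single CI run.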

After the gate passes and the PR is merged, the GitOps controller syncs the InferenceService manifest. Traffic is initially split — for example, 5% to the new revision, 95% to the previous. A progressive-delivery controller (e.g. Argo Rollouts or Flagger) watches prediction latency, error rate, and prediction-quality metrics. If metrics stay within bounds across a configurable observation window, traffic advances to 100% for the new revision. If metrics breach bounds, the rollout is automatically aborted and the previous revision retakes full traffic. The Argo Rollouts project documents the AnalysisRun and Rollout resource types that implement this pattern.
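
The progression and abort rule above can be sketched in plain Python. This simulates the decision logic only and is not the Argo Rollouts or Flagger API; the `observe` callback, the step weights after the initial 5%, and the thresholds are hypothetical.

```python
def run_canary(steps, observe, max_p95_ms, max_error_rate):
    """Advance traffic through `steps`; abort on the first breached bound.

    `observe(weight)` is a hypothetical callback that returns a dict with
    `p95_ms` and `error_rate` measured over one observation window while the
    new revision receives `weight` percent of traffic.
    """
    history = []
    for weight in steps:
        metrics = observe(weight)
        history.append((weight, metrics))
        if metrics["p95_ms"] > max_p95_ms or metrics["error_rate"] > max_error_rate:
            # Abort: the previous revision retakes full traffic.
            return {"status": "aborted", "new_revision_weight": 0, "history": history}
    # Every window stayed within bounds: the new revision takes all traffic.
    return {"status": "promoted", "new_revision_weight": 100, "history": history}
```

For example, `run_canary([5, 25, 50, 100], observe, max_p95_ms=50, max_error_rate=0.01)` walks four windows and stops at the first one whose metrics breach either bound.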

Path 3 — GenAI feature with a vector index

Trigger: an application team wants to add retrieval-augmented generation (RAG) to a product — a search surface, a Q&A interface, a document assistant.

Input: a document corpus with a defined data access credential (scoped read-only), and a choice of embedding endpoint from the platform's model catalogue.

Output: a running RAG feature backed by a scheduled indexing pipeline and a vector store query endpoint (e.g. pgvector, Qdrant, Weaviate, or Milvus). The LLM inference endpoint is provided by the platform — either self-hosted or a proxied external API — so the application team does not manage model serving directly.

Gate: an offline evaluation harness that runs on the indexing pipeline's output, measuring retrieval recall on a ground-truth question-answer set, plus an online evaluation surface (explicit feedback signals captured in the application layer). The offline recall gate blocks promotion to production if recall falls below the configured threshold; the online signal feeds a monitoring dashboard rather than blocking deployment, since production traffic is the only source of the real query distribution.
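
One way to express the offline recall gate is recall@k averaged over the ground-truth set, as in the sketch below. The choice of recall@k, the default `k`, the 0.8 threshold, and the `search` callable standing in for the vector store query are assumptions for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant document IDs found in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each ground-truth question needs at least one relevant document")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(set(relevant))


def recall_gate(ground_truth, search, k=5, threshold=0.8):
    """Average recall@k over the ground-truth set, compared to a threshold.

    ground_truth: question -> list of relevant document IDs
    search:       hypothetical callable, question -> ranked list of document IDs
    """
    scores = [recall_at_k(search(q), docs, k) for q, docs in ground_truth.items()]
    mean_recall = sum(scores) / len(scores)
    return mean_recall >= threshold, mean_recall
```

The function returns both the pass/fail decision and the measured recall so the CI log can show how far a failing index is from the threshold.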

The templating mechanism

A golden path is not a document; it is an executable template, the "initial project template and documentation" bundle that the CNCF Platforms Whitepaper describes. Backstage Software Templates (API version scaffolder.backstage.io/v1beta3) are one widely adopted mechanism: a YAML Template document with a spec.parameters section (the inputs the user provides, such as project name, team, and data source reference) and a spec.steps section (the actions the scaffolder runs: fetch a skeleton and render its files, publish a repository, and register the new component in the catalog).

scaffolder-template-batch-model.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: batch-model-pipeline
  title: Batch Model Pipeline
  description: Golden path for scheduling batch inference jobs
spec:
  owner: platform-team
  type: ml-pipeline

  parameters:
    - title: Project details
      required: [modelName, teamSlug, outputStorePrefix]
      properties:
        modelName:
          type: string
          description: Name of the model (must match registry slug)
        teamSlug:
          type: string
          description: Your team identifier for RBAC and labelling
        outputStorePrefix:
          type: string
          description: Object-store prefix for batch output (e.g. s3://data/predictions/)

  steps:
    - id: fetch-skeleton
      name: Fetch pipeline skeleton
      action: fetch:template
      input:
        url: ./skeleton
        values:
          modelName: ${{ parameters.modelName }}
          teamSlug: ${{ parameters.teamSlug }}
          outputStorePrefix: ${{ parameters.outputStorePrefix }}

    - id: publish
      name: Create Git repository
      action: publish:github
      input:
        repoUrl: github.com?owner=${{ parameters.teamSlug }}&repo=${{ parameters.modelName }}-pipeline

    - id: register
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}

For teams not running an internal developer portal with a scaffolding engine, the deployment side of a path can still be stamped out with Argo CD ApplicationSets using the Cluster generator: a single ApplicationSet template is parameterised from the clusters registered with Argo CD (stored as cluster Secrets), producing one Application per cluster (or per environment) without manual duplication. The Argo CD documentation covers the Cluster generator alongside the other generators. Kustomize base-plus-overlay provides the per-environment patch layer in both cases: a base directory holds the canonical manifest, and an overlay directory for each environment (dev, staging, production) holds only the values that differ. In the ApplicationSet manifest below, the overlay path is chosen from an env label on each cluster Secret.
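
The base-plus-overlay idea can be illustrated with a recursive dictionary merge. This is only a conceptual sketch: Kustomize itself applies strategic-merge and JSON patches with richer semantics for lists, and the field names and values below are hypothetical.

```python
import copy


def apply_overlay(base, overlay):
    """Return a copy of `base` with `overlay` values merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested mapping: merge key by key instead of replacing it whole.
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base settings shared by every environment.
BASE = {
    "schedule": "0 2 * * *",
    "resources": {"cpu": "1", "memory": "2Gi"},
    "outputPrefix": "s3://data/predictions/",
}

# Each overlay holds only the values that differ from the base.
OVERLAYS = {
    "dev": {"schedule": "0 6 * * 1"},
    "production": {"resources": {"memory": "8Gi"}},
}

rendered = {env: apply_overlay(BASE, patch) for env, patch in OVERLAYS.items()}
```

The production overlay changes only the memory value, so the rendered production settings keep the base schedule, CPU, and output prefix.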

applicationset-batch-model.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: batch-model-pipeline
  namespace: argocd
spec:
  generators:
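    - clusters:
        # Only clusters whose Secret carries an env label; a bare `clusters: {}`
        # also targets the local cluster, which has no labels to template from.
        selector:
          matchExpressions:
            - key: env
              operator: Exists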
  template:
    metadata:
      name: '{{name}}-batch-model'
    spec:
      project: ml-workloads
      source:
        repoURL: https://git.example.com/platform/batch-model-base
        targetRevision: HEAD
        path: overlays/{{metadata.labels.env}}
      destination:
        server: '{{server}}'
        namespace: ml-inference
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

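The choice of mechanism (portal scaffold, ApplicationSet, or Kustomize overlay) depends on what the platform already operates. The key discipline is the same in all cases: the template is the canonical source of truth for the path. A team that modifies the generated skeleton is drifting from the path, and the platform team's tooling should detect that drift and surface it to both teams. GitOps sync-status checks cover one part of this, namely divergence between what is committed in Git and what is running in the cluster; divergence of a repository from the template it was generated from needs a separate comparison against the template version.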

Governance gates wired into the path

A golden path is valuable because it enforces good defaults automatically. The gates that matter most for ML workloads sit at three points:

  1. Registry promotion gate. Before a model artefact moves from staging to production in the model registry, it must pass automated checks: minimum eval score, model card completeness, and (for regulated industries) an explicit reviewer signoff. The model registry's webhook or event integration triggers the CI gate; the gate's pass/fail result is written back to the registry as a metadata annotation. This makes the gate auditable — any downstream system can query whether a given model version passed all gates.
  2. Deployment PR gate. When an ML engineer opens a PR against the deployment repository, the CI pipeline runs the model-card check, eval-score threshold, and latency regression test (Path 2) or recall-on-ground-truth check (Path 3). This gate runs in the CI system — not in the cluster — so it fails fast, before the GitOps controller ever sees the manifest.
  3. Runtime rollout gate. After deployment, the progressive-delivery controller observes live metrics. For serving models (Path 2), this means request latency and error rate from the serving layer's metrics endpoint, plus any model-quality signal the application emits. For batch models (Path 1), this means the output-validation step in the pipeline itself. The rollout gate is the safety net for the cases the CI gate did not catch — distribution shift detected only under real traffic, latency regression that appears only at production request volumes.
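
The write-back that makes the registry promotion gate auditable can be sketched as follows. The annotation key format (`gate/<name>`), the JSON payload, and the list of required gates are assumptions for the example; a real registry would expose its own metadata or tagging API for this.

```python
import json
from datetime import datetime, timezone

# Hypothetical gate names, mirroring the checks listed for registry promotion.
REQUIRED_GATES = ("eval_score", "model_card", "reviewer_signoff")


def record_gate_result(annotations, gate, passed, detail=""):
    """Return a new annotations dict with one gate result added."""
    entry = {
        "passed": bool(passed),
        "detail": detail,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }
    updated = dict(annotations)
    updated[f"gate/{gate}"] = json.dumps(entry)
    return updated


def passed_all_gates(annotations, required=REQUIRED_GATES):
    """True only if every required gate is recorded and marked as passed."""
    for gate in required:
        raw = annotations.get(f"gate/{gate}")
        if raw is None or not json.loads(raw)["passed"]:
            return False
    return True
```

A downstream system then only needs `passed_all_gates` on a model version's annotations. A gate that was never recorded counts as a failure, which keeps a skipped check from passing silently.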

The deprecation contract

A golden path that cannot be deprecated safely becomes technical debt. Platform teams that skip the deprecation contract find themselves maintaining old path versions indefinitely — because consumers are stuck on them, because no migration tooling was provided, because the notice window was too short. The pattern for a trustworthy deprecation contract has four steps:

  • Announce with a defined notice window. Consumers of the path get a notice period — typically measured in weeks to months — before the old path is removed. No standard mandates a specific number; the appropriate window depends on the consumer's release cadence and the complexity of the migration.
  • Provide a migration script or automated PR. The platform team does not announce a deprecation and leave consumers to figure out the migration themselves. The scaffolding system opens automated PRs against consumer repositories: replacing old template references with the new version, updating dependency pins, adjusting CI configuration. Backstage's scaffolder ships a publish:github:pull-request action that can open such a PR; fanning it out across every consumer repository typically needs some additional automation around it.
  • Track adoption. The platform team maintains an inventory of which repositories are on which path version — sourced from the IDP catalog or from Git metadata. Deprecation is not complete until every consumer has migrated or has been deliberately granted an extension.
  • Remove on schedule. The old path version is removed at the end of the notice window. Exceptions are tracked explicitly and have an expiry date. An exception that has no expiry date is a permanent fork — the condition that the deprecation contract exists to prevent.
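
The adoption-tracking and exception rules above can be sketched as a small report function. The inventory shape, the version labels, and the category names are assumptions for illustration; in practice the inventory would come from the IDP catalog or Git metadata.

```python
from datetime import date


def deprecation_status(inventory, deprecated_version, removal_date, exceptions, today):
    """Classify repositories relative to a deprecated path version.

    inventory:  repo name -> path version currently in use
    exceptions: repo name -> expiry date of a granted extension (or None)
    """
    report = {"migrated": [], "pending": [], "excepted": [], "blocking": []}
    for repo, version in sorted(inventory.items()):
        if version != deprecated_version:
            report["migrated"].append(repo)
        elif repo in exceptions:
            expiry = exceptions[repo]
            # An exception with no expiry date, or one that has lapsed,
            # is treated as a permanent fork and flagged.
            if expiry is None or expiry < today:
                report["blocking"].append(repo)
            else:
                report["excepted"].append(repo)
        elif today < removal_date:
            report["pending"].append(repo)
        else:
            report["blocking"].append(repo)
    # Complete once every consumer has migrated or holds a valid extension.
    report["complete"] = not report["pending"] and not report["blocking"]
    return report
```

Running this on a schedule gives the platform team a list of repositories to chase before the removal date and makes open-ended exceptions visible instead of silent.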

The deprecation contract is also the primary argument for investing in golden paths at all. A team that is not confident the platform will maintain its golden paths will build their own infrastructure — defeating the consolidation goal. Trust in the path's stability is a prerequisite for adoption.

The off-ramp and when to take it

Golden paths address the majority of workloads, not all of them. Teams encounter off-ramps when their requirements exceed the path's design envelope:

  • Path 1 off-ramps include multi-node distributed training jobs, non-standard output destinations, and pipeline dependencies on systems the platform does not yet integrate with.
  • Path 2 off-ramps include streaming inference (event-triggered prediction), multi-model ensembles, and custom pre/post-processing pipelines that do not fit the serving runtime's transformer abstraction.
  • Path 3 off-ramps include hybrid search (keyword plus semantic), custom re-ranking pipelines, and multi-turn agent loops with tool use — which extend beyond simple RAG into agentic infrastructure.

The discipline at the off-ramp matters more than the path itself. When a team hits an off-ramp, the platform team has three options: extend the path (add the capability to the template), document the divergence pattern (add it to an extension catalogue), or accept the team building independently (with explicit acknowledgement that they own the maintenance). Which option applies depends on how many teams share the need. A one-team requirement is a candidate for independent build; a requirement shared by three or more teams is a candidate for path extension.
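
The rule of thumb above can be written down as a tiny decision function. The article gives the one-team and three-or-more-team cases; mapping the in-between case to the documented divergence pattern is an assumption made here, as are the return labels.

```python
def offramp_decision(teams_with_need, extension_threshold=3):
    """Map the number of teams sharing a need to one of the three options."""
    if teams_with_need < 1:
        raise ValueError("at least one team must have the requirement")
    if teams_with_need >= extension_threshold:
        return "extend-path"
    if teams_with_need == 1:
        return "independent-build"
    # Assumption: needs shared by more than one team but below the
    # threshold go to the extension catalogue as a documented divergence.
    return "document-divergence"
```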

Connecting the three paths to the broader platform

The three paths are built on top of platform capabilities described elsewhere in this series. The toolchain that makes path scaffolding possible — experiment trackers, model registries, serving runtimes, vector stores — is covered in the composable AI toolchain article. The GitOps machinery that makes the deployment step in Paths 1 and 2 work — the controller, the manifest conventions, the sync policies — is covered in the CI/CD and GitOps article. A golden path is not a platform feature in isolation — it is the orchestrated composition of several platform capabilities into an end-to-end workflow a product team can actually use.

The discoverability of the paths is equally important. A golden path that is not surfaced in the internal developer portal is a golden path that most teams will not find. The IDP catalog — whether Backstage-based or another portal — should surface the available templates, the version each team is on, and the status of any active deprecations. Discoverability is not a UX concern; it is a platform adoption concern.

References

  1. Spotify Engineering. "How We Use Golden Paths to Solve Fragmentation in Our Software Ecosystem." 2020. engineering.atspotify.com
  2. Skelton, M. & Pais, M. Team Topologies: Organizing Business and Technology Teams for Fast Flow. IT Revolution Press, 2019. teamtopologies.com
  3. CNCF TAG App Delivery. "Platforms Whitepaper." 2023. tag-app-delivery.cncf.io
  4. Backstage.io. "Writing Software Templates" (scaffolder.backstage.io/v1beta3). Backstage documentation. backstage.io/docs
  5. Argo CD project. "ApplicationSet Cluster Generator." Argo CD documentation. argo-cd.readthedocs.io
