The six roles on an AI platform — what each does and what each is fluent in

12 min read · asleekgeek
[Image: six interconnected role cards on an AI platform team, showing Platform Engineer, MLOps Engineer, ML Engineer, Data Scientist, DevOps/SRE, and Data Engineer]

Six roles, distinct optimisation functions — clarity here is what makes hiring tractable.

Role titles in AI and ML infrastructure are not standardised. The same responsibilities appear under “AI Platform Engineer”, “ML Infra Engineer”, “AI Reliability Engineer”, and half a dozen other labels depending on the company. LinkedIn’s 2026 Jobs on the Rise report found that four of the five fastest-growing roles in the U.S. were AI-related, and that AI/ML job postings surged 163% from 2024 to 2025, reaching roughly 49,200 open positions in the U.S. alone. That volume of demand, met with non-standardised titling, produces systematic mis-hiring.

This article maps the six roles that appear on a working AI platform team. Four are core: Platform Engineer (AI Platform), MLOps Engineer, ML Engineer, and Data Scientist. Two are habitually conflated with the core four: DevOps/SRE and Data Engineer. For each role this article gives the job-to-be-done, the day-one and senior skill bar, the tools expected at each level, and — the most useful part — the anti-patterns that appear when an organisation hires for one role and uses the person as another. The previous article in this series covered what an AI Platform team owns at the team level; this one goes one level down to individual role definitions.

Why role clarity matters more than title uniformity

The problem is not that companies use different titles — it is that they conflate distinct optimisation functions. A DevOps engineer asked to “also handle MLOps” is being asked to operate a model lifecycle on top of a general infrastructure role, which means one of the two halves rots. An ML engineer hired to run the cluster is being asked to spend cognitive budget on GPU driver versions and network policy instead of on model architecture. The skill overlap is real; the optimisation function is not the same.

The SFIA 9 framework (Skills Framework for the Information Age, published by BCS and the SFIA Foundation) is the most widely used international standard for mapping digital skills to responsibility levels. It places the Machine Learning skill (code MLNG) on SFIA’s seven-level responsibility scale, from assisted data preparation at level 2 to setting the organisation’s strategic ML direction at level 7, and it explicitly separates the “building and training models” competency from the “operationalising ML pipelines” competency. That separation is the clearest industry-framework signal that MLOps and ML engineering are distinct practices, not two names for the same thing. The LinkedIn Jobs on the Rise 2026 report similarly treats MLOps engineers and AI infrastructure engineers as separate categories in its fastest-growing roles list, and the CNCF Platforms White Paper draws a comparable line between platform capability providers and platform users.

Skills matter more than titles. The six sections below use industry-generic role labels. Your org chart may say something different; what matters is whether the optimisation function described here matches the person you hired.

Role 1: Platform Engineer (AI Platform)

Job-to-be-done

Owns the Kubernetes-and-GPU substrate that the rest of the ML stack runs on. Stands up and maintains the GPU operator, the job scheduler, the model registry deployment, the serving runtime, and the observability that watches all of it. Measure of success: paved-road adoption and substrate reliability.

Skill bar

  • Day one: Senior Kubernetes admin level. Comfortable with Helm, kubectl, RBAC, and network policy. Has run a real production cluster, not just a local development environment.
  • Month three: Has shipped a GPU job-queueing setup on the team’s GPU pool. Owns the NVIDIA GPU Operator deployment. Has a working model registry deployed with consumers connecting from at least one cluster.
  • Senior: Has designed the multi-tenant fairness story across competing teams sharing the same GPU pool. Has rolled out a CNI-level change without downtime. Can debug NCCL on a multi-node training job from packet capture.
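
One way to reason about the multi-tenant fairness story is max-min fair sharing: small requests are satisfied in full and the remaining GPUs are split evenly among the teams that still want more. The sketch below is an illustration under that assumption only. It is not how Kueue, Volcano, or any other scheduler implements fairness, and the function name and team names are hypothetical.

```python
def max_min_fair_share(capacity: int, demands: dict[str, int]) -> dict[str, int]:
    """Split `capacity` whole GPUs across teams using progressive filling."""
    allocation = {team: 0 for team in demands}
    remaining = capacity
    unsatisfied = {team for team, demand in demands.items() if demand > 0}

    while remaining > 0 and unsatisfied:
        share = remaining // len(unsatisfied)
        if share == 0:
            # Fewer GPUs left than waiting teams: give one each, in name order.
            for team in sorted(unsatisfied)[:remaining]:
                allocation[team] += 1
            break
        for team in sorted(unsatisfied):
            # Never grant more than the team actually asked for.
            grant = min(share, demands[team] - allocation[team])
            allocation[team] += grant
            remaining -= grant
        unsatisfied = {t for t in unsatisfied if allocation[t] < demands[t]}

    return allocation


if __name__ == "__main__":
    # Hypothetical teams competing for a 16-GPU pool.
    print(max_min_fair_share(16, {"search": 2, "ads": 8, "research": 20}))
```

In this example the small request is met in full and the two larger teams split what is left evenly. A real design would also have to cover preemption, borrowing, and priorities, which this sketch ignores.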

Core skills

  • Kubernetes, deep — operators, CRDs, custom controllers
  • GPU resource model — NVIDIA device plugin, MIG, MPS, Dynamic Resource Allocation (DRA)
  • Helm and Kustomize for release packaging
  • Kubernetes networking — CNI, network policy, eBPF-based network observability
  • GitOps (e.g. Argo CD, Flux CD) — reconciliation model, ApplicationSet, drift detection
  • Linux performance tuning for GPU workloads — NUMA topology, PCIe bandwidth, GPU-direct RDMA

Tool fluency expected

NVIDIA GPU Operator; a gang-scheduling solution (e.g. Volcano) for multi-node training; a cluster-level queue and quota system (e.g. Kueue); a GitOps controller (e.g. Argo CD, Flux CD); Helm; an eBPF-based CNI (e.g. Cilium). Awareness of serving runtimes and registry tooling as substrates they make available to the rest of the org.

Anti-patterns

  • Hiring an AI Platform Engineer to write training code. They will leave or produce poor model code. Training code is an ML Engineer’s responsibility.
  • Hiring an AI Platform Engineer who has not operated a multi-node GPU cluster. GPU concerns (NCCL, topology, MIG partition planning, the driver version matrix) are not the same as CPU concerns. A generalist platform engineer who has never had to close this gap leaves the GPU stack fragile.

Role 2: MLOps Engineer

Job-to-be-done

Owns the lifecycle of production models on the platform — the training pipelines, the model registry contract, the deployment promotion gates, the retraining triggers, and the rollback path. Measure of success: model-system reliability — uptime, prediction quality, retraining cadence.

Skill bar

  • Day one: Has shipped at least one production ML pipeline. Comfortable with Python, Docker, a workflow engine, and a model registry.
  • Month three: Has built or significantly extended a training pipeline on the team’s cluster. Owns the registry contract (versioning, lifecycle states) and has promoted at least one model end-to-end via GitOps.
  • Senior: Has built the eval-in-the-loop rollout gate. Owns the drift-detection story end-to-end. Has rolled back a bad model without taking the serving layer down.
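
An eval-in-the-loop rollout gate can be as small as a function that compares a candidate's offline metrics against the current production baseline and blocks promotion on regression. The sketch below assumes every metric is "higher is better"; the metric names and tolerances are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    promote: bool
    reasons: list[str]


def rollout_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_regression: dict[str, float],
) -> GateResult:
    """Block promotion if any gated metric is missing or drops beyond tolerance."""
    reasons: list[str] = []
    for metric, tolerance in max_regression.items():
        if metric not in baseline or metric not in candidate:
            # A missing metric is treated as a failure, not silently skipped.
            reasons.append(f"{metric}: missing from evaluation report")
            continue
        drop = baseline[metric] - candidate[metric]
        if drop > tolerance:
            reasons.append(f"{metric}: dropped {drop:.4f}, tolerance {tolerance:.4f}")
    return GateResult(promote=not reasons, reasons=reasons)


if __name__ == "__main__":
    result = rollout_gate(
        baseline={"auc": 0.910, "recall": 0.80},
        candidate={"auc": 0.915, "recall": 0.78},
        max_regression={"auc": 0.005, "recall": 0.01},
    )
    print(result.promote, result.reasons)
```

Failing closed on a missing metric is a deliberate choice here: a gate that passes when the evaluation step silently did not run is not a gate.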

Core skills

  • Python at production quality — typing, error handling, testing
  • Workflow orchestration — DAG construction, retry logic, parameter sweeps
  • Model registry mechanics — versioning, artifact storage, lifecycle state machine
  • Container image promotion and CI/CD integration
  • Basic Kubernetes at consumer level — enough to write a Job manifest and interpret pod logs, not to operate the cluster
  • Evaluation methodology — offline metrics, canary analysis, data drift detection
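
The "lifecycle state machine" in a registry contract can be pictured as a fixed set of stages plus a table of allowed transitions. The following is a minimal sketch under assumed stage names (registered, staging, production, archived); they are hypothetical and do not mirror the stages of MLflow or any other registry.

```python
from enum import Enum


class Stage(Enum):
    REGISTERED = "registered"
    STAGING = "staging"
    PRODUCTION = "production"
    ARCHIVED = "archived"


# Allowed transitions; anything not listed here is rejected.
ALLOWED: dict[Stage, set[Stage]] = {
    Stage.REGISTERED: {Stage.STAGING, Stage.ARCHIVED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: set(),
}


class InvalidTransition(Exception):
    """Raised when a model version is moved along a path the contract forbids."""


class ModelVersion:
    def __init__(self, name: str, version: int) -> None:
        self.name = name
        self.version = version
        self.stage = Stage.REGISTERED
        self.history = [Stage.REGISTERED]  # audit trail of every stage visited

    def transition(self, target: Stage) -> None:
        if target not in ALLOWED[self.stage]:
            raise InvalidTransition(
                f"{self.name} v{self.version}: {self.stage.value} -> {target.value}"
            )
        self.stage = target
        self.history.append(target)


if __name__ == "__main__":
    model = ModelVersion("churn-classifier", 3)  # hypothetical model name
    model.transition(Stage.STAGING)
    model.transition(Stage.PRODUCTION)
    print([stage.value for stage in model.history])
```

The point of the table is that skipping staging is impossible by construction: a version cannot reach production without having passed through the stage where the promotion gate runs.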

Tool fluency expected

A workflow orchestrator (e.g. Argo Workflows, Kubeflow Pipelines, Airflow); an experiment and model registry (e.g. MLflow, Weights & Biases); a GitOps controller as a deployment consumer; a progressive delivery tool (e.g. Argo Rollouts, Flagger) for canary and rollback; a CI system.

Anti-patterns

  • Hiring an MLOps Engineer to write training code. MLOps Engineers operate the model lifecycle; ML Engineers write the model code. Conflating them means one of the two halves is always understaffed.
  • Hiring an MLOps Engineer whose background is exclusively data pipelines (ETL, dbt, Airflow) with no model lifecycle exposure. They can ship a pipeline but not a registry contract or a production rollout gate. The seam between a data pipeline and a model pipeline is real and matters.

Role 3: ML Engineer (a.k.a. ML Software Engineer)

Job-to-be-done

Owns the model code — training scripts, inference handlers, evaluation harnesses, and the production-grade Python that turns research into a deployable artefact. Sits between the data scientist and the platform.

Skill bar

  • Day one: Senior Python developer. Has shipped a model into production. Comfortable with PyTorch or TensorFlow, distributed training basics (DDP, FSDP), and at least one serving runtime.
  • Month three: Has refactored a notebook prototype into production code with proper logging, error handling, and registry integration. Has run at least one distributed training job.
  • Senior: Has built a custom inference handler (e.g. a custom serving predictor or an extension to an inference runtime). Has optimised a training job for throughput — DDP tuning, FSDP sharding, mixed precision. Can read a paper and ship a working implementation.
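
The DDP versus FSDP decision is at heart a memory calculation. The back-of-envelope sketch below assumes the commonly quoted accounting for mixed-precision Adam training: 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for optimiser state (full-precision weight copy, momentum, variance). It ignores activations, communication buffers, and fragmentation, so treat the numbers as a lower bound. The function name is hypothetical.

```python
BYTES_WEIGHTS = 2     # half-precision parameters
BYTES_GRADIENTS = 2   # half-precision gradients
BYTES_OPTIMISER = 12  # fp32 weight copy + Adam momentum + Adam variance


def model_state_gib(n_params: float, n_gpus: int, fully_sharded: bool) -> float:
    """Approximate per-GPU memory for model state, in GiB."""
    total_bytes = n_params * (BYTES_WEIGHTS + BYTES_GRADIENTS + BYTES_OPTIMISER)
    # DDP replicates the full state on every GPU; full sharding splits it evenly.
    per_gpu_bytes = total_bytes / n_gpus if fully_sharded else total_bytes
    return per_gpu_bytes / 2**30


if __name__ == "__main__":
    params = 7e9  # a hypothetical 7-billion-parameter model
    print(f"DDP,  8 GPUs: {model_state_gib(params, 8, fully_sharded=False):.1f} GiB per GPU")
    print(f"FSDP, 8 GPUs: {model_state_gib(params, 8, fully_sharded=True):.1f} GiB per GPU")
```

Under these assumptions a 7-billion-parameter model needs on the order of 100 GiB of model state per GPU when replicated, which is why sharding rather than replication is the starting point at that scale.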

Core skills

  • Deep Python — typing, performance profiling, production error handling
  • Deep learning frameworks — PyTorch primarily, TensorFlow secondarily
  • Distributed training — DDP, FSDP, and pipeline parallelism patterns
  • Portable inference formats — ONNX, TorchScript, and serialisation trade-offs
  • GPU mechanics at the code level — memory layout, mixed precision, kernel selection
  • Evaluation methodology — offline metrics, ablation design, statistical validity

Tool fluency expected

PyTorch; a training-loop abstraction library (e.g. Hugging Face Accelerate, PyTorch Lightning); a Kubernetes distributed training operator (e.g. Kubeflow Training Operator — PyTorchJob/MPIJob); the model registry client (e.g. MLflow client); an LLM serving runtime (e.g. vLLM, Triton Inference Server) at deployer depth; an experiment tracking tool (e.g. MLflow, Weights & Biases).

Anti-patterns

  • Hiring an ML Engineer who is also expected to operate the cluster. They will write good model code on a fragile substrate. Substrate is the AI Platform Engineer’s job.
  • Hiring a data scientist into an ML Engineer role. Data scientists optimise for what model to build; ML engineers optimise for what model to ship. The skills overlap but the optimisation function does not.

Role 4: Data Scientist

Job-to-be-done

Owns model design and selection — research the right approach, build the prototype, prove the metric, document the experiment. Hands off to ML Engineering for productionisation or to MLOps for deployment. Measure of success: the quality and defensibility of the model decision.

Skill bar

  • Day one: Strong statistics and ML fundamentals. Comfortable with Python, Jupyter, and the major framework ecosystem. Has shipped at least one model that influenced a business decision.
  • Month three: Owns the modelling decisions for at least one product. Has run a rigorously designed experiment (offline evaluation, A/B test, or equivalent). Knows when to use a tree model and when to use a neural net.
  • Senior: Designs evaluation methodology that distinguishes “the model is better” from “the test set is leaking”. Can challenge a product team on whether ML is the right solution for the problem.

Core skills

  • Statistics and ML theory — distributions, hypothesis testing, overfitting analysis
  • Python — production-comfortable but not necessarily production-quality
  • Experimental design — treatment/control framing, power analysis, confound identification
  • Data exploration — pandas, polars, SQL at analytical depth
  • Model interpretation — SHAP, partial dependence plots, ablations
  • Domain expertise in at least one problem area
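
Power analysis, listed above under experimental design, answers the question "how many samples do I need before this A/B test can detect the effect I care about?". The sketch below uses the standard normal-approximation formula for comparing two proportions with equal-sized arms; the conversion rates in the example are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(
    p_control: float,
    p_treatment: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Samples needed in each arm for a two-sided, two-proportion z-test."""
    if p_control == p_treatment:
        raise ValueError("the two rates must differ to define an effect size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect**2)


if __name__ == "__main__":
    # Hypothetical uplift from a 10% to a 12% conversion rate.
    print(sample_size_per_arm(0.10, 0.12))
```

Halving the detectable effect roughly quadruples the required sample, which is usually the conversation a senior Data Scientist needs to have with the product team before the experiment starts.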

Tool fluency expected

Jupyter (or equivalent notebook environment); pandas and/or polars; scikit-learn; PyTorch; the model registry tracking client (e.g. MLflow client-side, Weights & Biases) for experiment logging.

Anti-patterns

  • Expecting a Data Scientist to own the production pipeline. They produce the model; the system around the model belongs to ML or MLOps Engineering. The “data scientist who deploys” does exist, but is typically a senior profile; in most teams the two roles remain distinct.
  • Hiring a Data Scientist with no statistics background and only ML library experience. They can run a notebook; they cannot tell you whether the result is significant. The library-only profile is increasingly common and increasingly insufficient for any consequential model decision.

Role 5: DevOps / SRE (related, not the same)

Job-to-be-done

Owns the general infrastructure — CI runners, application platforms, networking, identity, and on-call rotation for non-ML services. Conflated with AI Platform Engineering because both deal with Kubernetes; distinct because the AI Platform Engineer carries GPU-specific depth that a general SRE does not.

Skill bar

  • Day one: Standard SRE skill set — Linux, Kubernetes, infrastructure-as-code, CI/CD, on-call experience.
  • Month three: Owns the application platform that the rest of the organisation consumes.
  • Senior: Designs the failure-mode story across the application platform — incident response, runbook coverage, SLO definition.

Core skills

  • Linux — systemd, cgroups, kernel namespaces
  • Kubernetes at admin level — not GPU-specialised, but cluster operations and RBAC depth
  • Infrastructure as code — Terraform or equivalent
  • CI/CD systems
  • Observability — metrics, logs, traces using open standards (e.g. Prometheus, OpenTelemetry)
  • Incident response and identity (OIDC, federation)

Tool fluency expected

Terraform or an equivalent IaC tool; a GitOps controller; a metrics stack (e.g. Prometheus, Grafana); a log aggregation tool; an identity provider integration.

Anti-patterns

  • Treating an AI Platform Engineer as “the SRE who knows about GPUs”. The role overlaps but the depth is in different places. A senior SRE without GPU experience needs meaningful ramp time to do the AI Platform job — they are not a drop-in replacement.
  • Routing ML drift alerts to the SRE on-call. Drift is an MLOps concern; the SRE’s on-call is for substrate failures. Mixing the two alert queues burns out the SRE on signals they cannot action.
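
The alert-routing split can be made explicit in configuration rather than left to convention. The sketch below is illustrative only: the alert kinds and queue names are hypothetical, and a real setup would express the same mapping in whatever alerting tool the team already runs.

```python
from dataclasses import dataclass
from enum import Enum


class AlertKind(Enum):
    NODE_NOT_READY = "node_not_ready"          # substrate failure
    DATA_DRIFT = "data_drift"                  # model-lifecycle signal
    PREDICTION_QUALITY = "prediction_quality"  # model-lifecycle signal


# Hypothetical queue names; each alert kind has exactly one owner.
ROUTES: dict[AlertKind, str] = {
    AlertKind.NODE_NOT_READY: "sre-oncall",
    AlertKind.DATA_DRIFT: "mlops-oncall",
    AlertKind.PREDICTION_QUALITY: "mlops-oncall",
}


@dataclass(frozen=True)
class Alert:
    kind: AlertKind
    summary: str


def route(alert: Alert) -> str:
    """Return the on-call queue that can actually act on this alert."""
    return ROUTES[alert.kind]


if __name__ == "__main__":
    alert = Alert(AlertKind.DATA_DRIFT, "feature distribution shifted on model X")
    print(route(alert))
```

Keying the table on an enum means a new alert kind cannot be introduced without someone deciding which queue owns it.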

Role 6: Data Engineer (related, not the same)

Job-to-be-done

Owns the data pipelines — ETL/ELT into the warehouse, feature store population, and the data-quality contracts that downstream consumers depend on. Conflated with MLOps because both run pipelines; distinct because the output artefact is data, not a model.

Skill bar

  • Day one: SQL and data-modelling depth. Comfortable with a warehouse engine and at least one orchestration tool.
  • Month three: Owns a meaningful slice of the data layer feeding ML workloads.
  • Senior: Designs the data-quality contract and SLA across producers and consumers. Understands model training data requirements well enough to write the schema contract.

Core skills

  • SQL — deep analytical and data modelling depth (Kimball or data-vault patterns)
  • Warehouse engine — at least one of Snowflake, BigQuery, Databricks, or an equivalent
  • Pipeline orchestration — Airflow, dbt, Dagster, or equivalent
  • Data quality tooling — dbt tests, Great Expectations, or equivalent contract testing
  • Streaming fundamentals — Kafka and, at senior levels, Flink or Spark Structured Streaming

Tool fluency expected

dbt; Airflow or Dagster; the team’s warehouse; a data quality tool. The lakeFS 2025 State of Data and AI Engineering report notes the Data Engineer skill mix is shifting toward real-time pipeline patterns and governance work that supports AI at scale — feature store population and ML data contract ownership are now common responsibilities.

Anti-patterns

  • Treating Data Engineering as upstream of MLOps with no shared contract. Either the Data Engineer is part of the ML feedback loop or the ML team rebuilds the data pipeline. The shared contract is the feature store or the feature view — something with a schema and an SLA.
  • Hiring a Data Engineer to do MLOps. They will build a solid data pipeline and a marginal model pipeline. The optimisation function differs — data pipelines optimise for data freshness and schema stability; model pipelines optimise for training reproducibility and rollout safety.
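
A shared contract with "a schema and an SLA" can be sketched as a small object that both the Data Engineer and the MLOps Engineer test against. The example below is a minimal illustration, assuming a freshness SLA and a column-to-type schema; the class, field, and feature names are hypothetical and do not correspond to any feature store's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FeatureContract:
    name: str
    schema: dict[str, type]   # column name -> expected Python type
    max_staleness: timedelta  # freshness SLA


def check_contract(
    contract: FeatureContract,
    rows: list[dict],
    last_updated: datetime,
    now: datetime,
) -> list[str]:
    """Return a list of human-readable violations; empty means the contract holds."""
    violations: list[str] = []
    if now - last_updated > contract.max_staleness:
        violations.append(f"{contract.name}: data older than {contract.max_staleness}")
    for index, row in enumerate(rows):
        for column in sorted(contract.schema.keys() - row.keys()):
            violations.append(f"row {index}: missing column '{column}'")
        for column, expected in contract.schema.items():
            if column in row and not isinstance(row[column], expected):
                violations.append(f"row {index}: '{column}' is not {expected.__name__}")
    return violations


if __name__ == "__main__":
    contract = FeatureContract(
        name="user_features",
        schema={"user_id": int, "avg_basket_value": float},
        max_staleness=timedelta(hours=6),
    )
    now = datetime(2025, 1, 1, 12, tzinfo=timezone.utc)
    rows = [{"user_id": 1, "avg_basket_value": 12.5}, {"user_id": "2"}]
    for violation in check_contract(contract, rows, now - timedelta(hours=7), now):
        print(violation)
```

The same check can run on the producer side before publishing and on the consumer side before training, which is what turns the contract into something shared rather than something one team assumes.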

The most common conflation: MLOps vs ML Engineering

The role most frequently mis-hired is MLOps Engineer, because the title suggests both model code and operations. Multiple industry analyses of job postings in 2025 and 2026 note that hiring managers conflate the two, and that job descriptions routinely bundle model development, deployment, and infrastructure under a single title, even though SFIA 9 explicitly places those responsibilities in different competency clusters.

The operational split:

  • MLOps Engineer owns the lifecycle of the model on the platform — pipelines, registry, rollouts, retraining triggers, observability. They may never write a training loop.
  • ML Engineer owns the production-grade model code — training scripts, inference handlers, evaluation harnesses, optimisation. They may never touch GitOps.

In a small team one person does both. In any team larger than five, splitting them is what makes both halves work. Myticas Consulting’s 2026 analysis frames this practically: hire an MLOps Engineer when the organisation is ready to move models into production at scale; hire an ML Engineer when the organisation is in active model development. Both needs exist in a mature team simultaneously.

The role-by-role matrix in summary

The matrix below is presented as structured role summaries. Each entry: role / job-to-be-done / primary skill domain / tool category / the mistake to avoid.

  • Platform Engineer (AI Platform) — substrate reliability / Kubernetes + GPU operations / cluster scheduling + GitOps + CNI / do not ask them to write model code
  • MLOps Engineer — model lifecycle / Python + pipeline orchestration + registry / workflow engine + model registry + progressive delivery / do not conflate with ML Engineering or data engineering
  • ML Engineer — model code / deep Python + ML frameworks + distributed training / training framework + serving runtime + experiment tracker / do not ask them to run the cluster
  • Data Scientist — model design / statistics + ML theory + experimental design / notebook + data exploration + experiment tracker / do not expect them to own the production pipeline
  • DevOps / SRE — general infrastructure / Linux + Kubernetes + IaC + incident response / Terraform + GitOps + observability stack / not a drop-in for AI Platform Engineering without GPU ramp time
  • Data Engineer — data pipelines / SQL + data modelling + pipeline orchestration / warehouse + dbt + data quality tooling / not a substitute for MLOps without model lifecycle exposure

What comes next in this series

The next article, The deployment-context spectrum, introduces the four deployment contexts — pure-cloud, on-prem, hybrid, and air-gapped — that every role described here will encounter. The tools change across those contexts; the role definitions above hold across all four.

References

1. LinkedIn News — "LinkedIn Jobs on the Rise 2026: The 25 fastest-growing roles in the U.S." (LinkedIn, 2026). Source for AI/ML job-posting growth figures (163% surge 2024–2025; 49,200 U.S. positions); four of five fastest-growing roles being AI-related; MLOps and AI infrastructure engineers treated as separate categories.

2. SFIA Foundation — "Machine Learning (MLNG) skill definition, SFIA 9" (BCS / SFIA Foundation, 2024). The MLNG definition on SFIA's seven-level responsibility scale, separating model-building from MLOps pipeline competencies; international standard for digital skills mapping.

3. SFIA Foundation — "SFIA: a framework for AI skills" (BCS / SFIA Foundation, 2024). AI & data literacy, ML model building, MLOps, and impact-on-jobs competency clusters used to anchor role skill bars.

4. Myticas Consulting — "MLOps vs ML Engineer: Which Should You Hire in 2026?" (Myticas Consulting, 2026). Practitioner analysis of the production-gap problem and decision criteria for hiring MLOps vs ML Engineering roles.

5. lakeFS — "The State of Data and AI Engineering 2025" (lakeFS, May 2025). Survey findings on the shifting Data Engineer skill mix toward real-time pipelines and ML governance responsibilities.

6. CNCF TAG App Delivery — "Platforms Definition White Paper" (CNCF, 2023). The distinction between platform capability providers and platform users, anchoring the Platform Engineer / consumer-role split.

7. Turkovic, I. — "AI Job Titles in 2026: A CTO’s Guide to the Naming Chaos" (April 2026). CTO-level analysis of AI job title non-standardisation and the DevOps/MLOps/Platform engineering conflation pattern.

Tags

#hiring #roles #series:ai-platform-mlops #series-order/03

About the Author

asleekgeek


Senior Developer, Architect, DevOps

Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist
