AI Platform Engineering & MLOps · Part XXXIII of 34

Observability for GenAI: Prometheus, Grafana, Tempo, and the OpenTelemetry GenAI Conventions

How to wire metrics, traces, and SLOs across LLM calls, tool invocations, and agent turns using the OpenTelemetry GenAI semantic conventions.


[Figure: stack overview with the labels GenAI App, OTel, Prometheus, Grafana Tempo, Grafana]

A production GenAI deployment is not a single service. It is a call graph: an agent turn fans out into one or more model calls, each of which may trigger tool invocations, which themselves reach retrievers, databases, or downstream APIs. Classical application observability — a latency histogram, an error counter, a single trace per HTTP request — loses most of this structure. The token spend lives inside the inference span. The retrieval latency lives inside the tool span. The conversation thread that correlates three turns of a multi-step agent exists nowhere in a plain HTTP trace.

The OpenTelemetry GenAI semantic conventions give that call graph a shared vocabulary. They define which attribute names carry which facts — provider, model, token counts, tool name, conversation ID — so the collector, the trace store, and the dashboard all agree on what a span means without any team inventing a private schema. This article walks through the conventions, the three-pillar stack that collects and renders them, and the SLI/SLO design that turns raw telemetry into actionable reliability targets for both inference and agent workloads.

The three-pillar stack

Prometheus is a CNCF Graduated project (since August 2018) handling time-series metrics with a pull-based scrape model, a rich query language (PromQL), and a histogram primitive that enables correct percentile computation across replicas. Grafana Tempo stores distributed traces in object storage and intentionally does not index span attributes — you query by trace ID, service name, or TraceQL attribute predicates, which keeps storage costs low regardless of attribute cardinality. Grafana provides the unified UI. OpenTelemetry reached CNCF Graduation on 11 May 2026, cementing its position as the vendor-neutral telemetry standard; it is both the SDK that instruments applications and the Collector that fans telemetry out to Prometheus and Tempo.

The seam between spans and metrics is Tempo’s metrics-generator. It computes span metrics — request count, error count, and duration histograms — for every unique combination of service, operation name, and span kind, and remote-writes the resulting series into the same Prometheus-compatible backend the rest of the platform already uses. The effect is that RED (Rate, Errors, Duration) metrics for every GenAI operation name are available in Prometheus without any workload having to manually expose a Prometheus endpoint.
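To make that derivation concrete, here is a minimal Python sketch of the kind of aggregation a span-metrics processor performs. It is an illustration rather than Tempo's implementation: the span dictionaries, the `BUCKETS` boundaries, the span names, and the `span_metrics` helper are all assumptions made for the example.

```python
from bisect import bisect_left
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical histogram bucket upper bounds, in seconds.
BUCKETS = [0.1, 0.5, 1.0, 2.5, 5.0, 10.0]


@dataclass
class RedSeries:
    requests: int = 0
    errors: int = 0
    # One counter per bucket plus a trailing +Inf bucket (non-cumulative).
    bucket_counts: list = field(default_factory=lambda: [0] * (len(BUCKETS) + 1))


def span_metrics(spans):
    """Fold finished spans into one RED series per (service, name, kind)."""
    series = defaultdict(RedSeries)
    for span in spans:
        red = series[(span["service"], span["name"], span["kind"])]
        red.requests += 1
        if span["status"] == "ERROR":
            red.errors += 1
        # bisect_left finds the first bucket whose upper bound is >= duration.
        red.bucket_counts[bisect_left(BUCKETS, span["duration_s"])] += 1
    return dict(series)


spans = [
    {"service": "support-agent", "name": "chat gpt-4o", "kind": "CLIENT",
     "status": "OK", "duration_s": 0.8},
    {"service": "support-agent", "name": "chat gpt-4o", "kind": "CLIENT",
     "status": "ERROR", "duration_s": 6.2},
    {"service": "support-agent", "name": "execute_tool search", "kind": "INTERNAL",
     "status": "OK", "duration_s": 0.3},
]
metrics = span_metrics(spans)
chat = metrics[("support-agent", "chat gpt-4o", "CLIENT")]
print(chat.requests, chat.errors, chat.bucket_counts)
```

Rate is the change in `requests` over time, errors the change in `errors`, and duration percentiles come from the bucket counts, which is why no workload needs its own Prometheus endpoint for these three signals.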

The OpenTelemetry GenAI semantic conventions

The GenAI semantic conventions define the gen_ai.* attribute namespace. They cover three span shapes that together describe a GenAI call graph: inference spans (a model call), tool spans (a function or retrieval invocation), and agent spans (a turn or agent lifecycle event). As of mid-2026, client spans and metrics have stabilised; agent span conventions remain in Development status — treat attribute names as a moving target and re-check the spec before hard-coding dashboard queries.

A note on the stability opt-in: instrumentations should not silently change which convention version they emit. Set the environment variable

env — semconv version opt-in

OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental

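to explicitly select the newest experimental names the instrumentation supports, rather than the older names it emits by default. Because the opt-in tracks the latest experimental version, pin the instrumentation library version as well so an upgrade cannot silently rename attributes under your dashboards.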
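A service can also check the opt-in at startup so a missing variable is noticed before dashboards go blank. The sketch below is one possible guard, assuming the variable holds a comma-separated list of opt-in tokens; the helper name `genai_semconv_opted_in` is hypothetical.

```python
import os


def genai_semconv_opted_in(environ=os.environ):
    """Return True when the GenAI latest-experimental opt-in is present."""
    raw = environ.get("OTEL_SEMCONV_STABILITY_OPT_IN", "")
    # Split the comma-separated value and ignore surrounding whitespace.
    tokens = {token.strip() for token in raw.split(",") if token.strip()}
    return "gen_ai_latest_experimental" in tokens


if not genai_semconv_opted_in():
    print("warning: GenAI semconv opt-in not set; attribute names may be the legacy ones")
```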

Inference spans

The core model-call span has span kind CLIENT for a remote provider and INTERNAL when the model runs in-process. The gen_ai semantic conventions for spans specify the following attributes, among others:

gen_ai.provider.name — the GenAI provider (newer name; older instrumentations emit gen_ai.system)
gen_ai.operation.name — operation type: chat, generate_content, or embeddings
gen_ai.request.model — the model identifier requested
gen_ai.response.model — the model that actually answered (may differ from the requested model)
gen_ai.usage.input_tokens — prompt tokens consumed
gen_ai.usage.output_tokens — completion tokens generated
gen_ai.response.finish_reasons — array of reasons why generation stopped (e.g. stop, length)
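As a rough illustration of how these attributes fit together, the following Python sketch assembles the attribute map for one inference span. The helper `inference_span_attributes` is hypothetical and uses only the attribute names listed above; the sample values are invented for the example.

```python
def inference_span_attributes(provider, operation, request_model,
                              response_model=None, input_tokens=None,
                              output_tokens=None, finish_reasons=None):
    """Build a gen_ai.* attribute dict, omitting facts that are not known."""
    attrs = {
        "gen_ai.provider.name": provider,
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": request_model,
    }
    optional = {
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons":
            list(finish_reasons) if finish_reasons else None,
    }
    # Only record response-side attributes once the response has supplied them.
    attrs.update({key: value for key, value in optional.items() if value is not None})
    return attrs


attrs = inference_span_attributes(
    provider="openai", operation="chat", request_model="gpt-4o",
    response_model="gpt-4o-2024-08-06", input_tokens=412, output_tokens=87,
    finish_reasons=["stop"],
)
print(attrs)
```

With the OpenTelemetry Python API, a dict of this shape can be handed to a span's `set_attributes` method; the sketch stays SDK-free so it runs on its own.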

Tool invocation spans

A tool call carries gen_ai.operation.name=execute_tool and the following attributes:

gen_ai.tool.name — name of the tool the agent called (required)
gen_ai.tool.call.id — identifier correlating a call to its result
gen_ai.tool.type — tool category: function, extension, or datastore

Agent spans

Agent lifecycle uses gen_ai.operation.name=invoke_agent for an agent turn and create_agent for construction, carrying gen_ai.agent.id, gen_ai.agent.name, and gen_ai.conversation.id for session correlation. The spec distinguishes a CLIENT span (agent is a remote service) from an INTERNAL span (agent runs in-process), but both carry the same attribute set, so a single TraceQL filter on gen_ai.operation.name=invoke_agent captures every agent turn regardless of deployment topology. gen_ai.conversation.id is the thread stitching a multi-turn session together: use it as a TraceQL filter (replay one session’s full span tree) and, where cardinality allows, as a metrics dimension (per-conversation latency). Conversation IDs are unbounded, so as a Prometheus label every new conversation creates new series. See the GenAI agent span conventions for the full attribute listing.
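The session-stitching role of gen_ai.conversation.id can be sketched in a few lines of Python. This is an in-memory stand-in for what a trace query would return, not a Tempo API: the span dictionaries, the `start_ns` field, and the `turns_by_conversation` helper are assumptions for illustration.

```python
from collections import defaultdict


def turns_by_conversation(spans):
    """Group invoke_agent spans by conversation ID, oldest turn first."""
    sessions = defaultdict(list)
    for span in spans:
        attrs = span["attributes"]
        if attrs.get("gen_ai.operation.name") != "invoke_agent":
            continue
        conversation_id = attrs.get("gen_ai.conversation.id")
        if conversation_id is not None:
            sessions[conversation_id].append(span)
    for turns in sessions.values():
        turns.sort(key=lambda span: span["start_ns"])
    return dict(sessions)


spans = [
    {"start_ns": 300, "attributes": {"gen_ai.operation.name": "invoke_agent",
                                     "gen_ai.conversation.id": "conv_XyZ789"}},
    {"start_ns": 100, "attributes": {"gen_ai.operation.name": "invoke_agent",
                                     "gen_ai.conversation.id": "conv_XyZ789"}},
    {"start_ns": 200, "attributes": {"gen_ai.operation.name": "invoke_agent",
                                     "gen_ai.conversation.id": "conv_other"}},
    {"start_ns": 150, "attributes": {"gen_ai.operation.name": "chat"}},
]
sessions = turns_by_conversation(spans)
print({conv: [turn["start_ns"] for turn in turns] for conv, turns in sessions.items()})
```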

The inspector below shows the root span of an example agent-turn trace, with its gen_ai.* attributes and the SLI they feed.

OTel GenAI Span Inspector

Span: invoke_agent · status OK · span.kind=INTERNAL · duration 4230 ms

Attribute	Value	Requirement level
gen_ai.operation.name	invoke_agent	required
gen_ai.agent.name	support-agent	required
gen_ai.agent.id	agent_001	required
gen_ai.conversation.id	conv_XyZ789	required

SLI signal

Trajectory success rate SLI: this span succeeds ⇒ counts as a good event. Abnormal step count here is a saturation signal.

Collector pipeline: receiving and routing `gen_ai` spans

Applications export OTLP spans to the OpenTelemetry Collector. The collector’s job is to batch spans, apply attribute policies that keep the gen_ai.* keys you actually dashboard on, and forward to the trace backend. A minimal illustrative configuration:

otel-collector-config.yaml (illustrative)

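receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  # gen_ai.* attributes pass through the Collector unchanged by default, so no
  # action is needed just to keep them. insert/update/upsert actions need a
  # value source (value, from_attribute, or from_context); a bare
  # "action: upsert" fails config validation.
  # Keys the dashboards in this article rely on: gen_ai.provider.name,
  # gen_ai.operation.name, gen_ai.request.model, gen_ai.usage.input_tokens,
  # gen_ai.usage.output_tokens, gen_ai.tool.name, gen_ai.conversation.id.
  attributes/genai:
    actions:
      # Backfill the newer provider key from the legacy gen_ai.system key
      # when an older instrumentation emitted only the latter.
      - key: gen_ai.provider.name
        from_attribute: gen_ai.system
        action: insert
  batch: {}

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true  # assumption: plaintext gRPC inside the cluster; remove if Tempo serves TLS

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/genai, batch]  # batch last, after attribute edits
      exporters: [otlp/tempo]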

Tune to your environment: add tail-sampling processors if you want to keep all error traces and drop a fraction of successful ones, and add a redaction processor if prompt content must not leave the cluster. These are reference patterns, not validated production configurations.

On the Tempo side, configure the metrics-generator to retain gen_ai.operation.name as a dimension so one RED stream separates inference latency from tool latency from agent-turn latency without re-instrumenting anything. Token counts are attribute values, not durations — chart them via TraceQL metrics over the usage attributes rather than from the duration-based span-metrics stream.
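The distinction between duration-based span metrics and attribute-based token accounting can be illustrated with a small Python sketch. It sums the usage attributes per requested model over a list of span dictionaries; it stands in for what an attribute-level trace query would compute and is not TraceQL. The span shape and the `token_spend_by_model` helper are assumptions.

```python
def token_spend_by_model(spans):
    """Sum input and output tokens per requested model from span attributes."""
    totals = {}
    for span in spans:
        attrs = span["attributes"]
        model = attrs.get("gen_ai.request.model")
        if model is None:
            continue  # tool and agent spans carry no token usage
        spend = totals.setdefault(model, {"input": 0, "output": 0})
        spend["input"] += attrs.get("gen_ai.usage.input_tokens", 0)
        spend["output"] += attrs.get("gen_ai.usage.output_tokens", 0)
    return totals


spans = [
    {"attributes": {"gen_ai.request.model": "gpt-4o",
                    "gen_ai.usage.input_tokens": 412,
                    "gen_ai.usage.output_tokens": 87}},
    {"attributes": {"gen_ai.request.model": "gpt-4o",
                    "gen_ai.usage.input_tokens": 100,
                    "gen_ai.usage.output_tokens": 13}},
    {"attributes": {"gen_ai.operation.name": "execute_tool",
                    "gen_ai.tool.name": "search"}},
]
totals = token_spend_by_model(spans)
print(totals)
```

Nothing in this calculation touches span duration, which is why the duration histograms produced by the metrics-generator cannot answer token-spend questions.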

The explorer below traces a GenAI request through the pipeline; the inference span type is shown here as a static example.

Telemetry Flow Explorer

Span type: inference. A model call with span kind CLIENT (remote provider) or INTERNAL (in-process). Carries provider, model, token counts, and finish reasons.

Attribute	Value
gen_ai.operation.name	chat
gen_ai.provider.name	openai
gen_ai.request.model	gpt-4o
gen_ai.usage.input_tokens	412
gen_ai.usage.output_tokens	87
gen_ai.response.finish_reasons	["stop"]
gen_ai.response.model	gpt-4o-2024-08-06

Pipeline

OTel SDK: emits the span with gen_ai.usage.* on request completion.

SLI and SLO design for GenAI workloads

An SLI (Service Level Indicator) is a measured ratio of good events to valid events. An SLO (Service Level Objective) is the target that ratio must meet over a window. The error budget is the complement of the SLO — a 99% objective permits 1% of bad events over the window, and consuming that budget is the signal that triggers both automated alerting and engineering action. This model is described in detail in the Google SRE Workbook, Implementing SLOs.
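The arithmetic behind these three terms is short enough to write out. The Python sketch below uses invented event counts and hypothetical helper names; it only restates the definitions above.

```python
def sli(good_events, valid_events):
    """Measured ratio of good events to valid events."""
    return good_events / valid_events


def error_budget(slo):
    """Fraction of valid events allowed to be bad, e.g. 0.01 for a 99% SLO."""
    return 1.0 - slo


def budget_remaining(good_events, valid_events, slo):
    """Share of the error budget still unspent (negative once overspent)."""
    allowed_bad = error_budget(slo) * valid_events
    actual_bad = valid_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Hypothetical window: 200,000 valid requests, 198,600 of them good.
valid, good = 200_000, 198_600
print(f"SLI: {sli(good, valid):.4f}")
print(f"budget remaining at a 99% SLO: {budget_remaining(good, valid, 0.99):.0%}")
```

In this example 1,400 bad events have been spent against an allowance of about 2,000, so roughly 30% of the budget is left for the rest of the window.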

Inference endpoint SLIs

Four signal families cover what callers experience at an inference endpoint:

Latency — per-request wall time at p50, p95, p99. Report as histograms (not gauges or summaries) to enable correct aggregation across replicas and percentile recomputation after the fact.
Availability — fraction of valid requests returning a non-error response. Exclude 4xx (caller errors) from the budget; count 5xx and timeouts as bad events.
Error rate — fraction of requests that fail or time out. Tracked separately from availability for alerting granularity.
Saturation — how full the serving queue is relative to actively running requests. A leading indicator: saturation predicts latency and error degradation before users observe it.
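The case for histograms over per-replica percentiles can be shown with a small Python sketch. Bucket counts add across replicas, so a fleet-wide percentile can be recomputed from the merged histogram, whereas averaging per-replica percentiles gives a different and generally wrong number. The bucket bounds, the counts, and the linear-interpolation estimator are assumptions for illustration; the estimator is similar in spirit to PromQL's histogram_quantile but is not a reimplementation of it.

```python
def merge_histograms(replicas):
    """Sum cumulative bucket counts across replicas that share bucket bounds."""
    return [sum(counts) for counts in zip(*replicas)]


def histogram_quantile(q, bounds, cumulative):
    """Estimate the q-quantile by linear interpolation inside the hit bucket.

    bounds holds the finite upper bounds; cumulative has one extra trailing
    entry for the +Inf bucket, which equals the total observation count.
    """
    total = cumulative[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            if i == len(bounds):
                return bounds[-1]  # landed in +Inf: report the last finite bound
            lower = bounds[i - 1] if i > 0 else 0.0
            previous = cumulative[i - 1] if i > 0 else 0
            in_bucket = count - previous
            if in_bucket == 0:
                return lower
            return lower + (bounds[i] - lower) * (rank - previous) / in_bucket
    return bounds[-1]


bounds = [0.5, 1.0, 2.0, 5.0]              # seconds
replica_a = [50, 80, 95, 100, 100]         # mostly fast requests
replica_b = [10, 30, 60, 100, 100]         # mostly slow requests
merged = merge_histograms([replica_a, replica_b])

fleet_p95 = histogram_quantile(0.95, bounds, merged)
mean_of_p95 = (histogram_quantile(0.95, bounds, replica_a)
               + histogram_quantile(0.95, bounds, replica_b)) / 2
print(f"fleet p95 from merged buckets: {fleet_p95:.2f}s")
print(f"mean of per-replica p95s:      {mean_of_p95:.2f}s")
```

The two printed values differ, which is the practical reason to export histograms and compute percentiles at query time.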

For streaming LLM endpoints — where tokens arrive incrementally and the user reads the first token long before the last arrives — a single latency number is misleading. Split latency into two separate SLIs:

Time to first token (TTFT) — from request receipt to the first streamed token. Dominated by prompt-prefill cost and queue wait. This is perceived responsiveness: it is what makes a chat interface feel alive or sluggish.
Inter-token latency (time per output token) — the mean time between successive output tokens during decode. Driven by decode-side load such as batch size, while total decode time grows with output length. A long prompt with a short answer is dominated by TTFT; a short prompt with a long answer is dominated by decode time, and each pattern needs its own objective.
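Both streaming SLIs can be computed from token arrival timestamps. The Python sketch below assumes the serving layer records the request receipt time and one timestamp per streamed token; the `streaming_latency` helper and the sample numbers are hypothetical.

```python
def streaming_latency(request_ts, token_ts):
    """Return (ttft, mean_inter_token_latency) in the unit of the timestamps.

    The inter-token value is None when fewer than two tokens were streamed.
    """
    if not token_ts:
        raise ValueError("no tokens were streamed")
    ttft = token_ts[0] - request_ts
    if len(token_ts) < 2:
        return ttft, None
    # Mean gap between consecutive tokens, excluding the wait for the first one.
    inter_token = (token_ts[-1] - token_ts[0]) / (len(token_ts) - 1)
    return ttft, inter_token


# Hypothetical stream: request received at t=0.0s, four tokens follow.
ttft, itl = streaming_latency(0.0, [0.42, 0.47, 0.52, 0.57])
print(f"TTFT: {ttft * 1000:.0f} ms, inter-token latency: {itl * 1000:.0f} ms")
```

Keeping the first-token wait out of the inter-token average is what lets the two objectives move independently.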

Agent endpoint SLIs

Agent workloads require additional SLIs beyond what inference endpoints need, because the failure modes are different. A single LLM call either succeeds or fails; an agent trajectory can complete with the wrong answer, use more steps than necessary, or call tools that error without the overall trace indicating a fault. Research on trajectory evaluation — including work on tool-augmented agent benchmarks (see arXiv:2510.02837) — distinguishes efficiency and correctness signals that translate into platform-side SLIs:

Trajectory success rate — fraction of agent turns that complete without an unhandled error or an abort. Measured from the invoke_agent spans in Tempo. Note: a specific percentage target (e.g. ≥90%) is workload-specific; calibrate against your own observed distribution before committing an objective.
Tool error rate — fraction of execute_tool spans ending in an error status. Surfaced directly from Tempo’s metrics-generator once gen_ai.tool.name is retained as a dimension.
Retrieval relevance — whether the retriever returned contextually relevant results. The literature establishes this as a valid quality signal (contextual relevancy, faithfulness scores), but the exact threshold is workload-specific. This SLI is best emitted by the retrieval layer itself as a custom span attribute and surfaced through Tempo, rather than computed by the platform.
Step count per turn — the number of tool invocations nested within a single invoke_agent span. An unexpectedly high step count signals loop behaviour or a degenerate plan, and is a useful saturation signal even before errors appear.
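Three of these SLIs can be computed directly from span data. The Python sketch below works over an in-memory list of spans and is only an illustration: the `span_id`, `parent_id`, and `status` fields and the `agent_slis` helper are assumptions, and retrieval relevance is left out because it needs a score emitted by the retrieval layer.

```python
def agent_slis(spans):
    """Compute trajectory success rate, tool error rate, and steps per turn."""
    by_id = {span["span_id"]: span for span in spans}

    def operation(span):
        return span["attributes"].get("gen_ai.operation.name")

    def enclosing_turn(span):
        # Walk up the parent chain until an invoke_agent span is found.
        parent = by_id.get(span.get("parent_id"))
        while parent is not None and operation(parent) != "invoke_agent":
            parent = by_id.get(parent.get("parent_id"))
        return parent["span_id"] if parent else None

    turns = [s for s in spans if operation(s) == "invoke_agent"]
    tools = [s for s in spans if operation(s) == "execute_tool"]

    steps = {turn["span_id"]: 0 for turn in turns}
    for tool in tools:
        turn_id = enclosing_turn(tool)
        if turn_id in steps:
            steps[turn_id] += 1

    return {
        "trajectory_success_rate":
            sum(t["status"] != "ERROR" for t in turns) / len(turns) if turns else None,
        "tool_error_rate":
            sum(t["status"] == "ERROR" for t in tools) / len(tools) if tools else None,
        "steps_per_turn": steps,
    }


def span(span_id, parent_id, status, op):
    return {"span_id": span_id, "parent_id": parent_id, "status": status,
            "attributes": {"gen_ai.operation.name": op}}


spans = [
    span("a", None, "OK", "invoke_agent"),
    span("a1", "a", "OK", "chat"),
    span("a2", "a1", "ERROR", "execute_tool"),  # tool fails, turn still OK
    span("a3", "a", "OK", "execute_tool"),
    span("b", None, "ERROR", "invoke_agent"),
]
result = agent_slis(spans)
print(result)
```

Turn `a` in the sample data finishes with status OK even though one of its tools errored, which is the case that makes a separate tool error rate necessary.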

Multi-window burn-rate alerting

Alerting directly on a threshold in a single time window is noisy: a short window pages on every transient blip; a long window takes hours to notice a hard outage. The Google SRE Workbook, Alerting on SLOs describes multi-window, multi-burn-rate alerting: pair a fast-burn alert (short + long window, both breaching) for acute outages with a slow-burn alert (longer windows) for gradual erosion. The short window confirms the condition is still active; the long window suppresses flapping.

Burn rate is how fast the error budget is being consumed relative to spending it evenly across the full window. A burn rate of 1.0 exhausts the budget exactly at the window’s end; a rate of 14.4 would exhaust a 30-day budget in roughly two days. The Workbook’s worked example for a 99.9% objective produces these pairings:

Page (fast): 1h long window / 5m short window — burn rate 14.4 — consumes ~2% of budget before firing
Page (medium): 6h long window / 30m short window — burn rate 6 — consumes ~5% of budget before firing
Ticket (slow): 3d long window / 6h short window — burn rate 1 — consumes ~10% of budget before firing

Expressed as a concrete PromQL alert for the latency SLI, assuming a 99.9% objective (error budget = 0.001) and recording rules that pre-compute the error ratio:

alerting-rules.yaml (illustrative — tune to your objective)

groups:
  - name: genai_slo_alerts
    rules:
      # Fast-burn page: both 5m and 1h windows exceed 14.4x budget spend
      - alert: GenAILatencySLOFastBurn
        expr: |
          (
            slo:genai_latency:error_ratio:rate5m  > (14.4 * 0.001)
            and
            slo:genai_latency:error_ratio:rate1h  > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: 'GenAI latency SLO fast-burn: exhausting 30-day budget in ~2 days'

      # Medium-burn page: 6h and 30m windows both exceed 6x budget spend
      - alert: GenAILatencySLOMediumBurn
        expr: |
          (
            slo:genai_latency:error_ratio:rate30m > (6 * 0.001)
            and
            slo:genai_latency:error_ratio:rate6h  > (6 * 0.001)
          )
        for: 15m
        labels:
          severity: page
        annotations:
          summary: 'GenAI latency SLO medium-burn: gradual erosion detected'

      # Slow-burn ticket: 3d and 6h windows at 1x budget spend
      - alert: GenAILatencySLOSlowBurn
        expr: |
          (
            slo:genai_latency:error_ratio:rate6h  > (1 * 0.001)
            and
            slo:genai_latency:error_ratio:rate3d  > (1 * 0.001)
          )
        for: 1h
        labels:
          severity: ticket
        annotations:
          summary: 'GenAI latency SLO slow-burn: budget eroding below baseline'

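The multipliers 14.4, 6, and 1 assume a 30-day SLO window and the Workbook’s 2%, 5%, and 10% budget-consumption targets; the factor 0.001 is the error budget of a 99.9% objective. Change the objective and every threshold moves; change the window or the budget fractions and the multipliers move too. Re-derive from your own target with the Workbook’s formula rather than copying these values.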
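A sketch of that derivation in Python, using the Workbook's relation that burn rate equals the fraction of budget consumed times the SLO period divided by the alert's long window. The helper names are hypothetical, and the inputs are the 30-day, 99.9% values used in this article's examples.

```python
def burn_rate(budget_fraction, slo_period_hours, long_window_hours):
    """Burn rate that consumes budget_fraction of the budget in one long window."""
    return budget_fraction * slo_period_hours / long_window_hours


def alert_threshold(rate, slo):
    """Error-ratio threshold used in the alert expression."""
    return rate * (1.0 - slo)


def hours_to_exhaustion(rate, slo_period_hours):
    """Time until the whole budget is gone if the burn rate holds steady."""
    return slo_period_hours / rate


PERIOD_HOURS = 30 * 24  # 30-day SLO window
SLO = 0.999

# (label, long window in hours, fraction of budget consumed before firing)
tiers = [("fast", 1, 0.02), ("medium", 6, 0.05), ("slow", 72, 0.10)]
for label, window_hours, fraction in tiers:
    rate = burn_rate(fraction, PERIOD_HOURS, window_hours)
    print(f"{label:>6}: burn rate {rate:.1f}, "
          f"threshold {alert_threshold(rate, SLO):.4f}, "
          f"budget gone in {hours_to_exhaustion(rate, PERIOD_HOURS):.0f} h")
```

Changing `SLO` alters only the thresholds, while changing `PERIOD_HOURS`, the long windows, or the budget fractions alters the burn-rate multipliers themselves.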

Two layers, one stack

A recurring source of confusion in GenAI platform engineering is the boundary between platform observability and LLM observability. The distinction is worth stating explicitly:

Platform observability answers: is the agent pipeline fast, cheap, and up? It reads gen_ai.* spans in Tempo and derived RED metrics in Prometheus. It owns the collector pipeline, Tempo storage, span-metrics generation, and the latency/throughput/error SLOs.
LLM observability answers: is the agent correct? It reads the same spans but scores them on retrieval relevance, citation faithfulness, and answer quality. It lives downstream of the platform layer and does not duplicate the collection pipeline.

Both layers use the same vocabulary — the GenAI semantic conventions — and the same trace store. The split is ownership, not infrastructure. The platform team sets SLOs on latency and availability; the application team sets quality thresholds on relevance and faithfulness. Neither can substitute for the other.

Adoption checklist

1. Pin a semconv version. Set OTEL_SEMCONV_STABILITY_OPT_IN deliberately. Do not let a library upgrade silently rename attributes under running dashboards.
2. Preserve gen_ai.* keys through the collector. Attributes pass through unchanged unless a processor rewrites them, so check that any redaction or filtering policy leaves the keys you dashboard on intact.
3. Make gen_ai.operation.name a metrics dimension in Tempo’s metrics-generator. One RED stream then separates inference, tool, and agent spans without any re-instrumentation.
4. Chart token spend off span attributes, not RED metrics. Token counts are attribute values: use TraceQL metrics over gen_ai.usage.input_tokens and gen_ai.usage.output_tokens.
5. Use gen_ai.conversation.id as a TraceQL filter for session replay and, where cardinality allows, as a metrics dimension for per-conversation latency. It is the primary correlation handle for multi-turn agents, but every conversation adds a label value, so keep it out of long-retention Prometheus series unless cardinality is bounded.
6. Set objectives from observed distributions, not assumed baselines. Graph the histogram quantiles over a representative period before committing a target. An objective tighter than what the service currently achieves is a backlog item, not a true SLO.
7. Re-derive burn-rate thresholds from your own target. The 14.4 / 6 / 1 multipliers in the examples above assume a 30-day window and the Workbook’s 2% / 5% / 10% budget fractions, and the 0.001 factor assumes a 99.9% objective. Change any of these and the thresholds move.

References

[1] OpenTelemetry — Semantic Conventions for Generative AI Systems (gen_ai.* namespace, stability status, attribute listing). OpenTelemetry Authors, 2024–2026.
[2] OpenTelemetry — Semantic Conventions for GenAI Agent and Framework Spans (invoke_agent, create_agent, gen_ai.conversation.id). OpenTelemetry Authors, 2025–2026.
[3] OpenTelemetry — Semantic Conventions for Generative AI Client Spans (execute_tool, gen_ai.tool.* attributes). OpenTelemetry Authors, 2025–2026.
[4] CNCF — OpenTelemetry Graduation Announcement (11 May 2026). Cloud Native Computing Foundation, 2026.
[5] CNCF — Prometheus Project Page (graduation status August 2018). Cloud Native Computing Foundation.
[6] Grafana Tempo — Metrics-Generator (span metrics, RED derivation, service graph). Grafana Labs.
[7] Google SRE Workbook — Implementing SLOs (SLI/SLO/error-budget model). Beyer, Jones, Petoff, Murphy (eds.), Google, 2018.
[8] Google SRE Workbook — Alerting on SLOs (multi-window burn-rate alerting, constants). Google, 2018.
[9] “Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents” (trajectory success rate, step-count, tool-error metrics as evaluation signals). arXiv:2510.02837, 2025.
[10] OpenTelemetry Collector documentation. OpenTelemetry Authors.
