Observability for GenAI: Prometheus, Grafana, Tempo, and the OpenTelemetry GenAI Conventions

The three-pillar GenAI observability stack: traces in Tempo, metrics in Prometheus, visualised in Grafana.
A production GenAI deployment is not a single service. It is a call graph: an agent turn fans out into one or more model calls, each of which may trigger tool invocations, which themselves reach retrievers, databases, or downstream APIs. Classical application observability — a latency histogram, an error counter, a single trace per HTTP request — loses most of this structure. The token spend lives inside the inference span. The retrieval latency lives inside the tool span. The conversation thread that correlates three turns of a multi-step agent exists nowhere in a plain HTTP trace.
The OpenTelemetry GenAI semantic conventions give that call graph a shared vocabulary. They define which attribute names carry which facts — provider, model, token counts, tool name, conversation ID — so the collector, the trace store, and the dashboard all agree on what a span means without any team inventing a private schema. This article walks through the conventions, the three-pillar stack that collects and renders them, and the SLI/SLO design that turns raw telemetry into actionable reliability targets for both inference and agent workloads.
The three-pillar stack
Prometheus is a CNCF Graduated project (since August 2018) handling time-series metrics with a pull-based scrape model, a rich query language (PromQL), and a histogram primitive that enables correct percentile computation across replicas. Grafana Tempo stores distributed traces in object storage and intentionally does not index span attributes — you query by trace ID, service name, or TraceQL attribute predicates, which keeps storage costs low regardless of attribute cardinality. Grafana provides the unified UI. OpenTelemetry reached CNCF Graduation on 11 May 2026, cementing its position as the vendor-neutral telemetry standard; it is both the SDK that instruments applications and the Collector that fans telemetry out to Prometheus and Tempo.
The seam between spans and metrics is Tempo's metrics-generator. It computes span metrics — request count, error count, and duration histograms — for every unique combination of service, operation name, and span kind, and remote-writes the resulting series into the same Prometheus-compatible backend the rest of the platform already uses. The effect is that RED (Rate, Errors, Duration) metrics for every GenAI operation name are available in Prometheus without any workload having to manually expose a Prometheus endpoint.
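To make the span-to-metric step concrete, here is a minimal sketch of the kind of aggregation described above. It is not Tempo's implementation: the span record shape, the bucket bounds, and the `span_metrics` helper are assumptions chosen for illustration.

```python
from collections import defaultdict

# Illustrative histogram upper bounds in seconds (an assumption, not Tempo's defaults).
BUCKETS = (0.5, 1.0, 5.0, float("inf"))


def span_metrics(spans, extra_dimensions=()):
    """Aggregate spans into RED series keyed by service, span name, and span kind."""
    series = defaultdict(lambda: {"calls": 0, "errors": 0, "buckets": [0] * len(BUCKETS)})
    for span in spans:
        key = (span["service"], span["name"], span["kind"]) + tuple(
            span["attributes"].get(dim) for dim in extra_dimensions
        )
        entry = series[key]
        entry["calls"] += 1                            # Rate
        entry["errors"] += span["status"] == "error"   # Errors
        # Duration: cumulative buckets, so a span counts in every bucket
        # whose upper bound is at least its duration.
        for i, bound in enumerate(BUCKETS):
            if span["duration_s"] <= bound:
                entry["buckets"][i] += 1
    return dict(series)


# Hypothetical spans from one service.
spans = [
    {"service": "agent-api", "name": "chat", "kind": "client", "status": "ok",
     "duration_s": 0.8, "attributes": {"gen_ai.operation.name": "chat"}},
    {"service": "agent-api", "name": "chat", "kind": "client", "status": "error",
     "duration_s": 6.0, "attributes": {"gen_ai.operation.name": "chat"}},
    {"service": "agent-api", "name": "execute_tool search", "kind": "internal", "status": "ok",
     "duration_s": 0.2, "attributes": {"gen_ai.operation.name": "execute_tool"}},
]
red = span_metrics(spans)
print(red[("agent-api", "chat", "client")])
```

Passing `extra_dimensions=("gen_ai.operation.name",)` adds that attribute to the series key, which is the same idea as adding a dimension to the generated metrics.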
The OpenTelemetry GenAI semantic conventions
The GenAI semantic conventions define the gen_ai.* attribute namespace. They cover three span shapes that together describe a GenAI call graph: inference spans (a model call), tool spans (a function or retrieval invocation), and agent spans (a turn or agent lifecycle event). As of mid-2026, client spans and metrics have stabilised; agent span conventions remain in Development status — treat attribute names as a moving target and re-check the spec before hard-coding dashboard queries.
A note on the stability opt-in: instrumentations should not silently change which convention version they emit. Set the environment variable OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental to opt in explicitly to the newest experimental names. Combined with pinned instrumentation versions, this keeps a library upgrade from silently renaming attributes under your dashboards.
Inference spans
The core model-call span has span kind CLIENT for a remote provider and INTERNAL when the model runs in-process. The gen_ai semantic conventions for spans specify the following attributes, among others:
- gen_ai.provider.name: the GenAI provider (newer name; older instrumentations emit gen_ai.system)
- gen_ai.operation.name: operation type: chat, generate_content, or embeddings
- gen_ai.request.model: the model identifier requested
- gen_ai.response.model: the model that actually answered (may differ from the requested model)
- gen_ai.usage.input_tokens: prompt tokens consumed
- gen_ai.usage.output_tokens: completion tokens generated
- gen_ai.response.finish_reasons: array of reasons why generation stopped (e.g. stop, length)
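As an illustration of how these attributes fit together on one span, the sketch below builds the attribute set as a plain dictionary, the shape most SDKs accept when setting span attributes. The helper names and the provider and model values are hypothetical; only the gen_ai.* keys come from the list above.

```python
def inference_span_attributes(provider, operation, request_model, response_model,
                              input_tokens, output_tokens, finish_reasons):
    """Build the gen_ai.* attribute dict for one inference span."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return {
        "gen_ai.provider.name": provider,
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


def provider_of(attributes):
    """Read the provider, falling back to the older gen_ai.system key."""
    return attributes.get("gen_ai.provider.name", attributes.get("gen_ai.system"))


# Hypothetical provider and model names, for illustration only.
attrs = inference_span_attributes(
    provider="example-provider",
    operation="chat",
    request_model="example-model",
    response_model="example-model-2026-01",
    input_tokens=412,
    output_tokens=87,
    finish_reasons=["stop"],
)
print(provider_of(attrs))
print(provider_of({"gen_ai.system": "legacy-provider"}))
```

The `provider_of` fallback is one way a dashboard-side script could tolerate a mix of older and newer instrumentations during a migration.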
Tool invocation spans
A tool call carries gen_ai.operation.name=execute_tool and the following attributes:
- gen_ai.tool.name: name of the tool the agent called (required)
- gen_ai.tool.call.id: identifier correlating a call to its result
- gen_ai.tool.type: tool category: function, extension, or datastore
Agent spans
Agent lifecycle uses gen_ai.operation.name=invoke_agent for an agent turn and create_agent for construction, carrying gen_ai.agent.id, gen_ai.agent.name, and gen_ai.conversation.id for session correlation. The spec distinguishes a CLIENT span (agent is a remote service) from an INTERNAL span (agent runs in-process), but both carry the same attribute set, so a single TraceQL filter on gen_ai.operation.name=invoke_agent captures every agent turn regardless of deployment topology. gen_ai.conversation.id is the thread stitching a multi-turn session together: use it as a TraceQL filter (replay one session's full span tree) and as a query-time grouping key for per-conversation latency. Conversation IDs are unbounded, so keep them out of persistent Prometheus labels, where every new conversation would create a new series. See the GenAI agent span conventions for the full attribute listing.
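The session-stitching role of gen_ai.conversation.id can be sketched in a few lines. The span record shape (a `start_s` timestamp plus an `attributes` dict) and the `turns_by_conversation` helper are assumptions for illustration; a real deployment would run the equivalent filter in TraceQL.

```python
from collections import defaultdict


def turns_by_conversation(spans):
    """Group invoke_agent spans by conversation ID, ordered by start time."""
    sessions = defaultdict(list)
    for span in spans:
        attrs = span["attributes"]
        if attrs.get("gen_ai.operation.name") != "invoke_agent":
            continue  # only agent turns, not the model or tool calls inside them
        conversation = attrs.get("gen_ai.conversation.id")
        if conversation is not None:
            sessions[conversation].append(span)
    for turns in sessions.values():
        turns.sort(key=lambda s: s["start_s"])
    return dict(sessions)


# Hypothetical spans: two turns of one conversation, one turn of another,
# and a chat span that the filter should ignore.
spans = [
    {"start_s": 20.0, "attributes": {"gen_ai.operation.name": "invoke_agent",
                                     "gen_ai.conversation.id": "conv-a"}},
    {"start_s": 5.0, "attributes": {"gen_ai.operation.name": "invoke_agent",
                                    "gen_ai.conversation.id": "conv-a"}},
    {"start_s": 7.0, "attributes": {"gen_ai.operation.name": "invoke_agent",
                                    "gen_ai.conversation.id": "conv-b"}},
    {"start_s": 6.0, "attributes": {"gen_ai.operation.name": "chat",
                                    "gen_ai.conversation.id": "conv-a"}},
]
sessions = turns_by_conversation(spans)
print({cid: len(turns) for cid, turns in sessions.items()})
```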
Collector pipeline: receiving and routing gen_ai spans
Applications export OTLP spans to the OpenTelemetry Collector. The collector's job is to batch spans, apply attribute policies that keep the gen_ai.* keys you actually dashboard on, and forward to the trace backend. A minimal illustrative configuration:
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  # Span attributes pass through the Collector unchanged by default, so the
  # gen_ai.* keys need no action to survive. An insert, update, or upsert
  # action must name a value source (value, from_attribute, or from_context).
  attributes/genai:
    actions:
      # Backfill the newer provider key from the older gen_ai.system name
      # when an instrumentation has not set it.
      - key: gen_ai.provider.name
        action: insert
        from_attribute: gen_ai.system
  batch: {}
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true  # assumption: plaintext in-cluster traffic
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/genai, batch]
      exporters: [otlp/tempo]
Tune to your environment: add tail-sampling processors if you want to keep all error traces and drop a fraction of successful ones, and add a redaction processor if prompt content must not leave the cluster. These are reference patterns, not validated production configurations.
On the Tempo side, configure the metrics-generator to retain gen_ai.operation.name as a dimension so one RED stream separates inference latency from tool latency from agent-turn latency without re-instrumenting anything. Token counts are attribute values, not durations — chart them via TraceQL metrics over the usage attributes rather than from the duration-based span-metrics stream.
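The difference between duration-based metrics and attribute-based token accounting is easiest to see in code. The sketch below sums the usage attributes per model over a batch of spans; the span record shape and the `token_spend_by_model` helper are assumptions, and in practice the same aggregation would be a query against the trace store.

```python
def token_spend_by_model(spans):
    """Sum input and output tokens per model from gen_ai.usage.* attributes."""
    totals = {}
    for span in spans:
        attrs = span["attributes"]
        # Prefer the model that actually answered; fall back to the requested one.
        model = attrs.get("gen_ai.response.model") or attrs.get("gen_ai.request.model")
        if model is None:
            continue  # not an inference span
        bucket = totals.setdefault(model, {"input": 0, "output": 0})
        bucket["input"] += attrs.get("gen_ai.usage.input_tokens", 0)
        bucket["output"] += attrs.get("gen_ai.usage.output_tokens", 0)
    return totals


# Hypothetical spans: two chat calls, one embeddings call, one tool call.
spans = [
    {"attributes": {"gen_ai.request.model": "model-a", "gen_ai.response.model": "model-a",
                    "gen_ai.usage.input_tokens": 100, "gen_ai.usage.output_tokens": 20}},
    {"attributes": {"gen_ai.request.model": "model-a",
                    "gen_ai.usage.input_tokens": 50, "gen_ai.usage.output_tokens": 10}},
    {"attributes": {"gen_ai.request.model": "model-b",
                    "gen_ai.usage.input_tokens": 30}},
    {"attributes": {"gen_ai.tool.name": "search"}},
]
totals = token_spend_by_model(spans)
print(totals)
```

Span duration never enters the calculation, which is why the RED stream alone cannot answer a token-spend question.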
SLI and SLO design for GenAI workloads
An SLI (Service Level Indicator) is a measured ratio of good events to valid events. An SLO (Service Level Objective) is the target that ratio must meet over a window. The error budget is the complement of the SLO — a 99% objective permits 1% of bad events over the window, and consuming that budget is the signal that triggers both automated alerting and engineering action. This model is described in detail in the Google SRE Workbook, Implementing SLOs.
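The arithmetic behind these three terms is small enough to write out. This is a minimal sketch with hypothetical function names and made-up event counts, following the good-over-valid definition above.

```python
def sli(good_events, valid_events):
    """Ratio of good events to valid events."""
    if valid_events <= 0:
        raise ValueError("an SLI needs at least one valid event")
    return good_events / valid_events


def error_budget(slo):
    """Fraction of valid events allowed to be bad: the complement of the SLO."""
    return 1 - slo


def budget_remaining(slo, good_events, valid_events):
    """Share of the error budget still unspent (negative once overspent)."""
    allowed_bad = error_budget(slo) * valid_events
    actual_bad = valid_events - good_events
    return 1 - actual_bad / allowed_bad


# Illustrative numbers: a 99% objective, 100,000 valid requests, 400 of them bad.
print(sli(99_600, 100_000))
print(budget_remaining(0.99, 99_600, 100_000))
```

With these inputs the objective allows about 1,000 bad events, so 400 bad events spend roughly 40% of the budget and leave roughly 60%.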
Inference endpoint SLIs
Four signal families cover what callers experience at an inference endpoint:
- Latency — per-request wall time at p50, p95, p99. Report as histograms (not gauges or summaries) to enable correct aggregation across replicas and percentile recomputation after the fact.
- Availability — fraction of valid requests returning a non-error response. Exclude 4xx (caller errors) from the budget; count 5xx and timeouts as bad events.
- Error rate — fraction of requests that fail or time out. Tracked separately from availability for alerting granularity.
- Saturation — how full the serving queue is relative to actively running requests. A leading indicator: saturation predicts latency and error degradation before users observe it.
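The reason histograms aggregate correctly across replicas while precomputed percentiles do not can be shown with a small sketch. It uses cumulative buckets and linear interpolation inside the target bucket, in the spirit of PromQL's histogram_quantile; the function names and bucket counts are made up for illustration.

```python
def merge_buckets(*replicas):
    """Sum cumulative bucket counts from replicas that share the same bounds."""
    merged = {}
    for buckets in replicas:
        for bound, count in buckets.items():
            merged[bound] = merged.get(bound, 0) + count
    return merged


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets {upper_bound: count}."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total == 0:
        raise ValueError("no observations")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound in bounds:
        count = buckets[bound]
        if count >= rank and count > prev_count:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the open-ended bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Two hypothetical replicas with different latency profiles (seconds).
replica_a = {0.5: 80, 1.0: 95, 2.0: 100, float("inf"): 100}
replica_b = {0.5: 20, 1.0: 60, 2.0: 95, float("inf"): 100}

p95_a = histogram_quantile(0.95, replica_a)
p95_b = histogram_quantile(0.95, replica_b)
p95_fleet = histogram_quantile(0.95, merge_buckets(replica_a, replica_b))
print(p95_a, p95_b, p95_fleet)
```

Averaging the two per-replica estimates gives a different number from the estimate over the merged buckets, and only the merged one reflects the fleet-wide request distribution.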
For streaming LLM endpoints — where tokens arrive incrementally and the user reads the first token long before the last arrives — a single latency number is misleading. Split latency into two separate SLIs:
- Time to first token (TTFT): from request receipt to the first streamed token. Dominated by prompt-prefill cost and queue wait. This is perceived responsiveness: it is what makes a chat interface feel alive or sluggish.
- Inter-token latency (time per output token): the average gap between consecutive streamed tokens during decode. Driven mainly by batch size and decode speed; total decode time then scales with output length. A long prompt with a short answer spends most of its wall time in TTFT, while a short prompt with a long answer spends it in decode, so each pattern needs its own objective.
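Both streaming SLIs fall out of the same raw data: the time the request arrived and the arrival time of each streamed token. A minimal sketch, assuming timestamps in seconds and a hypothetical `streaming_latency` helper:

```python
def streaming_latency(request_received_s, token_times_s):
    """Return (time to first token, mean inter-token latency) in seconds."""
    if not token_times_s:
        raise ValueError("no tokens were streamed")
    ttft = token_times_s[0] - request_received_s
    if len(token_times_s) == 1:
        return ttft, 0.0  # a single token has no inter-token gap
    decode_time = token_times_s[-1] - token_times_s[0]
    inter_token = decode_time / (len(token_times_s) - 1)
    return ttft, inter_token


# Illustrative stream: first token after 0.8 s, then one token every 50 ms.
ttft, itl = streaming_latency(0.0, [0.80, 0.85, 0.90, 0.95, 1.00])
print(round(ttft, 3), round(itl, 3))
```

Recording the two values as separate histograms lets each carry its own objective.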
Agent endpoint SLIs
Agent workloads require additional SLIs beyond what inference endpoints need, because the failure modes are different. A single LLM call either succeeds or fails; an agent trajectory can complete with the wrong answer, use more steps than necessary, or call tools that error without the overall trace indicating a fault. Research on trajectory evaluation — including work on tool-augmented agent benchmarks (see arXiv:2510.02837) — distinguishes efficiency and correctness signals that translate into platform-side SLIs:
- Trajectory success rate: fraction of agent turns that complete without an unhandled error or an abort. Measured from the invoke_agent spans in Tempo. Note: a specific percentage target (e.g. ≥90%) is workload-specific; calibrate against your own observed distribution before committing an objective.
- Tool error rate: fraction of execute_tool spans ending in an error status. Surfaced directly from Tempo's metrics-generator once gen_ai.tool.name is retained as a dimension.
- Retrieval relevance: whether the retriever surface returned contextually relevant results. The literature establishes this as a valid quality signal (contextual relevancy, faithfulness scores), but the exact threshold is workload-specific. This SLI is best emitted by the retrieval layer itself as a custom span attribute and surfaced through Tempo, rather than computed by the platform.
- Step count per turn: the number of tool invocations nested within a single invoke_agent span. An unexpectedly high step count signals loop behaviour or a degenerate plan, and is a useful saturation signal even before errors appear.
Multi-window burn-rate alerting
Alerting directly on a threshold in a single time window is noisy: a short window pages on every transient blip; a long window takes hours to notice a hard outage. The Google SRE Workbook, Alerting on SLOs describes multi-window, multi-burn-rate alerting: pair a fast-burn alert (short + long window, both breaching) for acute outages with a slow-burn alert (longer windows) for gradual erosion. The short window confirms the condition is still active; the long window suppresses flapping.
Burn rate is how fast the error budget is being consumed relative to spending it evenly across the full window. A burn rate of 1.0 exhausts the budget exactly at the window's end; a rate of 14.4 would exhaust a 30-day budget in roughly two days. The Workbook's worked example for a 99.9% objective produces these pairings:
- Page (fast): 1h long window / 5m short window — burn rate 14.4 — consumes ~2% of budget before firing
- Page (medium): 6h long window / 30m short window — burn rate 6 — consumes ~5% of budget before firing
- Ticket (slow): 3d long window / 6h short window — burn rate 1 — consumes ~10% of budget before firing
Expressed as a concrete PromQL alert for the latency SLI, assuming a 99.9% objective (error budget = 0.001) and recording rules that pre-compute the error ratio:
groups:
  - name: genai_slo_alerts
    rules:
      # Fast-burn page: both 5m and 1h windows exceed 14.4x budget spend
      - alert: GenAILatencySLOFastBurn
        expr: |
          (
            slo:genai_latency:error_ratio:rate5m > (14.4 * 0.001)
            and
            slo:genai_latency:error_ratio:rate1h > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: page
        annotations:
          summary: 'GenAI latency SLO fast-burn: exhausting 30-day budget in ~2 days'
      # Medium-burn page: 6h and 30m windows both exceed 6x budget spend
      - alert: GenAILatencySLOMediumBurn
        expr: |
          (
            slo:genai_latency:error_ratio:rate30m > (6 * 0.001)
            and
            slo:genai_latency:error_ratio:rate6h > (6 * 0.001)
          )
        for: 15m
        labels:
          severity: page
        annotations:
          summary: 'GenAI latency SLO medium-burn: gradual erosion detected'
      # Slow-burn ticket: 3d and 6h windows at 1x budget spend
      - alert: GenAILatencySLOSlowBurn
        expr: |
          (
            slo:genai_latency:error_ratio:rate6h > (1 * 0.001)
            and
            slo:genai_latency:error_ratio:rate3d > (1 * 0.001)
          )
        for: 1h
        labels:
          severity: ticket
        annotations:
          summary: 'GenAI latency SLO slow-burn: budget eroding below baseline'
The multipliers 14.4, 6, and 1 follow from a 30-day SLO period and the budget-consumption percentages listed above; 0.001 is the error budget of a 99.9% objective. Change the objective and every threshold moves; change the period or the tolerated budget consumption and the multipliers move as well. Re-derive from your own target with the Workbook's formula rather than copying these values.
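The re-derivation is a one-line formula: burn rate equals the fraction of budget you tolerate losing before the alert fires, multiplied by the SLO period and divided by the long alert window. The sketch below reproduces the three pairings and the matching error-ratio thresholds; the function names are hypothetical and the 30-day period is the assumption stated in the text.

```python
SLO_PERIOD_H = 30 * 24  # 30-day SLO period, in hours


def burn_rate(budget_fraction, long_window_h, slo_period_h=SLO_PERIOD_H):
    """Burn rate that consumes budget_fraction of the budget within long_window_h."""
    return budget_fraction * slo_period_h / long_window_h


def alert_threshold(slo, budget_fraction, long_window_h, slo_period_h=SLO_PERIOD_H):
    """Error-ratio threshold: burn rate multiplied by the error budget."""
    return burn_rate(budget_fraction, long_window_h, slo_period_h) * (1 - slo)


def fires(short_ratio, long_ratio, threshold):
    """Multi-window condition: both windows must exceed the threshold."""
    return short_ratio > threshold and long_ratio > threshold


# The three pairings from the text: (budget consumed, long window in hours).
for fraction, window_h in [(0.02, 1), (0.05, 6), (0.10, 72)]:
    print(window_h, round(burn_rate(fraction, window_h), 2),
          round(alert_threshold(0.999, fraction, window_h), 5))

# Days until a 30-day budget is gone at the fast-burn rate.
print(round(30 / burn_rate(0.02, 1), 2))
```

Calling `alert_threshold` with a different objective, for example 0.99, leaves the multipliers unchanged and scales every threshold by the larger error budget.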
Two layers, one stack
A recurring source of confusion in GenAI platform engineering is the boundary between platform observability and LLM observability. The distinction is worth stating explicitly:
- Platform observability answers: is the agent pipeline fast, cheap, and up? It reads gen_ai.* spans in Tempo and derived RED metrics in Prometheus. It owns the collector pipeline, Tempo storage, span-metrics generation, and the latency/throughput/error SLOs.
- LLM observability answers: is the agent correct? It reads the same spans but scores them on retrieval relevance, citation faithfulness, and answer quality. It lives downstream of the platform layer and does not duplicate the collection pipeline.
Both layers use the same vocabulary — the GenAI semantic conventions — and the same trace store. The split is ownership, not infrastructure. The platform team sets SLOs on latency and availability; the application team sets quality thresholds on relevance and faithfulness. Neither can substitute for the other.
Adoption checklist
- Pin a semconv version. Set OTEL_SEMCONV_STABILITY_OPT_IN deliberately. Do not let a library upgrade silently rename attributes under running dashboards.
- Preserve gen_ai.* keys through the collector. Attributes pass through unchanged by default, so check that no redaction or attribute-filtering processor drops the keys you dashboard on.
- Make gen_ai.operation.name a metrics dimension in Tempo's metrics-generator. One RED stream then separates inference, tool, and agent spans without any re-instrumentation.
- Chart token spend off span attributes, not RED metrics. Token counts are attribute values — use TraceQL metrics over gen_ai.usage.input_tokens and gen_ai.usage.output_tokens.
- Use gen_ai.conversation.id as a TraceQL filter for session replay and as a query-time grouping key for per-conversation latency. It is the primary correlation handle for multi-turn agents, but its values are unbounded, so avoid making it a persistent Prometheus label.
- Set objectives from observed distributions, not assumed baselines. Graph the histogram quantiles over a representative period before committing a target. An objective tighter than current p95 is a backlog item, not a true SLO.
- Re-derive burn-rate thresholds from your own target. The examples above assume a 99.9% objective (the 0.001 factor) over a 30-day period (the 14.4 / 6 / 1 multipliers); change either and the thresholds move.
References
- OpenTelemetry — Semantic Conventions for Generative AI Systems (gen_ai.* namespace, stability status, attribute listing). OpenTelemetry Authors, 2024–2026. https://opentelemetry.io/docs/specs/semconv/gen-ai/
- OpenTelemetry — Semantic Conventions for GenAI Agent and Framework Spans (invoke_agent, create_agent, gen_ai.conversation.id). OpenTelemetry Authors, 2025–2026. https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
- OpenTelemetry — Semantic Conventions for Generative AI Client Spans (execute_tool, gen_ai.tool.* attributes). OpenTelemetry Authors, 2025–2026. https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
- CNCF — OpenTelemetry Graduation Announcement (11 May 2026). Cloud Native Computing Foundation, 2026.
- CNCF — Prometheus Project Page (graduation status August 2018). Cloud Native Computing Foundation. https://www.cncf.io/projects/prometheus/
- Grafana Tempo — Metrics-Generator (span metrics, RED derivation, service graph). Grafana Labs. https://grafana.com/docs/tempo/latest/metrics-from-traces/metrics-generator/
- Google SRE Workbook — Implementing SLOs (SLI/SLO/error-budget model). Beyer, Jones, Petoff, Murphy (eds.), Google, 2018. https://sre.google/workbook/implementing-slos/
- Google SRE Workbook — Alerting on SLOs (multi-window burn-rate alerting, constants). Google, 2018. https://sre.google/workbook/alerting-on-slos/
- "Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents" (trajectory success rate, step-count, tool-error metrics as evaluation signals). arXiv:2510.02837, 2025. https://arxiv.org/pdf/2510.02837
- OpenTelemetry Collector documentation. OpenTelemetry Authors. https://opentelemetry.io/docs/collector/
About the Author

asleekgeek
Senior Developer, Architect, DevOps
Owner and main author "ASleekGeek website" #husband #father #software-developer #geek #reader-of-all-things #food-lover #mufc-fan #aspiring-guitarist