TraceQL Deep Dive: A Practical Guide to Error Filtering, P99, and Mimir Cross-Signal Queries in Grafana Tempo 2.x
A 12% spike in the payment-service error rate, P99 at 4.3 seconds: in this scenario, pinpointing the root cause trace took just 8 seconds. Being able to identify "which request in which service was the problem" within seconds during a distributed systems incident is what Grafana Tempo 2.x and TraceQL unlock. Just as SQL queries relational data, TraceQL is a dedicated language for precisely querying distributed trace data at the span level. If Prometheus's PromQL is a language for aggregating time-series metrics (the flow of numbers), TraceQL is a language for exploring the full execution path of individual requests; the two languages cover different dimensions of observability. The TraceQL Metrics feature lets you extract PromQL-like aggregations from traces on the fly, bridging both worlds in a single tool.
This post systematically examines three core capabilities of TraceQL — error span filtering, per-service P99 latency analysis, and cross-signal queries with Mimir metrics — with practical examples. Beyond a simple syntax introduction, it includes query patterns and operational know-how you can reach for immediately during real on-call situations.
Prerequisites for this post: This assumes Grafana Tempo is already deployed and your application is instrumented with the OpenTelemetry SDK. If Tempo isn't installed yet, refer to the Grafana Tempo official installation guide first.
Core Concepts
What is TraceQL
TraceQL is a dedicated query language designed for querying distributed tracing data in Grafana Tempo. In the same vein as LogQL for logs and PromQL for metrics, it serves as the purpose-built query language for traces.
Key difference from PromQL: PromQL asks "what was the error rate over the last 5 minutes?", while TraceQL asks "which requests were errors and how long did each span take?" Using the TraceQL Metrics feature, you can extract PromQL-like aggregations from traces on the fly, letting you check P99 or error rates in real time without prior instrumentation.
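To make the contrast concrete, here is an illustrative pair of queries asking those two questions; the PromQL metric name and label are hypothetical stand-ins for whatever your instrumentation exposes:

```
# PromQL: aggregate question, answered from pre-recorded time series
sum(rate(http_server_requests_total{status=~"5.."}[5m]))

# TraceQL: per-request question, answered from the spans themselves
{ status = error && duration > 500ms }
```

The first returns a single number per interval; the second returns the actual traces you can open and inspect.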
The basic syntax structure is as follows:
```
{ <span-filter> } | <pipeline-function>
```

The curly braces `{}` are the span selector; the conditions inside are applied to each span and return matching spans. After the pipe `|`, you can chain aggregation functions like `rate()` and `quantile_over_time()`.
Key Field Namespaces
| Namespace | Description | Example |
|---|---|---|
| `span.*` | Span attributes | `span.http.status_code` |
| `resource.*` | Resource attributes (service metadata) | `resource.service.name` |
| `status` | Span status | `error`, `ok`, `unset` |
| `duration` | Span elapsed time | `duration > 500ms` |
| `name` | Span name | `name = "POST /api/orders"` |

OpenTelemetry Semantic Conventions: Field names such as `span.http.request.method` and `resource.service.name` follow OpenTelemetry standard attribute names directly. Applications instrumented with the OTEL SDK can use them without any additional mapping.
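As a sketch of how the namespaces combine in a single selector (the service name, method, and threshold below are hypothetical):

```
{ resource.service.name = "checkout"
  && span.http.request.method = "POST"
  && duration > 300ms }
```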
Structural Operators (Tempo 2.2+)
Starting with Tempo 2.2, structural operators were introduced for filtering based on hierarchical relationships between spans within a trace.
| Operator | Meaning |
|---|---|
| `>>` (descendant) | Finds span B among A's descendant spans |
| `>` (child) | Span B as A's direct child |
| `<<` (ancestor) | Finds span B among A's ancestor spans |
| `<` (parent) | Span B as A's direct parent |
| `~` (sibling) | Span B at the same level as A |
These operators are extremely useful when tracing "which layer contains the root cause of a given error" directly through the distributed call tree structure.
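For instance, a sketch using the direct-child operator `>`, which, unlike `>>`, does not match deeper descendants (the `api-gateway` service name is illustrative):

```
{ resource.service.name = "api-gateway" }
  > { resource.service.name = "payment-service" && status = error }
```

This matches only when the gateway calls payment-service directly, not through an intermediate service.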
TraceQL Metrics Evolution by Version
| Version | Released | Key TraceQL Features |
|---|---|---|
| 2.2 | 2023.08 | Structural operators (>>, >, ~, <<, <) |
| 2.4 | 2024.02 | Experimental introduction of TraceQL Metrics (rate()) |
| 2.5 | 2024.06 | Added quantile_over_time(), histogram_over_time(), gRPC streaming |
| 2.6 | 2024.09 | Native histogram, TraceQL Metrics Exemplars, instant queries |
| 2.7 | 2025.01 | New metrics functions such as sum_over_time() |
Practical Application
The five examples that follow progressively deepen a single scenario: the payment-service error rate spiking 12%, P99 at 4.3 seconds. We'll walk through the entire process from the moment you receive an on-call alert, to identifying the root cause, to building long-term alert rules, all with TraceQL.
Error Span Filtering: Immediately Identify SLO-Violating Traces
Right after receiving an on-call alert, the very first thing to do is quickly pinpoint the traces where errors are occurring.
Step 1 — Basic service error filter:

```
{ resource.service.name = "payment-service" && status = error }
```

You can also filter by HTTP status code. `status = error` checks the OTEL span status, while `span.http.status_code >= 400` checks the HTTP protocol code — they verify different layers.

```
{ span.http.status_code >= 400 }
```

Step 2 — Compound query applying SLO conditions simultaneously:

```
{ resource.service.name = "payment-service"
  && status = error
  && duration > 2s
  && span.http.route = "/api/v1/orders" }
```

Output of this query: Individual traces satisfying the conditions are listed in the trace list view in Grafana Explore. Clicking each trace lets you inspect the full span hierarchy and attributes.
| Condition | Meaning |
|---|---|
| `status = error` | Only spans with OTEL span status Error |
| `duration > 2s` | Exceeds SLO threshold (2 seconds) |
| `span.http.route` | Scoped to a specific endpoint |
Tracing Error Propagation: Finding Root Causes with Structural Operators
An error occurred in payment-service, but the actual cause may lie in a downstream dependency. Use structural operators to trace the error propagation path.
```
{ resource.service.name = "payment-service" && status = error }
  >> { span.db.system = "postgresql" }
```

How the `>>` operator works: Within a trace that has a span matching the left condition as an ancestor, it returns descendant spans matching the right condition. It lets you explore in one shot whether "a PostgreSQL span exists beneath a payment-service error."
More broadly, you can use the same approach to check whether an error in the frontend originates from payment-service.
```
{ resource.service.name = "frontend" && status = error }
  >> { resource.service.name = "payment-service" }
```

Per-Service P99 Latency Analysis: Identifying Bottleneck Services Without Instrumentation
Once you've narrowed down the source of the error, quantitatively understand the latency distribution. Using quantile_over_time(), available since Tempo 2.5, you can extract P99 in real time without any additional instrumentation.
At-a-glance P99 latency across all services:
```
{ resource.service.name =~ ".+" }
| quantile_over_time(duration, 1m, 0.99)
  by (resource.service.name)
```

`.+` matches any non-empty string. It targets all services that have a `service.name` attribute.
Focused P99 latency check for payment-service:
```
{ resource.service.name = "payment-service" }
| quantile_over_time(duration, 1m, 0.99)
```

Output of this query: Rendered as a time-series graph in Grafana Explore. The X-axis is time and the Y-axis is latency (seconds), letting you visually inspect P99 trends.

Note on multi-percentile queries: The syntax for specifying multiple percentiles at once in the form `quantile_over_time(duration, 5m, 0.50, 0.90, 0.99)` may vary in support depending on the Tempo version. It is recommended to check the signature in the official function reference for your Tempo version first.
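For intuition about what a percentile of span durations means, here is a minimal Python sketch of a nearest-rank quantile over one window. Tempo itself estimates quantiles from histogram buckets, so its results are approximations, and the sample durations below are made up:

```python
import math

def quantile(samples, q):
    """Nearest-rank quantile: the smallest sample such that at least
    q * 100% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

# Span durations (ms) observed in one 1m window; two slow outliers.
durations_ms = [120, 95, 4300, 210, 180, 150, 3900, 130, 160, 140]

print(quantile(durations_ms, 0.50))  # 150
print(quantile(durations_ms, 0.99))  # 4300
```

The takeaway: P99 is dominated by the slowest handful of requests, which is exactly why averages hide the incidents that TraceQL surfaces.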
Checking the latency distribution for error spans only:
```
{ status = error }
| histogram_over_time(duration, 1m)
  by (resource.service.name)
```

Tracking error rate by service:

```
{ status = error }
| rate()
  by (resource.service.name, span.http.route)
```

TraceQL Metrics time range limitation: TraceQL Metrics queries support a maximum range of 24 hours. For long-term trend analysis or alert rule configuration, you'll need the metrics-generator + Mimir combination described next.
Mimir Cross-Signal: Bidirectional Trace-Metric Drill-Down
The LGTM stack refers to Grafana's open-source observability stack combining Loki (logs), Grafana (visualization), Tempo (traces), and Mimir (metrics). Because each signal is organically connected, the flow from detecting a metric anomaly → drilling down into the corresponding trace → checking related logs forms one seamless workflow.
The key component enabling this connection is metrics-generator. It is an internal Tempo component that continuously converts collected trace streams into PromQL-compatible metrics in real time and remote-writes them to Mimir. This component must be enabled to use all three cross-signal methods below.
Method 1 — Exemplars: Drill down from metrics → traces
metrics-generator automatically generates Exemplars, linking a traceID to Mimir metric data points. Add the following to your Grafana datasource configuration:
```yaml
# Grafana datasource configuration (Mimir/Prometheus)
exemplarTraceIdDestinations:
  - name: traceID
    datasourceUid: tempo_uid
```

Note on configuration field names: Depending on your Grafana version and configuration method (YAML, UI, or JSON provisioning), the `exemplarTraceIdDestinations` (camelCase) form is used. It is recommended to verify against the official documentation for the Grafana version you are using.
After configuration, clicking a latency spike point in the Mimir dashboard will automatically open the corresponding trace in Tempo.
Method 2 — Trace to Metrics: Jump from traces → metrics
With the "Trace to metrics" link configuration in the Tempo datasource, you can navigate directly from a specific trace view to the related Mimir metrics.
```yaml
# Grafana Tempo datasource configuration
traceToMetrics:
  datasourceUid: mimir_uid
  queries:
    - name: "Request Rate"
      query: 'rate(traces_spanmetrics_calls_total{service_name="${__span.resource.service.name}"}[5m])'
    - name: "P99 Latency"
      query: 'histogram_quantile(0.99, rate(traces_spanmetrics_duration_seconds_bucket{service_name="${__span.resource.service.name}"}[5m]))'
```

Note on variable syntax: The `${__span.resource.service.name}` variable may be written as `${__span.tags.service.name}` depending on your Grafana version. Verify the exact syntax in the official Trace to Metrics documentation for your Grafana version.
Method 3 — metrics-generator: Store traces as long-term metrics
```yaml
# tempo.yaml
metrics_generator:
  processor:
    service_graphs:
      enabled: true
    span_metrics:
      enabled: true
  storage:
    remote_write:
      - url: http://mimir:9009/api/v1/push
        # For production-required settings like TLS and auth, refer to the official docs:
        # https://grafana.com/docs/tempo/latest/configuration/#metrics_generator
```

Key metrics generated:
| Metric Name | Description |
|---|---|
| `traces_spanmetrics_calls_total` | Call count |
| `traces_spanmetrics_duration_seconds_bucket` | Latency histogram |
| `traces_service_graph_request_total` | Request count between services |
| `traces_service_graph_request_failed_total` | Failure count between services |
Querying payment-service P99 in Mimir with PromQL:
```
histogram_quantile(0.99,
  rate(traces_spanmetrics_duration_seconds_bucket{
    service_name="payment-service"
  }[5m])
)
```

Note on label names: Actual label keys such as `service_name` may differ depending on your metrics-generator configuration and environment. It is recommended to first query the `traces_spanmetrics_calls_total` metric in Grafana Explore to confirm the actual label keys before writing your query.
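Because these span metrics live in Mimir, they can back alert rules with no 24-hour range limit. Here is a hedged sketch of a Prometheus-style alerting rule; the group name, threshold, and `service_name` label are illustrative and depend on your metrics-generator configuration:

```yaml
groups:
  - name: payment-slo  # hypothetical rule group
    rules:
      - alert: PaymentServiceP99High
        # P99 derived from Tempo's span-metrics histogram stored in Mimir
        expr: |
          histogram_quantile(0.99,
            rate(traces_spanmetrics_duration_seconds_bucket{service_name="payment-service"}[5m])
          ) > 2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "payment-service P99 latency above the 2s SLO"
```

The `for: 5m` clause keeps a single slow scrape interval from paging anyone; the alert fires only when the P99 stays above the threshold.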
Ad-hoc RED Metrics Analysis: Your Entire Service Landscape at a Glance
Once an incident is resolved, if you want to understand the state of all your services, a single line is enough.
This is the core pattern for TraceQL ad-hoc analysis.
```
{ resource.service.name =~ ".+" }
| rate() by (resource.service.name, status)
```

Output of this query: You can view the request rate per service and status (ok/error/unset) as a time-series graph. The moment the error line for payment-service spiked is immediately visible.
With this one line, you can check the Rate and Error status of all services in real time with no additional instrumentation code. Adding quantile_over_time() completes all three RED metrics, including Duration.
RED Metrics: An observability methodology that uses Rate, Errors, and Duration as the three key service health indicators. TraceQL's `rate()` and `quantile_over_time()` let you extract all three without any additional instrumentation.
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Ad-hoc flexibility | Real-time aggregation on arbitrary attributes without prior instrumentation |
| Rich context | Per-request drill-down impossible with metrics alone |
| Structural queries | Explore complex distributed call patterns with span hierarchy-based filtering |
| Full OTEL compatibility | Use OpenTelemetry semantic convention attributes directly |
| Cross-signal | Seamless switching between Mimir ↔ Tempo via Exemplars |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Time range limitation | TraceQL Metrics max 24 hours | Long-term retention via metrics-generator → Mimir |
| No alerting | TraceQL Metrics not supported as a Grafana Managed Alerts source | Write alert rules using traces_spanmetrics_* metrics |
| Query cost | Complex queries on large-scale traces are resource-intensive | Limit time ranges, optimize attribute indexing |
| Sampling impact | Sampling can reduce representativeness of error span filter results | Configure tail-based sampling for 100% error span collection |
| vParquet4 migration | Event/link queries only supported on vParquet4 blocks | Consider upgrading to Tempo 2.6+ and regenerating blocks |
| Cardinality explosion | Using high-cardinality attributes (e.g., `user_id`) in `by()` causes excessive memory use | Use only low-cardinality attributes (e.g., `service.name`) in `by()` |
Cardinality: The number of unique values a given label can hold. Using attributes with millions of unique values, like `user_id`, as aggregation keys can cause system memory to spike dramatically.
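The explosion is multiplicative: the number of distinct series a `by()` clause produces is roughly the product of the cardinalities of its grouping keys. A minimal Python sketch with made-up cardinalities:

```python
def series_count(*cardinalities: int) -> int:
    """Approximate distinct series produced by a by() clause:
    the product of the cardinalities of its grouping keys."""
    total = 1
    for c in cardinalities:
        total *= c
    return total

services = 20          # distinct resource.service.name values
routes = 50            # distinct span.http.route values
user_ids = 1_000_000   # distinct span.user_id values (high cardinality)

print(series_count(services, routes))            # 1000
print(series_count(services, routes, user_ids))  # 1000000000
```

Adding a single million-value key turns a thousand series into a billion, which is why `by(span.user_id)` can take down the query path.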
vParquet4: The storage format for Grafana Tempo. It became the default starting with Tempo 2.6 and adds additional support for event, link, and array columns.
Most Common Mistakes in Practice
- Attempting to configure alerts with TraceQL Metrics: TraceQL Metrics is for exploration, not for use as an alert source. It is recommended to always write alerts based on `traces_spanmetrics_*` metrics stored in Mimir by metrics-generator.
- Using high-cardinality attributes in the `by()` clause: Using attributes with millions of unique values as aggregation keys, like `by(span.user_id)`, can lead to a memory explosion. It is safe to use attributes with tens to hundreds of unique values, such as `resource.service.name` and `span.http.route`.
- Attempting TraceQL Metrics queries beyond 24 hours: Queries that exceed the range will fail or have their results truncated. For cases requiring long-term analysis, it is better to design with PromQL queries against Mimir from the start.
Closing Thoughts
What we covered today: Through a payment-service incident response scenario, we explored error span filtering, root cause tracing with structural operators, real-time P99 extraction with TraceQL Metrics, long-term metric integration with Mimir via metrics-generator, and the Exemplar-based bidirectional metric-trace drill-down workflow. TraceQL bridges individual request context and time-series aggregate analysis in a single tool, through the paradigm shift of "extracting metrics from traces."
Three steps you can start with right now:
- Run a basic error filter in Grafana Explore: Select the Tempo datasource, type `{ status = error }`, and add `| rate() by (resource.service.name)` to visualize the error trend by service. (If the Tempo datasource isn't connected yet, refer to the official datasource configuration guide.)
- Enable metrics-generator: Adding `metrics_generator.processor.span_metrics.enabled: true` and the Mimir `remote_write` configuration to `tempo.yaml` will enable you to write P99/error rate alert rules in PromQL going forward.
- Configure Exemplar integration: Adding `exemplarTraceIdDestinations` to your Mimir/Prometheus datasource settings in Grafana gives you the workflow of jumping to the root cause trace with a single click from a latency spike point.
Next in the Series
We'll cover how to achieve both 100% error span collection and cost optimization simultaneously by configuring tail-based sampling in Grafana Tempo.
References
- TraceQL Official Documentation | Grafana Tempo
- TraceQL Query Construction Guide | Grafana Tempo
- TraceQL Metrics Function Reference | Grafana Tempo
- TraceQL Metrics Queries | Grafana Tempo
- Metrics from Traces | Grafana Tempo
- Grafana Tempo 2.4 Release: Introducing TraceQL Metrics | Grafana Blog
- Grafana Tempo 2.5 Release: vParquet4, Streaming, quantile_over_time | Grafana Blog
- Grafana Tempo 2.6 Release | Grafana Blog
- Grafana Tempo 2.7 Release | Grafana Blog
- Tempo 2.2: Introducing Structural Operators | Grafana Blog
- Configure Trace to Metrics | Grafana Official Documentation
- Solving Problems with TraceQL | Grafana Tempo
- Traces to Metrics Ad-hoc Queries | Grafana Blog
- Advanced TraceQL Tutorial | Giant Swarm
- Service Graph Metrics Analysis | Grafana Tempo