TraceQL Deep Dive: A Practical Guide to Error Filtering, P99, and Mimir Cross-Signal Queries in Grafana Tempo 2.x
A 12% spike in the payment-service error rate, P99 at 4.3 seconds: in this scenario, pinpointing the root cause trace took just 8 seconds. Being able to identify "which request in which service was the problem" within seconds during a distributed systems incident is what Grafana Tempo 2.x and TraceQL unlock. Just as SQL queries relational data, TraceQL is a dedicated language for precisely querying distributed trace data at the span level. If Prometheus's PromQL is a language for aggregating time-series metrics (the flow of numbers), TraceQL is a language for exploring the full execution path of individual requests; the two languages cover different dimensions of observability. The TraceQL Metrics feature lets you extract PromQL-like aggregations from traces on the fly, bridging both worlds in a single tool.
This post systematically examines three core capabilities of TraceQL — error span filtering, per-service P99 latency analysis, and cross-signal queries with Mimir metrics — with practical examples. Beyond a simple syntax introduction, it includes query patterns and operational know-how you can reach for immediately during real on-call situations.
Prerequisites for this post: This assumes Grafana Tempo is already deployed and your application is instrumented with the OpenTelemetry SDK. If Tempo isn't installed yet, refer to the Grafana Tempo official installation guide first.
Core Concepts
What is TraceQL
TraceQL is a dedicated query language designed for querying distributed tracing data in Grafana Tempo. In the same vein as LogQL for logs and PromQL for metrics, it serves as the purpose-built query language for traces.
Key difference from PromQL: PromQL asks "what was the error rate over the last 5 minutes?", while TraceQL asks "which requests were errors and how long did each span take?" Using the TraceQL Metrics feature, you can extract PromQL-like aggregations from traces on the fly, letting you check P99 or error rates in real time without prior instrumentation.
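To make the contrast concrete, here is an illustrative pair of queries asking those two questions; the PromQL metric name and label are hypothetical stand-ins for whatever your instrumentation exposes:

```
# PromQL: aggregate question, answered from pre-recorded time series
sum(rate(http_server_requests_total{status=~"5.."}[5m]))

# TraceQL: per-request question, answered from the spans themselves
{ status = error && duration > 500ms }
```

The first returns a single number per interval; the second returns the actual traces you can open and inspect.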
The basic syntax structure is as follows:
```
{ <span-filter> } | <pipeline-function>
```

The curly braces `{}` are the span selector; the conditions inside are applied to each span and return matching spans. After the pipe `|`, you can chain aggregation functions like `rate()` and `quantile_over_time()`.
Key Field Namespaces
| Namespace | Description | Example |
|---|---|---|
| `span.*` | Span attributes | `span.http.status_code` |
| `resource.*` | Resource attributes (service metadata) | `resource.service.name` |
| `status` | Span status | `error`, `ok`, `unset` |
| `duration` | Span elapsed time | `duration > 500ms` |
| `name` | Span name | `name = "POST /api/orders"` |

OpenTelemetry Semantic Conventions: Field names such as `span.http.request.method` and `resource.service.name` follow OpenTelemetry standard attribute names directly. Applications instrumented with the OTEL SDK can use them without any additional mapping.
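As a sketch of how the namespaces combine in a single selector (the service name, method, and threshold below are hypothetical):

```
{ resource.service.name = "checkout"
  && span.http.request.method = "POST"
  && duration > 300ms }
```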
Structural Operators (Tempo 2.2+)
Starting with Tempo 2.2, structural operators were introduced for filtering based on hierarchical relationships between spans within a trace.
| Operator | Meaning |
|---|---|
| `>>` (descendant) | Finds span B among A's descendant spans |
| `>` (child) | Span B as A's direct child |
| `<<` (ancestor) | Finds span B among A's ancestor spans |
| `<` (parent) | Span B as A's direct parent |
| `~` (sibling) | Span B at the same level as A |
These operators are extremely useful when tracing "which layer contains the root cause of a given error" directly through the distributed call tree structure.
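For instance, a sketch using the direct-child operator `>`, which, unlike `>>`, does not match deeper descendants (the `api-gateway` service name is illustrative):

```
{ resource.service.name = "api-gateway" }
  > { resource.service.name = "payment-service" && status = error }
```

This matches only when the gateway calls payment-service directly, not through an intermediate service.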
TraceQL Metrics Evolution by Version
| Version | Released | Key TraceQL Features |
|---|---|---|
| 2.2 | 2023.08 | Structural operators (>>, >, ~, <<, <) |
| 2.4 | 2024.02 | Experimental introduction of TraceQL Metrics (rate()) |
| 2.5 | 2024.06 | Added quantile_over_time(), histogram_over_time(), gRPC streaming |
| 2.6 | 2024.09 | Native histogram, TraceQL Metrics Exemplars, instant queries |
| 2.7 | 2025.01 | New metrics functions such as sum_over_time() |
Practical Application
The five examples that follow progressively deepen a single scenario: the payment-service error rate spiking 12%, P99 at 4.3 seconds. We'll walk through the entire process from the moment you receive an on-call alert, to identifying the root cause, to building long-term alert rules, all with TraceQL.
Error Span Filtering: Immediately Identify SLO-Violating Traces
Right after receiving an on-call alert, the very first thing to do is quickly pinpoint the traces where errors are occurring.
Step 1 — Basic service error filter:

```
{ resource.service.name = "payment-service" && status = error }
```

You can also filter by HTTP status code. `status = error` checks the OTEL span status, while `span.http.status_code >= 400` checks the HTTP protocol code — they verify different layers.

```
{ span.http.status_code >= 400 }
```

Step 2 — Compound query applying SLO conditions simultaneously:

```
{ resource.service.name = "payment-service"
  && status = error
  && duration > 2s
  && span.http.route = "/api/v1/orders" }
```

Output of this query: Individual traces satisfying the conditions are listed in the trace list view in Grafana Explore. Clicking each trace lets you inspect the full span hierarchy and attributes.
| Condition | Meaning |
|---|---|
| `status = error` | Only spans with OTEL span status Error |
| `duration > 2s` | Exceeds SLO threshold (2 seconds) |
| `span.http.route` | Scoped to a specific endpoint |
Tracing Error Propagation: Finding Root Causes with Structural Operators
An error occurred in payment-service, but the actual cause may lie in a downstream dependency. Use structural operators to trace the error propagation path.
```
{ resource.service.name = "payment-service" && status = error }
  >> { span.db.system = "postgresql" }
```

How the `>>` operator works: Within a trace that has a span matching the left condition as an ancestor, it returns descendant spans matching the right condition. It lets you explore in one shot whether "a PostgreSQL span exists beneath a payment-service error."
More broadly, you can use the same approach to check whether an error in the frontend originates from payment-service.
```
{ resource.service.name = "frontend" && status = error }
  >> { resource.service.name = "payment-service" }
```

Per-Service P99 Latency Analysis: Identifying Bottleneck Services Without Instrumentation
Once you've narrowed down the source of the error, quantitatively understand the latency distribution. Using quantile_over_time(), available since Tempo 2.5, you can extract P99 in real time without any additional instrumentation.
At-a-glance P99 latency across all services:
```
{ resource.service.name =~ ".+" }
| quantile_over_time(duration, 1m, 0.99)
  by (resource.service.name)
```

`.+` matches any non-empty string. It targets all services that have a `service.name` attribute.
Focused P99 latency check for payment-service:
```
{ resource.service.name = "payment-service" }
| quantile_over_time(duration, 1m, 0.99)
```

Output of this query: Rendered as a time-series graph in Grafana Explore. The X-axis is time and the Y-axis is latency (seconds), letting you visually inspect P99 trends.

Note on multi-percentile queries: The syntax for specifying multiple percentiles at once in the form `quantile_over_time(duration, 5m, 0.50, 0.90, 0.99)` may vary in support depending on the Tempo version. It is recommended to check the signature in the official function reference for your Tempo version first.
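For intuition about what a percentile of span durations means, here is a minimal Python sketch of a nearest-rank quantile over one window. Tempo itself estimates quantiles from histogram buckets, so its results are approximations, and the sample durations below are made up:

```python
import math

def quantile(samples, q):
    """Nearest-rank quantile: the smallest sample such that at least
    q * 100% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

# Span durations (ms) observed in one 1m window; two slow outliers.
durations_ms = [120, 95, 4300, 210, 180, 150, 3900, 130, 160, 140]

print(quantile(durations_ms, 0.50))  # 150
print(quantile(durations_ms, 0.99))  # 4300
```

The takeaway: P99 is dominated by the slowest handful of requests, which is exactly why averages hide the incidents that TraceQL surfaces.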
Checking the latency distribution for error spans only:
```
{ status = error }
| histogram_over_time(duration, 1m)
  by (resource.service.name)
```

Tracking error rate by service:

```
{ status = error }
| rate()
  by (resource.service.name, span.http.route)
```

TraceQL Metrics time range limitation: TraceQL Metrics queries support a maximum range of 24 hours. For long-term trend analysis or alert rule configuration, you'll need the metrics-generator + Mimir combination described next.
Mimir Cross-Signal: Bidirectional Trace-Metric Drill-Down
The LGTM stack refers to Grafana's open-source observability stack combining Loki (logs), Grafana (visualization), Tempo (traces), and Mimir (metrics). Because each signal is organically connected, the flow from detecting a metric anomaly → drilling down into the corresponding trace → checking related logs forms one seamless workflow.
The key component enabling this connection is metrics-generator. It is an internal Tempo component that continuously converts collected trace streams into PromQL-compatible metrics in real time and remote-writes them to Mimir. This component must be enabled to use all three cross-signal methods below.
Method 1 — Exemplars: Drill down from metrics → traces
metrics-generator automatically generates Exemplars, linking a traceID to Mimir metric data points. Add the following to your Grafana datasource configuration:
```yaml
# Grafana datasource configuration (Mimir/Prometheus)
exemplarTraceIdDestinations:
  - name: traceID
    datasourceUid: tempo_uid
```

Note on configuration field names: Depending on your Grafana version and configuration method (YAML, UI, or JSON provisioning), the `exemplarTraceIdDestinations` (camelCase) form is used. It is recommended to verify against the official documentation for the Grafana version you are using.
After configuration, clicking a latency spike point in the Mimir dashboard will automatically open the corresponding trace in Tempo.
Method 2 — Trace to Metrics: Jump from traces → metrics
With the "Trace to metrics" link configuration in the Tempo datasource, you can navigate directly from a specific trace view to the related Mimir metrics.
```yaml
# Grafana Tempo datasource configuration
traceToMetrics:
  datasourceUid: mimir_uid
  queries:
    - name: "Request Rate"
      query: 'rate(traces_spanmetrics_calls_total{service_name="${__span.resource.service.name}"}[5m])'
    - name: "P99 Latency"
      query: 'histogram_quantile(0.99, rate(traces_spanmetrics_duration_seconds_bucket{service_name="${__span.resource.service.name}"}[5m]))'
```

Note on variable syntax: The `${__span.resource.service.name}` variable may be written as `${__span.tags.service.name}` depending on your Grafana version. Verify the exact syntax in the official Trace to Metrics documentation for your Grafana version.
Method 3 — metrics-generator: Store traces as long-term metrics
```yaml
# tempo.yaml
metrics_generator:
  processor:
    service_graphs:
      enabled: true
    span_metrics:
      enabled: true
  storage:
    remote_write:
      - url: http://mimir:9009/api/v1/push
        # For production-required settings like TLS and auth, refer to the official docs:
        # https://grafana.com/docs/tempo/latest/configuration/#metrics_generator
```

Key metrics generated:
| Metric Name | Description |
|---|---|
| `traces_spanmetrics_calls_total` | Call count |
| `traces_spanmetrics_duration_seconds_bucket` | Latency histogram |
| `traces_service_graph_request_total` | Request count between services |
| `traces_service_graph_request_failed_total` | Failure count between services |
Querying payment-service P99 in Mimir with PromQL:
```
histogram_quantile(0.99,
  rate(traces_spanmetrics_duration_seconds_bucket{
    service_name="payment-service"
  }[5m])
)
```

Note on label names: Actual label keys such as `service_name` may differ depending on your metrics-generator configuration and environment. It is recommended to first query the `traces_spanmetrics_calls_total` metric in Grafana Explore to confirm the actual label keys before writing your query.
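Because these span metrics live in Mimir, they can back alert rules with no 24-hour range limit. Here is a hedged sketch of a Prometheus-style alerting rule; the group name, threshold, and `service_name` label are illustrative and depend on your metrics-generator configuration:

```yaml
groups:
  - name: payment-slo  # hypothetical rule group
    rules:
      - alert: PaymentServiceP99High
        # P99 derived from Tempo's span-metrics histogram stored in Mimir
        expr: |
          histogram_quantile(0.99,
            rate(traces_spanmetrics_duration_seconds_bucket{service_name="payment-service"}[5m])
          ) > 2
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "payment-service P99 latency above the 2s SLO"
```

The `for: 5m` clause keeps a single slow scrape interval from paging anyone; the alert fires only when the P99 stays above the threshold.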
Ad-hoc RED Metrics Analysis: Your Entire Service Landscape at a Glance
Once an incident is resolved, if you want to understand the state of all your services, a single line is enough.
This is the core pattern for TraceQL ad-hoc analysis.
```
{ resource.service.name =~ ".+" }
| rate() by (resource.service.name, status)
```

Output of this query: You can view the request rate per service and status (ok/error/unset) as a time-series graph. The moment the error line for payment-service spiked is immediately visible.
With this one line, you can check the Rate and Error status of all services in real time with no additional instrumentation code. Adding quantile_over_time() completes all three RED metrics, including Duration.
RED Metrics: An observability methodology that uses Rate, Errors, and Duration as the three key service health indicators. TraceQL's `rate()` and `quantile_over_time()` let you extract all three without any additional instrumentation.
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Ad-hoc flexibility | Real-time aggregation on arbitrary attributes without prior instrumentation |
| Rich context | Per-request drill-down impossible with metrics alone |
| Structural queries | Explore complex distributed call patterns with span hierarchy-based filtering |
| Full OTEL compatibility | Use OpenTelemetry semantic convention attributes directly |
| Cross-signal | Seamless switching between Mimir ↔ Tempo via Exemplars |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Time range limitation | TraceQL Metrics max 24 hours | Long-term retention via metrics-generator → Mimir |
| No alerting | TraceQL Metrics not supported as a Grafana Managed Alerts source | Write alert rules using traces_spanmetrics_* metrics |
| Query cost | Complex queries on large-scale traces are resource-intensive | Limit time ranges, optimize attribute indexing |
| Sampling impact | Sampling can reduce representativeness of error span filter results | Configure tail-based sampling for 100% error span collection |
| vParquet4 migration | Event/link queries only supported on vParquet4 blocks | Consider upgrading to Tempo 2.6+ and regenerating blocks |
| Cardinality explosion | Using high-cardinality attributes (e.g., `user_id`) in `by()` causes excessive memory use | Use only low-cardinality attributes (e.g., `service.name`) in `by()` |
Cardinality: The number of unique values a given label can hold. Using attributes with millions of unique values, like `user_id`, as aggregation keys can cause system memory to spike dramatically.
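The explosion is multiplicative: the number of distinct series a `by()` clause produces is roughly the product of the cardinalities of its grouping keys. A minimal Python sketch with made-up cardinalities:

```python
def series_count(*cardinalities: int) -> int:
    """Approximate distinct series produced by a by() clause:
    the product of the cardinalities of its grouping keys."""
    total = 1
    for c in cardinalities:
        total *= c
    return total

services = 20          # distinct resource.service.name values
routes = 50            # distinct span.http.route values
user_ids = 1_000_000   # distinct span.user_id values (high cardinality)

print(series_count(services, routes))            # 1000
print(series_count(services, routes, user_ids))  # 1000000000
```

Adding a single million-value key turns a thousand series into a billion, which is why `by(span.user_id)` can take down the query path.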
vParquet4: The storage format for Grafana Tempo. It became the default starting with Tempo 2.6 and adds additional support for event, link, and array columns.
Most Common Mistakes in Practice
- Attempting to configure alerts with TraceQL Metrics: TraceQL Metrics is for exploration, not for use as an alert source. It is recommended to always write alerts based on `traces_spanmetrics_*` metrics stored in Mimir by metrics-generator.
- Using high-cardinality attributes in the `by()` clause: Using attributes with millions of unique values as aggregation keys, like `by(span.user_id)`, can lead to a memory explosion. It is safe to use attributes with tens to hundreds of unique values, such as `resource.service.name` and `span.http.route`.
- Attempting TraceQL Metrics queries beyond 24 hours: Queries that exceed the range will fail or have their results truncated. For cases requiring long-term analysis, it is better to design with PromQL queries against Mimir from the start.
Closing Thoughts
What we covered today: Through a payment-service incident response scenario, we explored error span filtering, root cause tracing with structural operators, real-time P99 extraction with TraceQL Metrics, long-term metric integration with Mimir via metrics-generator, and the Exemplar-based bidirectional metric-trace drill-down workflow. TraceQL bridges individual request context and time-series aggregate analysis in a single tool, through the paradigm shift of "extracting metrics from traces."
Three steps you can start with right now:
- Run a basic error filter in Grafana Explore: Select the Tempo datasource, type `{ status = error }`, and add `| rate() by (resource.service.name)` to visualize the error trend by service. (If the Tempo datasource isn't connected yet, refer to the official datasource configuration guide.)
- Enable metrics-generator: Adding `metrics_generator.processor.span_metrics.enabled: true` and the Mimir `remote_write` configuration to `tempo.yaml` will enable you to write P99/error rate alert rules in PromQL going forward.
- Configure Exemplar integration: Adding `exemplarTraceIdDestinations` to your Mimir/Prometheus datasource settings in Grafana gives you the workflow of jumping to the root cause trace with a single click from a latency spike point.
Next in the Series
We'll cover how to achieve both 100% error span collection and cost optimization simultaneously by configuring tail-based sampling in Grafana Tempo.
References
- TraceQL Official Documentation | Grafana Tempo
- TraceQL Query Construction Guide | Grafana Tempo
- TraceQL Metrics Function Reference | Grafana Tempo
- TraceQL Metrics Queries | Grafana Tempo
- Metrics from Traces | Grafana Tempo
- Grafana Tempo 2.4 Release: Introducing TraceQL Metrics | Grafana Blog
- Grafana Tempo 2.5 Release: vParquet4, Streaming, quantile_over_time | Grafana Blog
- Grafana Tempo 2.6 Release | Grafana Blog
- Grafana Tempo 2.7 Release | Grafana Blog
- Tempo 2.2: Introducing Structural Operators | Grafana Blog
- Configure Trace to Metrics | Grafana Official Documentation
- Solving Problems with TraceQL | Grafana Tempo
- Traces to Metrics Ad-hoc Queries | Grafana Blog
- Advanced TraceQL Tutorial | Giant Swarm
- Service Graph Metrics Analysis | Grafana Tempo