A 2-Stage Pipeline Design Using OpenTelemetry Collector Tail Sampling to Retain 100% of Error & Latency Traces While Cutting Observability Costs by 74% (OTel Collector)
When operating distributed systems, you will inevitably encounter a situation where tracing data accumulates faster than expected. As traffic grows, the number of spans flowing into Jaeger (an open-source distributed tracing UI) or Tempo (Grafana's trace storage) increases not linearly but explosively, and at some point you receive a message from the infrastructure team saying "tracing storage costs have exceeded $25,000 per month." But blindly lowering the sampling rate risks the worst-case scenario: when an incident actually occurs, the error traces are gone. If you are new to the OTel ecosystem, it is recommended to first read the OpenTelemetry official Sampling concepts documentation.
The solution to this problem is a 2-stage pipeline combining the tail_sampling processor and the filter processor in OpenTelemetry Collector (hereafter OTel Collector). When the majority of all traces are normal traces, reducing them to the 5% level yields a maximum cost reduction of approximately 74% ($25,000 → $6,500, i.e., $18,500 ÷ $25,000 ≈ 74%). The actual savings depend on the proportion of error and latency traces, and since the key trade-off of this design is memory cost, measuring your current TPS (traces per second) is a good starting point.
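The headline arithmetic can be reproduced in a few lines. The 22% error/latency share below is an illustrative assumption chosen to match the article's figures, not a measurement — plug in your own trace mix:

```python
# Reproducing the headline cost figure. The 22% error/latency share is an
# illustrative assumption, not a measured value.
monthly_cost = 25_000        # current tracing bill, USD/month
error_or_slow_share = 0.22   # assumed fraction of traces retained at 100%
keep_rate = 0.05             # probabilistic rate for the remaining normal traces

retained = error_or_slow_share + (1 - error_or_slow_share) * keep_rate
new_cost = monthly_cost * retained

print(f"retained volume:  {retained:.1%}")      # 25.9%
print(f"new monthly cost: ${new_cost:,.0f}")    # $6,475
print(f"savings:          {1 - retained:.1%}")  # 74.1%
```

As the retained share of error and latency traces shrinks, the savings climb toward the pure probabilistic limit (~95% at a 5% sampling rate).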
By reading to the end of this article, you will have a working YAML configuration in hand within 20 minutes. We will walk through the entire flow — from traces sent in OTLP (OTel's standard transport protocol) format from the SDK (OpenTelemetry instrumentation library), through the Collector, and arriving at the backend — step by step, from basic configuration to high-volume scaling.
Core Concepts
Head-Based vs Tail-Based Sampling
Sampling approaches fall into two broad categories.
| Type | Decision Point | Basis for Decision | Characteristics |
|---|---|---|---|
| Head-Based Sampling | At span start | Probability or context flags only | Simple implementation, risk of missing error traces |
| Tail-Based Sampling | After all spans collected | Full context including error codes, latency, attributes | Can preserve only meaningful traces |
Tail-Based Sampling — An approach that decides "should this trace be retained?" after all spans comprising the trace have been collected. It can use information only knowable after trace completion — such as error codes and latency — as the basis for decisions, making it well-suited for selectively preserving only important traces.
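The kind of decision tail sampling makes is easy to sketch in Python — this is illustrative pseudologic, not the Collector's implementation, and the span field names (`status_code`, `start_ms`, `end_ms`) are invented for the example:

```python
import random

# Illustrative tail-sampling decision for one *completed* trace.
# Field names (status_code, start_ms, end_ms) are made up for this sketch.
def decide(trace_spans, latency_threshold_ms=1000, keep_percent=5):
    if any(s["status_code"] == "ERROR" for s in trace_spans):
        return "SAMPLE"  # error traces are always retained
    duration = (max(s["end_ms"] for s in trace_spans)
                - min(s["start_ms"] for s in trace_spans))
    if duration > latency_threshold_ms:
        return "SAMPLE"  # slow traces are always retained
    # remaining normal traces: probabilistic sampling
    return "SAMPLE" if random.uniform(0, 100) < keep_percent else "DROP"

error_trace = [{"status_code": "ERROR", "start_ms": 0, "end_ms": 50}]
slow_trace = [{"status_code": "OK", "start_ms": 0, "end_ms": 2500}]
print(decide(error_trace))  # SAMPLE
print(decide(slow_trace))   # SAMPLE
```

The key point is that `decide` runs only after the whole trace is assembled — which is exactly why spans must be buffered in memory.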
Structure of the 2-Stage Pipeline
The overall architecture is structured as follows.
[SDK] → [Layer 1: Gateway Collector]
└─ loadbalancingexporter (routing_key: traceID)
↓
[Layer 2: Sampling Collector ×N]
├─ tail_sampling processor ← Stage 1: per-trace retention decision
├─ filter processor ← Stage 2: additional span-level filtering
└─ Backend (Jaeger / Tempo / OTLP)

Stage 1 — The tail_sampling processor buffers the entire trace in memory, then evaluates the configured set of policies to decide SAMPLE / DROP. It supports various policy types including status_code, latency, probabilistic, and composite; the full list for each version is in the official README.
Stage 2 — The filter processor performs span-level filtering using OTTL (OpenTelemetry Transformation Language) conditional expressions. Its role is to additionally remove noisy paths such as /health and /metrics from traces that have already passed tail sampling.
OTTL (OpenTelemetry Transformation Language) — A declarative conditional expression language used in OTel Collector's `filter` and `transform` processors. It uses syntax similar to SQL's WHERE clause to express attribute conditions, TraceState access, and nested logical operations.
Core Priority Principle
The priority of three policies is the heart of this design.
Error traces (status.code = ERROR) → status_code policy → 100% retained
Latency traces (latency > threshold) → latency policy → 100% retained
Normal traces → probabilistic → 5% sampled

Trade-off Overview Before Adoption
Understanding the key trade-offs of this design before diving into the examples will help you more easily judge which example fits your situation.
| Item | Details |
|---|---|
| Complete visibility guarantee | 100% capture of error & latency traces — no data loss for post-incident debugging |
| Cost reduction | Drastically reduces storage & ingestion costs by cutting normal traces to 5–10% |
| High memory requirement | Spans must be buffered for decision_wait × TPS. 2–2.5 GiB recommended at 20k spans/s |
| Stateful architecture | All spans for the same trace must arrive at the same Collector → load balancing layer required |
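The memory figure in the table can be sanity-checked with rough arithmetic. The ~10 KiB-per-buffered-span figure below is an assumption for illustration; measure your own span sizes before sizing production instances:

```python
# Back-of-the-envelope buffer sizing for tail_sampling.
# bytes_per_span is an assumption; real sizes vary with attribute count.
spans_per_sec = 20_000
decision_wait_s = 10
bytes_per_span = 10 * 1024   # ~10 KiB per buffered span, assumed

buffered_spans = spans_per_sec * decision_wait_s
buffer_gib = buffered_spans * bytes_per_span / 2**30
print(f"{buffered_spans:,} spans buffered ≈ {buffer_gib:.1f} GiB")
```

With Go runtime overhead and GC headroom on top of the raw buffer, this lands in the 2–2.5 GiB range the table recommends.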
Practical Application
Now let's implement these concepts in actual YAML. The examples increase in complexity, starting from a basic configuration and progressing through composite policies, Span Metrics derivation, and high-volume scaling.
Example 1: Basic Configuration — 100% Retention of Errors & Latency + 5% Sampling of Normal Traces
When you need it: This is the first configuration to apply when you want to reliably preserve error and slow traces at a traffic scale manageable by a single Collector instance.
Use tail_sampling to select important traces, and filter to remove health check paths.
processors:
tail_sampling:
decision_wait: 10s # Span collection wait time (waits for trace completion)
num_traces: 100000 # Maximum number of traces to hold in memory
expected_new_traces_per_sec: 10000
policies:
# Policy 1: 100% retention of error traces
- name: keep-errors
type: status_code
status_code:
status_codes: [ERROR]
# Policy 2: 100% retention of traces with latency > 1 second
- name: keep-slow-traces
type: latency
latency:
threshold_ms: 1000
# Policy 3: 5% probabilistic sampling of remaining normal traces
- name: sample-normal
type: probabilistic
probabilistic:
sampling_percentage: 5
# Stage 2: Remove health check and metrics paths
filter:
error_mode: ignore
traces:
span:
# Drop spans matching these conditions
- |
attributes["http.target"] == "/health" or
attributes["http.target"] == "/metrics"
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, k8sattributes, tail_sampling, filter, batch]
exporters: [otlp/tempo]

Note — Semantic Conventions version: `attributes["http.target"]` follows the older (1.x-era) HTTP semantic conventions. In the stabilized HTTP conventions, `http.target` is deprecated and its role is covered by `url.path` (and `url.query`). If you are using a recent SDK, change this to `attributes["url.path"]`.
| Configuration Item | Role | Recommended Value |
|---|---|---|
| `decision_wait` | Span completion wait time | 10–30s (based on service P99 response time) |
| `num_traces` | Maximum number of traces in memory | TPS × `decision_wait` or more |
| `sampling_percentage` | Normal trace sampling ratio | 5–10% |
| `error_mode: ignore` | Prevents drops on filter evaluation errors | Always recommended |
`memory_limiter` processor — A safety net to prevent OOM (Out of Memory) that rejects incoming spans when memory usage exceeds a threshold. It is essential to place it before `tail_sampling`, and you can monitor drops via the `otelcol_processor_refused_spans` metric. It is recommended to set `GOMEMLIMIT` to 90% of the container memory limit.
Note: Processors that depend on request context, such as `k8sattributes`, must be placed before `tail_sampling`. This is because `tail_sampling` reassembles spans into new batches and loses the original request context in the process.
Example 2: Dynamic Throughput Ratio Control with Composite Policies
When you need it: Use this when you want to set an upper bound on spans per second during traffic spikes, and declaratively adjust the throughput ratio between error, latency, VIP customer, and normal traces.
processors:
tail_sampling:
decision_wait: 10s
num_traces: 100000
policies:
- name: composite-policy
type: composite
composite:
max_total_spans_per_second: 2000 # Upper limit on maximum spans per second
policy_order: [errors, slow, vip, baseline]
composite_sub_policy:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: slow
type: latency
latency: { threshold_ms: 500 }
- name: vip
type: string_attribute
string_attribute:
key: customer.tier
values: ["enterprise", "premium"]
- name: baseline
type: probabilistic
probabilistic: { sampling_percentage: 5 }
rate_allocation:
# The sum doesn't have to be 100, but setting it to 100 makes ratios intuitive
- policy: errors
percent: 40
- policy: slow
percent: 30
- policy: vip
percent: 20
- policy: baseline
percent: 10

rate_allocation declares the upper bound on the proportion of spans each sub-policy may use within the max_total_spans_per_second budget. When the limit is reached, policies earlier in policy_order are processed first. Since the detailed behavior may vary by version, check the official README once more before applying this to production.
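To make the allocation concrete, here is the per-policy spans-per-second budget implied by the configuration above — a simple sketch of the proportional split, not the processor's actual code:

```python
# Per-policy span/s caps implied by max_total_spans_per_second and
# the rate_allocation percentages from the example above.
max_total_spans_per_second = 2000
rate_allocation = {"errors": 40, "slow": 30, "vip": 20, "baseline": 10}

caps = {policy: max_total_spans_per_second * pct // 100
        for policy, pct in rate_allocation.items()}
print(caps)  # {'errors': 800, 'slow': 600, 'vip': 400, 'baseline': 200}
```

If the `errors` sub-policy does not use its full 800 spans/s, exact reclamation behavior is version-dependent — another reason to verify against the README for your Collector release.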
Example 3: Pipeline for Deriving Span Metrics Before Sampling
When you need it: Use this when you want your RED metrics (Rate / Error / Duration) dashboard to be aggregated based on 100% traffic without distortion after sampling.
When generating RED metrics with the spanmetrics connector, aggregation must occur from 100% traffic before sampling. Aggregating after sampling distorts error rate and latency distribution figures by the sampling ratio.
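The distortion is easy to quantify. With a true error rate of 1% and the Example 1 policies (errors kept at 100%, normal traces at 5%), a dashboard fed from post-sampling data reads wildly high — the numbers below are illustrative:

```python
# How post-sampling aggregation inflates the error rate (illustrative numbers).
total_traces = 10_000
error_traces = 100                       # true error rate: 1.0%
normal_traces = total_traces - error_traces

kept_errors = error_traces               # status_code policy keeps 100%
kept_normal = int(normal_traces * 0.05)  # probabilistic keeps 5%

true_rate = error_traces / total_traces
post_sampling_rate = kept_errors / (kept_errors + kept_normal)
print(f"true error rate:          {true_rate:.1%}")           # 1.0%
print(f"post-sampling error rate: {post_sampling_rate:.1%}")  # 16.8%
```

A 1% service suddenly looks like a 17% service — which is why metric derivation must branch off before `tail_sampling`.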
The forward connector acts as a bridge that passes data as-is from one pipeline to another. Using this, you can split the same trace stream into two branches: metric derivation and sampling.
connectors:
forward: {} # Bridge that passes data as-is from traces/input → traces/sampling
spanmetrics: {} # Derives RED metrics from spans and passes them to the metrics pipeline
service:
pipelines:
# Ingestion pipeline: derive metrics from 100% traffic
traces/input:
receivers: [otlp]
processors: [memory_limiter, k8sattributes]
exporters: [spanmetrics, forward]
# Sampling pipeline: filter after metric derivation
traces/sampling:
receivers: [forward]
processors: [tail_sampling, filter, batch]
exporters: [otlp/tempo]
# Metrics pipeline
metrics:
receivers: [spanmetrics]
exporters: [prometheus]

Example 4: Two-Tier Scaling Architecture for High-Volume Environments
When you need it: Use this when traffic exceeds the memory limits of a single Collector (tens of thousands of TPS or more), or when you need to horizontally scale multiple Collector instances.
At high traffic, spans from the same trace can be distributed across multiple Collectors. In this case, tail_sampling makes decisions seeing only some of the trace's spans, resulting in incomplete outcomes. The loadbalancingexporter uses traceID as the routing key to ensure all spans from the same trace always go to the same Layer 2 Collector.
# Layer 1: Gateway Collector — consistent routing based on traceID
exporters:
loadbalancing:
routing_key: traceID
protocol:
otlp:
tls: { insecure: true }
resolver:
dns:
hostname: otel-sampling-collector-headless.monitoring.svc.cluster.local
port: 4317
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [loadbalancing]

# Layer 2: Sampling Collector — apply the pipeline configuration from Example 1 or Example 2 as-is
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, k8sattributes, tail_sampling, filter, batch]
exporters: [otlp/tempo]

Since all spans with the same traceID are always delivered to the same Layer 2 Collector, the consistency of tail sampling decisions is guaranteed.
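The routing invariant can be illustrated with a toy hash. The real loadbalancingexporter uses consistent hashing over the resolved endpoints; this modulo sketch only demonstrates why identical traceIDs always land on the same backend:

```python
import hashlib

# Toy traceID routing. The real loadbalancingexporter uses consistent hashing;
# this modulo version only shows the "same trace → same backend" invariant.
backends = ["sampler-0", "sampler-1", "sampler-2"]

def route(trace_id: str) -> str:
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]

trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = ["GET /api/orders", "SELECT orders", "publish kafka"]  # spans of one trace
# Every span carries its trace's ID, so all of them resolve to one backend.
assert len({route(trace_id) for _ in spans}) == 1
```

Consistent hashing additionally minimizes how many traces get remapped when a Layer 2 instance is added or removed, which a plain modulo cannot do.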
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Complete visibility guarantee | 100% capture of error & latency traces — no data loss for post-incident debugging |
| Cost reduction | Drastically reduces storage & ingestion costs by cutting normal traces to 5–10% (e.g., $25,000 → $6,500/month) |
| Rich policy expressiveness | Various policy types + OTTL conditional expressions enable complex business logic |
| Declarative configuration | YAML-based, allowing sampling ratio changes without code redeployment |
| Failure isolation | Failures in the sampling layer do not propagate to the gateway layer, improving stability |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| High memory requirement | Spans must be buffered for `decision_wait` × TPS | 2–2.5 GiB recommended at 20k spans/s; set `GOMEMLIMIT` to 90% of container limit |
| Stateful architecture | All spans for the same trace must arrive at the same Collector | `loadbalancingexporter` with traceID-based routing is required (see Example 4) |
| `decision_wait` dilemma | Too short risks missing spans; too long causes memory spikes | Typically 10–30s, set based on service P99 response time |
| Cold start incompleteness | Traces arriving immediately after a Collector restart are evaluated in an incomplete state | Use rolling restarts to bring instances back sequentially |
| Context loss | Context-dependent processors cannot be used after `tail_sampling` | Place `k8sattributes` and similar processors before `tail_sampling` |
Most Common Mistakes in Practice
- Placing `k8sattributes` after `tail_sampling` — Reassembled batches have no original request context, so Kubernetes metadata enrichment will not work. Keep the processor order `memory_limiter → k8sattributes → tail_sampling`.
- Leaving `num_traces` at its default value — Set it to at least `expected_new_traces_per_sec × decision_wait`. If it is too low, old traces are evicted early and error traces may disappear along with them.
- Aggregating span metrics in the post-sampling pipeline — RED metrics will be scaled down by the sampling ratio, displaying incorrect error rates and latency distributions on the dashboard. As shown in Example 3, aggregation must always happen in the pre-sampling pipeline.
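The second mistake can be caught with a one-line check before deployment — a sketch using Example 1's values; substitute your measured rate:

```python
# Sanity check: num_traces must cover decision_wait worth of new traces.
expected_new_traces_per_sec = 10_000  # from Example 1; use your measured rate
decision_wait_s = 10

min_num_traces = expected_new_traces_per_sec * decision_wait_s
configured_num_traces = 100_000       # value used in Example 1

assert configured_num_traces >= min_num_traces, (
    "num_traces too low: traces will be evicted before a decision is made"
)
print(f"minimum num_traces: {min_num_traces:,}")  # 100,000
```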
Closing Thoughts
Since the key trade-off of this design is memory cost, measuring your current TPS and setting num_traces appropriately is the starting point for a successful adoption.
Three steps you can take right now:
- Understand your current trace volume — Check spans per second (TPS) via the `otelcol_receiver_accepted_spans` metric, and use `num_traces = TPS × 30` as an initial value.
- Deploy the basic pipeline — Start with the YAML from Example 1 with `sampling_percentage: 100` to verify the pipeline operates without trace loss, then gradually lower the ratio. If `otelcol_processor_refused_spans` increases after lowering the sampling ratio, that is a signal to raise `num_traces`.
- Advance with composite policies — Once cost reduction targets are confirmed, introduce the `composite` policy from Example 2 to tune the throughput ratio between error, VIP, and normal traces to match your service characteristics.
Next article: How to connect OTel Collector's `spanmetrics` connector with Prometheus to build a RED dashboard based on 100% pre-sampling traffic
References
- OpenTelemetry Official – Sampling Concepts
- Tail Sampling Processor README (opentelemetry-collector-contrib)
- Filter Processor README (opentelemetry-collector-contrib)
- OpenTelemetry Official Blog – Introduction to Tail Sampling
- OpenTelemetry Official Blog – 2025 Sampling Milestones
- Scaling the Collector (Official Docs)
- Loadbalancing Exporter README
- TraceState Probability Sampling Specification
- Adaptive Tail-Based Sampling with Dynamic Trace Enrichment (Medium)
- Tail-Based Sampling: Sizing, Memory & Cost Model – Michal Drozd
- Grafana Alloy – tail_sampling Component Reference
- Grafana Tempo – Tail Sampling Policies & Strategies
- Mastering the OpenTelemetry Filter Processor – Dash0
- Composing OTel Reference Architectures – Elastic
- Scale Alloy Tail Sampling – Grafana Docs