OpenTelemetry Collector Horizontal Scaling: Maintaining Tail Sampling Accuracy with Agent-Gateway Architecture and loadbalancingexporter
If you're already using the tail_sampling processor and considering scaling your Collector instances to two or more, this article addresses exactly your situation. When processing thousands of spans per second, there will inevitably come a point where a single Collector can no longer keep up. However, simply adding instances introduces a problem that can't be solved by scaling alone: degraded Tail Sampling accuracy. When spans belonging to the same trace are scattered across different Collector instances, even a request that produced an error may not be preserved by any instance.
This article targets backend and DevOps engineers who are running Kubernetes and have just started adopting OpenTelemetry. By combining the Agent-Gateway two-tier architecture with the loadbalancingexporter, you can maintain Tail Sampling aggregation accuracy even in a horizontally scaled environment.
Core Concepts
Why Tail Sampling Breaks Under Horizontal Scaling
Head Sampling decides immediately at the start of a request whether to record it. Tail Sampling, on the other hand, makes its decision after all spans in a trace have been collected, by looking at the complete picture. This allows far more sophisticated policy configuration, since decisions can be based on actual outcomes — whether an error occurred, total latency, specific attribute values, and so on.
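To make the contrast concrete, here is a minimal, illustrative sketch of a tail-sampling decision in Python. This is not the tailsamplingprocessor's actual implementation; the span fields (`status`, `duration_ms`) and the thresholds mirror the policies used later in this article but are assumptions for illustration.

```python
# Illustrative tail-sampling decision: runs only AFTER the whole trace
# has been collected, so it can look at actual outcomes.
import random

def tail_sample(spans, latency_threshold_ms=1000, keep_percent=10):
    """Return True if the complete trace should be kept."""
    # errors-policy: any failing span keeps the whole trace
    if any(s["status"] == "ERROR" for s in spans):
        return True
    # slow-traces-policy: longest span duration as a proxy for total latency
    if max(s["duration_ms"] for s in spans) >= latency_threshold_ms:
        return True
    # probabilistic-policy: keep a small sample of ordinary traces
    return random.uniform(0, 100) < keep_percent

trace = [
    {"status": "OK", "duration_ms": 40},
    {"status": "ERROR", "duration_ms": 12},  # one failing span
]
print(tail_sample(trace))  # an errored trace is always kept: True
```

Head Sampling cannot express the first two branches at all, because at the start of a request neither the error status nor the total latency is known yet.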
The problem is that this approach requires the premise that "all spans must reside on the same instance." When you scale out to three Collectors and a load balancer distributes spans randomly, the spans that make up a single trace end up split across different instances.
```
# The problem that arises with naive horizontal scaling
Trace ABC's Spans 1,4,7      → Gateway-1 (no errors → drop decision)
Trace ABC's Spans 2,3,5      → Gateway-2 (incomplete trace → drop decision)
Trace ABC's Span 6 (ERROR)   → Gateway-3 (only 1 span → drop decision)

Result: Trace ABC, which contained an error, is not preserved by any instance
```

Agent-Gateway Two-Tier Architecture
The standard architecture for solving this problem is the Agent-Gateway two-tier architecture.
```
Application
    │ OTLP
    ▼
[Agent Collector] ─── loadbalancingexporter (routing_key: traceID) ───▶ [Gateway-1] ── tailsamplingprocessor
   (DaemonSet)                                                     ├──▶ [Gateway-2] ── tailsamplingprocessor
                                                                   └──▶ [Gateway-3] ── tailsamplingprocessor
```

| Tier | Deployment | Role |
|---|---|---|
| Tier 1 (Agent) | DaemonSet / Sidecar | Resource detection, basic filtering, batch processing, then forwarding to Gateway |
| Tier 2 (Gateway) | Deployment (horizontally scaled) | Tail Sampling, metrics aggregation, sensitive data removal, multi-backend export |
Agents are deployed close to applications and handle only lightweight processing. All heavy processing is delegated to the Gateway tier, and the loadbalancingexporter determines which Gateway instance to send to.
Consistent Hashing in loadbalancingexporter
The loadbalancingexporter takes a trace ID as input and routes it to the same Gateway instance every time via consistent hashing. Because the same trace ID always goes to the same Gateway regardless of which Agent processes it, Tail Sampling can see the complete trace.
Consistent Hashing: A hashing method that minimizes the number of keys redistributed when nodes are added or removed. With ordinary hashing, most keys move to different nodes when the node count changes, but with consistent hashing only the keys associated with the changed node are redistributed. When a Gateway pod is added, only a fraction of all traces move to the new instance.
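The redistribution property can be demonstrated with a toy hash ring. This is an illustrative sketch, not the loadbalancingexporter's actual ring implementation; the virtual-node count and hash function are arbitrary choices.

```python
# Toy consistent-hash ring: each gateway contributes many virtual nodes,
# and a trace ID is routed to the first virtual node at or after its hash.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        self._points = sorted(
            (_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def route(self, trace_id: str) -> str:
        i = bisect.bisect(self._keys, _hash(trace_id)) % len(self._keys)
        return self._points[i][1]

ring3 = Ring(["gateway-1", "gateway-2", "gateway-3"])
ring4 = Ring(["gateway-1", "gateway-2", "gateway-3", "gateway-4"])
traces = [f"trace-{i:04x}" for i in range(1000)]
moved = sum(ring3.route(t) != ring4.route(t) for t in traces)
print(f"{moved}/1000 traces moved after adding gateway-4")
```

With ordinary modulo hashing (`hash % node_count`), adding a fourth node would move roughly three quarters of the traces; here only the traces captured by gateway-4's virtual nodes move, roughly one quarter.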
The resolver determines how to obtain the list of Gateway instances.
| Resolver | Use Case | Notes |
|---|---|---|
| `static` | Small fixed environments, testing | No additional permissions required |
| `dns` | Automatic pod discovery via Kubernetes Headless Service | No additional RBAC required |
| `k8s` | Direct Kubernetes API queries, faster pod change detection | Requires Endpoints list/watch RBAC |
The examples in this article use the dns resolver. It can be applied immediately in most Kubernetes environments without additional RBAC configuration, lowering the barrier to initial adoption. The k8s resolver detects pod changes more quickly but requires separate ClusterRole and ServiceAccount configuration.
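For reference, a switch to the k8s resolver looks roughly like the sketch below. The resolver block follows the loadbalancingexporter README's `service`/`ports` form; the ClusterRole name is illustrative, and a ServiceAccount plus ClusterRoleBinding (omitted here) are also required.

```yaml
# Sketch: loadbalancing exporter with the k8s resolver (names are illustrative)
exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: gateway-collector-headless.observability  # <name>.<namespace>
        ports: [4317]
---
# Minimal RBAC the k8s resolver needs to watch Gateway endpoints
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-lb-resolver  # illustrative name
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch"]
```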
Practical Application
Agent DaemonSet Configuration
One Agent is deployed per node, collecting spans from applications on that node and forwarding them to the Gateway. Trace ID-based routing is performed via the loadbalancingexporter.
```yaml
# agent-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 5s
    send_batch_size: 1000
exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: "gateway-collector-headless.observability.svc.cluster.local"
        port: 4317
        interval: 5s  # Delay for reflecting Gateway Pod additions/removals
        timeout: 1s   # Allowed time for DNS lookup failure
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]
```

| Configuration | Description |
|---|---|
| `routing_key: traceID` | Trace ID-based routing (service-name routing is also available) |
| `resolver.dns.hostname` | Kubernetes Headless Service domain |
| `interval: 5s` | DNS re-query interval. Shorter values reflect pod changes faster but increase DNS load |
| `timeout: 1s` | DNS query timeout. If no response within this time, the previous list is retained |
| `processors: [memory_limiter, batch]` | Agent handles only lightweight processing; no `tail_sampling` |
The key point is not placing the tail_sampling processor on the Agent side. All sampling decisions are made at the Gateway.
Gateway Deployment and Headless Service Configuration
The Gateway handles heavy processing including Tail Sampling. The Kubernetes Headless Service exposes each pod's IP directly, allowing the loadbalancingexporter to build its consistent hashing ring.
```yaml
# gateway-headless-svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: gateway-collector-headless
  namespace: observability
spec:
  clusterIP: None  # Headless: DNS queries return individual Pod IP list
  selector:
    app: gateway-collector
  ports:
    - port: 4317
      name: otlp-grpc
```

```yaml
# gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway-collector
  namespace: observability
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gateway-collector
  template:
    metadata:
      labels:
        app: gateway-collector
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: [gateway-collector]
              topologyKey: kubernetes.io/hostname
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.115.0
          resources:
            requests:
              memory: "2Gi"
              cpu: "500m"
            limits:
              memory: "8Gi"
              cpu: "2000m"
```

```yaml
# gateway-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 6144  # ~75% of Deployment limits.memory
    spike_limit_mib: 1024
  tail_sampling:
    decision_wait: 30s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    decision_cache:  # cache sizes are nested under decision_cache in recent contrib versions
      sampled_cache_size: 100000
      non_sampled_cache_size: 100000
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
  batch:
    timeout: 5s
    send_batch_size: 1000
exporters:
  otlp/backend:
    endpoint: "backend:4317"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/backend]
```

| Configuration | Description |
|---|---|
| `decision_wait: 30s` | Maximum time to wait for trace completion. In service mesh environments (Envoy/Linkerd sidecars), proxy latency accumulates; setting this to the service P99 response time plus 10–15 seconds of buffer is recommended |
| `num_traces: 50000` | Number of concurrently processed traces. Assuming an average span size of ~1KB and an average of 20 spans per trace, 50,000 traces ≈ 1GB as a baseline; plan for 2–3x that when including indexing overhead |
| `decision_cache.sampled_cache_size` | An LRU cache that applies prior decisions to spans that arrive late, after a trace has been evicted from memory |
| `policies` order | Evaluated top to bottom; if any policy results in sampled, that trace is preserved |
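The 1GB baseline above can be reproduced with simple arithmetic; the per-span size and spans-per-trace figures are the assumptions stated in the table, not measured values.

```python
# Back-of-envelope memory estimate for the tail_sampling buffer,
# using the assumptions from the table above (~1KB/span, 20 spans/trace).
num_traces = 50_000
spans_per_trace = 20
span_bytes = 1024  # ~1KB per span (assumption)

baseline = num_traces * spans_per_trace * span_bytes
print(f"baseline: {baseline / 2**30:.2f} GiB")  # ~0.95 GiB
print(f"with 2-3x overhead: {2 * baseline / 2**30:.1f}-{3 * baseline / 2**30:.1f} GiB")
```

Run the same arithmetic against your own average span size and trace depth before settling on `num_traces` and the container memory limit.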
PodDisruptionBudget (PDB): A Kubernetes policy that guarantees a minimum number of available pods when pods are forcibly terminated. Applying a PDB to the Gateway tier prevents trace loss during node maintenance or upgrades. Starting with `minAvailable: 2` is recommended.
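A minimal PDB manifest for the Gateway tier could look like the following sketch; the object name is illustrative, and the selector assumes the `app: gateway-collector` label used in this article's Deployment.

```yaml
# Sketch: PDB keeping at least 2 Gateway pods available during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gateway-collector-pdb  # illustrative name
  namespace: observability
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: gateway-collector
```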
Span Metrics Pipeline Configuration
When deriving RED metrics (Request rate, Error rate, Duration) from trace data, care must be taken to ensure that spans dropped by Tail Sampling are not omitted from the metrics. The spanmetricsconnector is declared in the connectors section, not processors, and the full span count must be calculated first in a separate pipeline that is completely independent from the Tail Sampling pipeline.
```yaml
# Gateway config: correct spanmetrics configuration
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100ms, 250ms, 500ms, 1s, 5s]
    dimensions:
      - name: http.method
      - name: http.status_code
service:
  pipelines:
    # Pipeline 1: Calculate metrics from all traces (before sampling)
    traces/metrics:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [spanmetrics]  # connector used as exporter
    # Pipeline 2: Tail Sampling (runs independently, shares the same receiver)
    traces/sampling:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/backend]
    # Pipeline 3: Metrics output
    metrics:
      receivers: [spanmetrics]  # connector used as receiver
      exporters: [prometheus]
```

The otlp receiver is shared between the two pipelines (traces/metrics and traces/sampling), so the same span is delivered to both. The traces/metrics pipeline aggregates all spans regardless of drop decisions, while the traces/sampling pipeline performs Tail Sampling independently.
spanmetricsconnector: An OpenTelemetry Collector component that automatically generates RED metrics, including latency histograms, call counts, and error rates, from trace span data. It is declared in the `connectors` section rather than `processors`, and is used simultaneously as both an exporter (output) and a receiver (input) in the pipeline.
Trade-off Analysis
Advantages
| Item | Detail |
|---|---|
| Tail Sampling accuracy | Trace ID-based routing guarantees complete trace aggregation |
| Horizontal scalability | Only the Gateway tier needs to be scaled out independently |
| Fault isolation | Agent failures do not propagate to the Gateway, and vice versa |
| Flexible policy management | Sampling policies can be centralized at the Gateway for consistent management |
| Automatic service discovery | Gateway pod additions/removals are automatically reflected via the DNS/k8s resolver |
Disadvantages and Caveats
| Item | Detail | Mitigation |
|---|---|---|
| Memory pressure | All spans must be kept in memory for the duration of `decision_wait`. At ~1KB per span and 20 spans per trace, 50,000 traces ≈ 2–3GB including overhead | Set `memory_limiter`'s `limit_mib` to no more than 75% of the container memory limit, and tune `num_traces` to match actual traffic |
| Rehashing problem | When the number of Gateway pods changes, some traces may be redistributed to different instances and become incomplete | Blue/green switchover is recommended over rolling updates; changes should be made in minimal increments |
| Single point of failure risk | A full Gateway tier outage results in trace loss | Apply PodDisruptionBudget (`minAvailable: 2`) and anti-affinity rules |
| Late span arrival | Spans that arrive after `decision_wait` are dropped | Tune `decision_wait` to service P99 response time plus buffer (10–15 seconds) |
| Operational complexity | Both tiers must be managed with separate monitoring and alerting systems | Configure per-tier alerts based on the metrics below |
```
# Key monitoring metrics for the Gateway tier
otelcol_exporter_queue_size              # Export queue utilization (scale-out signal when exceeding 60–70%)
otelcol_processor_dropped_spans_total    # Dropped span count (requires immediate attention if it spikes)
otelcol_processor_tail_sampling_*        # Tail Sampling decision statistics
otelcol_receiver_accepted_spans_total    # Inbound throughput
```

If otelcol_receiver_accepted_spans_total values are unevenly distributed across Gateway instances, consider reducing the DNS TTL or lowering the interval value. If the imbalance persists, switching from the dns resolver to the k8s resolver will improve pod change detection speed (note: Endpoints list/watch RBAC configuration is required).
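As one way to wire the first two metrics into alerts, here is a hedged Prometheus rule sketch. The thresholds and group/alert names are illustrative, and it assumes `otelcol_exporter_queue_capacity` is exported alongside the queue size, which holds for recent contrib builds.

```yaml
# Sketch: Prometheus alerting rules for the Gateway metrics above (tune thresholds)
groups:
  - name: otel-gateway
    rules:
      - alert: GatewayExporterQueueFilling
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.7
        for: 5m
        annotations:
          summary: "Export queue above 70%: consider scaling out the Gateway tier"
      - alert: GatewayDroppingSpans
        expr: rate(otelcol_processor_dropped_spans_total[5m]) > 0
        for: 2m
        annotations:
          summary: "Gateway is dropping spans; investigate immediately"
```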
The Most Common Mistakes in Practice
- Adding `tail_sampling` to the Agent as well: Agents see only a portion of the full trace, making accurate sampling decisions impossible. `tail_sampling` must be placed exclusively on the Gateway tier.
- Using a regular ClusterIP Service instead of a Headless Service: A regular Service returns a single virtual IP on DNS queries. Since the `loadbalancingexporter` needs to receive a list of individual pod IPs to build its consistent hashing ring, a Headless Service with `clusterIP: None` is strictly required.
- Declaring `spanmetricsconnector` in the `processors` section: The `spanmetricsconnector` must be declared in the `connectors` section and branched into a separate pipeline. Placing it in `processors` causes the Collector to refuse to start, and running it before Tail Sampling within the same pipeline is also impossible. The pipelines must always be split into `traces/metrics` and `traces/sampling`.
Closing Thoughts
The combination of the Agent-Gateway two-tier architecture and the loadbalancingexporter is the scaling pattern recommended in the official OpenTelemetry documentation, and it is proven in production environments processing tens of thousands of spans per second.
Three steps you can take right now:
- Review your current architecture: Run `kubectl get pods -n observability` to check your current Collector configuration, and determine whether Tail Sampling is running on a single instance or already in a distributed environment.
- Deploy the Headless Service first, then apply the Gateway Deployment: Deploying in the order `kubectl apply -f gateway-headless-svc.yaml && kubectl apply -f gateway-deployment.yaml` is recommended. The Service must come up first so that the `loadbalancingexporter` can successfully resolve DNS at startup.
- Apply the `loadbalancingexporter` to the Agent and validate: After adding the `loadbalancingexporter` to the Agent configuration, compare the `otelcol_receiver_accepted_spans_total` metric per instance to verify that each Gateway is receiving traces evenly.
Next article: We will explore how to integrate HPA (Horizontal Pod Autoscaler) with the Gateway tier using the OpenTelemetry Operator's `TargetAllocator`, enabling automatic response to traffic spikes while maintaining Tail Sampling accuracy.
References
- Scaling the Collector | OpenTelemetry Official Docs
- Gateway Deployment Pattern | OpenTelemetry Official Docs
- loadbalancingexporter README | GitHub opentelemetry-collector-contrib
- tailsamplingprocessor README | GitHub opentelemetry-collector-contrib
- Tail Sampling with OpenTelemetry: Why it's useful | OpenTelemetry Blog
- Sampling Concepts | OpenTelemetry Official Docs
- Scale Alloy Tail Sampling | Grafana Official Docs
- otelcol.exporter.loadbalancing | Grafana Alloy Official Docs
- Composing OpenTelemetry Reference Architectures | Elastic Observability Labs
- Patterns for Deploying OpenTelemetry Collector at Scale | SigNoz