Complete Guide to loadbalancingexporter: Guaranteeing Tail Sampling Accuracy with a 2-Tier Architecture
When operating a Kubernetes-based Observability pipeline, you may one day encounter this situation: you have 4 Tier 2 Collector Pods, but 80% of all traffic is concentrated on just 1 of them, while the remaining 3 are nearly idle. Even more critically, requests exceeding latency thresholds and traces with errors are passing through without being caught by the tail sampling policy. There is only one root cause: the standard Kubernetes Service's round-robin distributes spans belonging to the same trace across multiple instances, preventing any single instance from seeing the complete trace.
loadbalancingexporter solves this problem with a consistent hash ring. By hashing on traceID, all spans of the same trace are always routed to the same Tier 2 instance. After reading this article, you will be able to choose and configure the DNS resolver and k8s resolver for your situation, minimize data loss during Pod restarts with 2-level Resiliency, and detect real-time traffic skew (load skew, uneven distribution) between instances using PromQL queries.
This article is primarily aimed at teams operating OpenTelemetry Collectors in a Kubernetes environment or preparing for horizontal scaling. An appendix for AWS ECS/Fargate environments is included at the end.
Core Concepts
Consistent Hashing and loadbalancingexporter
loadbalancingexporter is an exporter component included in OpenTelemetry Collector Contrib. Its core role is to deliver data to the same downstream Collector instance at all times via a consistent hash ring, using one of three routing keys (routing_key): traceID, service, or streamID.
What is Consistent Hashing? A distributed algorithm designed to minimize hash result changes when nodes are added or removed. Unlike standard modular hashing, it does not redistribute all keys when nodes are added or removed, making it resilient to topology changes.
Hash ring operation example: The diagram below shows how each traceID is assigned to its nearest node with 3 nodes, and how only some keys are redistributed when node-D is added.
Hash ring (3 nodes → adding node-D)

            node-A
           /      \
  traceID-1        traceID-3 ─── (after adding node-D → migrated to node-D)
      |                |
   node-C           node-B
           \      /
          traceID-2
Adding node-D → only traceID-3 is redistributed; traceID-1 and traceID-2 remain on existing nodes.

| routing_key | Primary Use | Default Signal |
|---|---|---|
| traceID | Trace-based tail sampling | traces |
| service | Per-service load isolation | metrics |
| streamID | gRPC stream affinity | - |
streamID is used in scenarios where metrics are collected via gRPC streaming. It is used when the same gRPC stream must always be pinned to the same Collector instance, and unlike traceID and service, it is not applied by default to a specific signal type.
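The minimal-redistribution property is easy to see in code. Below is a toy ring in Python (a single hash point per node, no virtual nodes — a simplification for illustration, not the collector's actual Go implementation): adding a node relocates only the keys that fall into the new node's arc, and every relocated key moves to the new node.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash; Python's built-in hash() is salted per process.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent hash ring: one point per node, no virtual nodes."""
    def __init__(self, nodes):
        self._points = sorted((_hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._points]

    def route(self, trace_id: str) -> str:
        # First node clockwise from the key's position (wrap at the end).
        i = bisect.bisect(self._hashes, _hash(trace_id)) % len(self._points)
        return self._points[i][1]

ring3 = HashRing(["node-A", "node-B", "node-C"])
ring4 = HashRing(["node-A", "node-B", "node-C", "node-D"])

traces = [f"trace-{i:04d}" for i in range(1000)]
moved = [t for t in traces if ring3.route(t) != ring4.route(t)]

# Every relocated trace moved to the new node; all others stayed put.
assert all(ring4.route(t) == "node-D" for t in moved)
print(f"{len(moved)}/1000 traces relocated after adding node-D")
```

With modular hashing (`hash % node_count`) the same experiment relocates the vast majority of keys, which is exactly what the consistent ring avoids.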
Why Standard K8s Services Fall Short
The default behavior of a Kubernetes Service (round-robin or IPVS load balancing) is fatal for tail sampling pipelines. When spans of the same trace are distributed across multiple Collector instances, each instance holds only a portion of the trace, making it impossible to evaluate policies such as "does the total latency of this trace exceed 500ms?"
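A small simulation makes the failure mode concrete. In this sketch (collector names and the byte-sum hash are illustrative stand-ins for the real Service and hash ring), round-robin scatters the four spans of each trace across collectors, while traceID-keyed routing keeps every trace on one collector:

```python
from collections import defaultdict
from itertools import cycle

collectors = ["col-0", "col-1", "col-2"]
# 3 traces × 4 spans each.
spans = [(f"trace-{t}", f"span-{s}") for t in range(3) for s in range(4)]

# Round-robin (default Service behaviour): record which collectors see each trace.
rr_targets = defaultdict(set)
rr = cycle(collectors)
for trace_id, _ in spans:
    rr_targets[trace_id].add(next(rr))

# traceID-keyed routing: a stable hash pins every span of a trace to one collector.
keyed_targets = defaultdict(set)
for trace_id, _ in spans:
    keyed_targets[trace_id].add(collectors[sum(trace_id.encode()) % 3])

print({t: sorted(c) for t, c in rr_targets.items()})     # traces scattered
print({t: sorted(c) for t, c in keyed_targets.items()})  # one collector per trace
```

Under round-robin, no collector ever holds a complete trace, so a "total latency > 500ms" policy has nothing complete to evaluate.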
Overall 2-Tier Architecture
[Application / Agent]
│
[Tier 1 Collector] ← loadbalancingexporter (consistent hash by traceID)
│
[Tier 2 Collector] ← tailsamplingprocessor (same traceID → same instance guaranteed)
│
[Backend (Jaeger / Grafana Tempo, etc.)]

- Tier 1: Hashes received spans by traceID and forwards them to a specific Tier 2 instance. Freely horizontally scalable, holds almost no state.
- Tier 2: All spans for the same traceID are gathered at the same instance, enabling accurate application of tail sampling policies. Spans are held in memory during `decision_wait`.
Key point: Tier 1 and Tier 2 must be separated into distinct Deployments. Combining them in the same Pod will interrupt data reception entirely during Tier 2 redeployment.
Practical Application
Example 1: Dynamically Resolving Pod IPs with DNS Resolver
Using a Kubernetes headless service, you can directly receive each Pod's IP via DNS A records. The dns resolver of loadbalancingexporter periodically re-queries this list of IPs and updates the hash ring.
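Conceptually, the resolver's loop is "resolve all A records, diff against the current backends, rebuild the ring on change". The lookup step can be sketched with the Python standard library (an illustrative stand-in for the collector's Go implementation, not its actual code):

```python
import socket

def resolve_backends(hostname: str, port: int) -> set[str]:
    """Return the set of IPs behind a hostname — one A record per ready Pod
    when the Service is headless (clusterIP: None)."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return {sockaddr[0] for _, _, _, _, sockaddr in infos}

# Against a headless service this returns multiple Pod IPs; against a normal
# Service it returns just the single ClusterIP — the failure mode the
# nslookup check later in this article detects.
print(resolve_backends("localhost", 4317))
```

Re-running this lookup every `interval` and comparing the returned set against the previous one is, in essence, what the dns resolver does before rebuilding the hash ring.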
First, declare the Tier 2 service as headless.
# tier2-headless-service.yaml
apiVersion: v1
kind: Service
metadata:
name: otelcol-tier2-headless
namespace: observability
spec:
clusterIP: None # Key setting for headless service
selector:
app: otelcol-tier2
ports:
- name: otlp-grpc
port: 4317
      targetPort: 4317

Next, configure the dns resolver on the Tier 1 Collector.
# tier1-collector-config.yaml
exporters:
loadbalancing:
routing_key: traceID
protocol:
otlp:
timeout: 1s
tls:
insecure: true
resolver:
dns:
hostname: otelcol-tier2-headless.observability.svc.cluster.local
port: 4317
interval: 5s # DNS re-query interval (time to reflect Pod scaling events)
timeout: 1s
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
      exporters: [loadbalancing] # connect with the exporter named 'loadbalancing'

Note: Without `clusterIP: None`, DNS returns only a single ClusterIP, resulting in no load balancing at all. After applying the configuration, it is recommended to run `kubectl exec -it <pod> -n observability -- nslookup otelcol-tier2-headless.observability.svc.cluster.local` and verify that multiple A records are returned.
| Configuration Item | Role | Recommended Value |
|---|---|---|
| hostname | Headless service FQDN | Full path including namespace recommended |
| interval | DNS re-query interval | 5–30s (adjust based on topology change frequency) |
| timeout | DNS query timeout | 1–3s |
| clusterIP: None | Headless declaration | Must be None to return Pod IPs |
Example 2: Instantly Reflecting Topology Changes with k8s Resolver
The k8s resolver watches Kubernetes EndpointSlices and reflects Pod additions and removals much faster than DNS polling. With the existing Endpoints resource being deprecated in Kubernetes 1.33+, the EndpointSlice-based k8s resolver is now recommended.
# tier1-collector-config.yaml (k8s resolver version)
exporters:
loadbalancing:
routing_key: traceID
protocol:
otlp:
timeout: 1s
tls:
insecure: true
resolver:
k8s:
service: otelcol-tier2.observability
ports:
- 4317
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
      exporters: [loadbalancing]

Using the k8s resolver requires RBAC permissions on the Collector ServiceAccount. A ClusterRole is used because EndpointSlice watching may need to cross namespace boundaries; a namespace-scoped Role cannot watch endpoints in other namespaces.
# rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: otelcol-tier1-role
rules:
- apiGroups: ["discovery.k8s.io"]
resources: ["endpointslices"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: otelcol-tier1-rolebinding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: otelcol-tier1-role
subjects:
- kind: ServiceAccount
name: otelcol-tier1
  namespace: observability

| Comparison | dns resolver | k8s resolver |
|---|---|---|
| Reflection method | Periodic DNS polling | EndpointSlice watch (event-based) |
| Reflection speed | Seconds to tens of seconds depending on interval setting | Nearly immediate (within a few seconds) |
| Additional setup | Headless service required | RBAC required |
| K8s version dependency | None | 1.21+ (EndpointSlice GA) |
Example 3: Integrating 2-Level Resiliency with Health Checks
In the default configuration, retry and queue are disabled. Spans routed while a Tier 2 Pod is restarting are immediately lost. In production environments, 2-level Resiliency must be explicitly enabled.
num_consumers is the number of goroutines that simultaneously consume spans from the queue; it is recommended to start at 2–3x the number of Tier 2 Pods. queue_size can be estimated as the number of incoming batches per second × expected maximum retry time in seconds. For example, in an environment with 100 batches per second and a maximum retry time of 10 seconds, 1000 is a reasonable baseline.
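These sizing rules reduce to two small formulas (the helper names are hypothetical; the numbers mirror the article's own example of 100 batches/s and a 10-second retry window):

```python
def recommended_num_consumers(tier2_pods: int, factor: int = 2) -> int:
    # Start at 2–3x the number of Tier 2 Pods, then tune under real load.
    return tier2_pods * factor

def estimate_queue_size(batches_per_sec: float, max_retry_secs: float) -> int:
    # queue_size ≈ incoming batches per second × worst-case retry window.
    return int(batches_per_sec * max_retry_secs)

print(recommended_num_consumers(4, factor=3))  # 12 consumers for 4 Tier 2 Pods
print(estimate_queue_size(100, 10))            # 1000, the article's baseline
```

Treat both outputs as starting points: if the queue-full log messages described later appear under normal load, the queue is undersized for your actual retry behavior.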
# tier1-collector-config.yaml (with Resiliency settings)
exporters:
loadbalancing:
routing_key: traceID
resolver:
dns:
hostname: otelcol-tier2-headless.observability.svc.cluster.local
interval: 5s
# Resiliency Level 1: Loadbalancer-level retry
queue:
enabled: true
num_consumers: 10 # Start at 2–3x the number of Tier 2 Pods
queue_size: 1000 # Estimate as batches/sec × max retry time (seconds)
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
protocol:
otlp:
timeout: 5s
# Resiliency Level 2: Sub-exporter (per-connection) level retry
retry_on_failure:
enabled: true
initial_interval: 1s
max_interval: 10s
sending_queue:
enabled: true
queue_size: 100
extensions:
health_check:
    endpoint: 0.0.0.0:13133 # Expose /health endpoint

Note: With this configuration alone, the queue is reset on Collector restart. Because `loadbalancingexporter` does not support a Persistent Queue, all sub-exporters share the same in-memory queue. To minimize the span-loss window during Collector restarts, it is recommended to pair this with a fast recovery strategy for Tier 2 (RollingUpdate deployment, preStop hooks, etc.).
2-Level Resiliency Model: Level 2 (sub-exporter) handles connectivity issues with individual Tier 2 instances, while Level 1 (loadbalancer) manages retries from a whole-pipeline perspective. Configuring both levels together maximizes fault isolation.
The following configuration integrates the health check endpoint with Kubernetes liveness/readiness probes.
# tier1-deployment.yaml (probe section)
livenessProbe:
httpGet:
path: /health
port: 13133
initialDelaySeconds: 10
periodSeconds: 15
readinessProbe:
httpGet:
path: /health
port: 13133
initialDelaySeconds: 5
  periodSeconds: 10

Example 4: Completing the Tier 2 Collector Configuration
The Tier 1 configuration alone does not complete the full pipeline. For actual tail sampling to occur in Tier 2, a configuration including tailsamplingprocessor is required.
What is Tail Sampling? A method where the sampling decision is made after all spans of a trace have been collected, by examining the full trace data. Condition-based policies such as latency threshold exceeded or error occurrence are possible, but the prerequisite is that all spans must be gathered on the same instance.
# tier2-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
processors:
tail_sampling:
# decision_wait: time to wait after receiving the last span before making a sampling decision
# All spans of the trace are held in memory during this time
decision_wait: 30s
num_traces: 50000 # Maximum number of traces to hold in memory simultaneously
policies:
- name: errors-policy
type: status_code
status_code:
status_codes: [ERROR]
- name: latency-policy
type: latency
latency:
threshold_ms: 500
- name: probabilistic-policy
type: probabilistic
probabilistic:
sampling_percentage: 10 # Remaining traces sampled at 10%
exporters:
otlp:
endpoint: jaeger-collector:4317
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [tail_sampling]
exporters: [otlp]
`decision_wait` and the memory trade-off: A higher `decision_wait` value improves span collection completeness, but all traces are held in memory for that duration. Memory consumption is roughly traces per second × `decision_wait` (seconds) × average span size, and when the `num_traces` limit is exceeded, the oldest traces are evicted first. Advanced tuning of `tailsamplingprocessor` will be covered in detail in a future article.
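Plugging hypothetical numbers into that estimate shows the order of magnitude involved. All inputs below are assumptions for illustration, and the formula is extended with an explicit spans-per-trace term:

```python
def tail_sampling_buffer_bytes(traces_per_sec: float, decision_wait_s: float,
                               spans_per_trace: float, avg_span_bytes: float) -> float:
    """Rough upper bound: every in-flight trace is buffered for the full window."""
    return traces_per_sec * decision_wait_s * spans_per_trace * avg_span_bytes

# Assumed load: 500 traces/s, 30s decision_wait, 20 spans/trace, 1 KiB/span
in_flight = 500 * 30                 # 15,000 in-flight traces (under a 50,000 num_traces cap)
buffer = tail_sampling_buffer_bytes(500, 30, 20, 1024)
print(f"{in_flight} in-flight traces, ~{buffer / 2**30:.2f} GiB of buffered spans")
```

Running the numbers like this before choosing `decision_wait` and `num_traces` makes it much easier to set sensible container memory requests for Tier 2.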
Example 5: Real-Time Monitoring of Traffic Skew
You can monitor the variance in received span counts across Tier 2 instances using Prometheus metrics.
# Span send rate per Tier 2 instance (over 5 minutes)
sum by (exporter) (
rate(otelcol_exporter_sent_spans_total[5m])
)

How to verify labels: The actual `exporter` label values in your metrics may appear as `loadbalancing/0`, `loadbalancing/1`, or differ by environment. It is recommended to first check the actual label keys and values with `label_values(otelcol_exporter_sent_spans_total, exporter)` and then adjust the query accordingly.
If there is a large variance between instances, it is recommended to first check the following two things:
- Check traceID generation method: Non-standard methods that generate sequential IDs can concentrate traffic on a specific node in the hash ring. A random 128-bit ID following the W3C Trace Context standard is recommended.
- Check routing_key configuration: Review whether a routing_key different from your intent has been configured.
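As a quick triage aid, the per-instance rates returned by the PromQL query can be reduced to a single max/min skew ratio. The label values below are hypothetical sample data, and the 2x threshold matches the rule of thumb from the conclusion of this article:

```python
def skew_ratio(rates_by_exporter: dict) -> float:
    """Max/min span rate across Tier 2 backends; ~1.0 means an even spread."""
    rates = [r for r in rates_by_exporter.values() if r > 0]
    return max(rates) / min(rates)

# Assumed output of: sum by (exporter) (rate(otelcol_exporter_sent_spans_total[5m]))
rates = {
    "loadbalancing/10.0.1.5": 950.0,
    "loadbalancing/10.0.1.6": 1010.0,
    "loadbalancing/10.0.1.7": 980.0,
    "loadbalancing/10.0.1.8": 2100.0,  # suspiciously hot backend
}
print(f"skew ratio: {skew_ratio(rates):.2f}")  # > 2.0 → investigate traceID randomness
```

A persistently high ratio points back at the two checks above: non-random traceIDs or a misconfigured routing_key.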
Pros and Cons Analysis
Advantages
| Item | Detail |
|---|---|
| Tail sampling accuracy | All spans of the same traceID are gathered at the same Collector, enabling accurate sampling policy evaluation |
| Dynamic topology support | dns/k8s/aws_cloud_map resolvers automatically reflect Pod scaling events |
| Deterministic routing | Multiple Tier 1 instances with the same configuration always produce the same routing result |
| Multi-signal support | Can process traces, metrics, and logs; routing_key can be configured differently per signal type |
| Flexible routing keys | Three routing_key options — traceID / service / streamID — can be used selectively |
Disadvantages and Caveats
| Item | Detail | Mitigation |
|---|---|---|
| Unsafe defaults | retry/queue disabled by default; immediate data loss on failure | Explicitly enable 2-level retry/queue |
| No Persistent Queue support | All sub-exporters share the same in-memory queue; per-exporter persistent queues are not possible | Set sufficient queue size; pair with fast Tier 2 recovery strategy |
| DNS TTL delay | dns resolver introduces a delay of seconds to tens of seconds in reflecting topology changes | Optimize interval value or consider switching to k8s resolver |
| k8s resolver RBAC | Without EndpointSlice watch permissions, the backend list cannot be updated | Explicitly specify endpointslices permission in ClusterRole |
| Hash skew risk | Non-standard traceID generation (sequential IDs, etc.) can concentrate traffic on a specific Tier 2 | Use W3C standard random 128-bit traceIDs |
| v0.141.0 dns:/// bug | Regression exists that disables gRPC-level DNS round-robin configuration | Recommended to use v0.145.0 or later |
Most Common Mistakes in Practice
- Failing to diagnose the cause after not configuring a headless service: Using the dns resolver without `clusterIP: None` returns only a single ClusterIP, resulting in no load balancing whatsoever. If running `nslookup otelcol-tier2-headless.observability.svc.cluster.local` from inside a Pod after applying the configuration returns only a single IP, the headless configuration is missing. It is recommended to verify with `kubectl get svc -n observability` that the `CLUSTER-IP` column shows `None`.
- Silent data loss from trusting retry/queue defaults: With default settings, a Tier 2 Pod restart alone causes spans worth seconds to tens of seconds of traffic to be silently dropped with no errors. If you see `"Dropping data because sending_queue is full"` or connection failure messages in Collector logs, that is a signal that Resiliency settings are missing. It is recommended to use a checklist to verify that the 2-level Resiliency configuration is included before deploying to production.
- Difficulty tracing symptoms after not applying k8s resolver RBAC: Without `endpointslices` permission in the ClusterRole, the resolver cannot update the backend list and continues routing only to existing Pod IPs. Symptoms may include traffic still being directed to old IPs even after Pods have been replaced. It is recommended to check resolver-related logs with `kubectl logs <tier1-pod> -n observability | grep -i "endpoint\|resolver\|backend"`.
Conclusion
The loadbalancingexporter 2-tier architecture is a pattern recommended by the official OpenTelemetry documentation and Grafana that simultaneously achieves tail sampling accuracy and horizontal scalability.
Three steps you can take right now:
- Inspect your current configuration: Check the resolver method with `kubectl exec -it <tier1-pod> -n observability -- cat /conf/config.yaml`, and verify with `kubectl get svc -n observability` that the `CLUSTER-IP` column for the Tier 2 service shows `None`. If it is a headless service, also run `nslookup` from inside a Pod to confirm that multiple A records are returned.
- Upgrade version and enable Resiliency: If your current Collector version is around v0.141.0, it may be affected by the dns:/// regression bug, so upgrading to v0.145.0 or later is recommended. Then explicitly add `retry_on_failure` and `queue` settings to the Tier 1 configuration file and deploy. After deployment, you can verify that the resolver is correctly detecting backends by looking for messages of the form `"backend <IP>:4317 added"` in the Collector logs.
- Add traffic skew monitoring: Add the query `sum by (exporter) (rate(otelcol_exporter_sent_spans_total[5m]))` to Prometheus and visualize it on a Grafana dashboard. First confirm the actual label values with `label_values(otelcol_exporter_sent_spans_total, exporter)` and then adjust the query. If the variance between instances persistently exceeds 2x, it is recommended to re-examine the traceID generation method or the routing_key configuration.
Next article: Deep dive into `tailsamplingprocessor` policies — designing composite policies based on latency, error rate, and attributes, and optimizing memory usage by tuning `decision_wait`
Appendix: Cloud Map Resolver for AWS ECS/Fargate Environments
In AWS ECS/Fargate environments outside of Kubernetes, the aws_cloud_map resolver can be used.
resolver:
aws_cloud_map:
namespace: "otel-collectors"
service_name: "tier2-collector"
port: 4317
interval: 30s
    timeout: 5s

References
- loadbalancingexporter README | OpenTelemetry Collector Contrib
- Collector Scaling Guide | OpenTelemetry Official
- Gateway Deployment Pattern | OpenTelemetry Official
- Tail Sampling Concepts and 2-Tier Architecture | OTel Official Blog
- Scale Alloy tail sampling | Grafana Official Documentation
- OpenTelemetry Resiliency Guide | OpenTelemetry Official
- k8s resolver EndpointSlice migration issue #40871 | GitHub
- dns:/// scheme regression issue #14372 | GitHub
- k8s resolver example | OpenTelemetry Collector Contrib GitHub
- OpenTelemetry Collector Reference Architectures | Elastic