Complete Guide to loadbalancingexporter: Guaranteeing Tail Sampling Accuracy with a 2-Tier Architecture
When operating a Kubernetes-based Observability pipeline, you may one day encounter this situation: you have 4 Tier 2 Collector Pods, but 80% of all traffic is concentrated on just 1 of them, while the remaining 3 are nearly idle. Even more critically, requests exceeding latency thresholds and traces with errors are passing through without being caught by the tail sampling policy. There is only one root cause: the standard Kubernetes Service's round-robin distributes spans belonging to the same trace across multiple instances, preventing any single instance from seeing the complete trace.
loadbalancingexporter solves this problem with a consistent hash ring. By hashing on traceID, all spans of the same trace are always routed to the same Tier 2 instance. After reading this article, you will be able to choose and configure the DNS resolver and k8s resolver for your situation, minimize data loss during Pod restarts with 2-level Resiliency, and detect real-time traffic skew (load skew, uneven distribution) between instances using PromQL queries.
This article is primarily aimed at teams operating OpenTelemetry Collectors in a Kubernetes environment or preparing for horizontal scaling. An appendix for AWS ECS/Fargate environments is included at the end.
Core Concepts
Consistent Hashing and loadbalancingexporter
loadbalancingexporter is an exporter component included in OpenTelemetry Collector Contrib. Its core role is to deliver data to the same downstream Collector instance at all times via a consistent hash ring, using one of three routing keys (routing_key): traceID, service, or streamID.
What is Consistent Hashing? A distributed algorithm designed to minimize hash result changes when nodes are added or removed. Unlike standard modular hashing, it does not redistribute all keys when nodes are added or removed, making it resilient to topology changes.
Hash ring operation example: The diagram below shows how each traceID is assigned to its nearest node with 3 nodes, and how only some keys are redistributed when node-D is added.
Hash ring (3 nodes → adding node-D)

            node-A
           /      \
  traceID-1        traceID-3 ─── (after adding node-D → migrated to node-D)
      |                |
   node-C           node-B
           \      /
          traceID-2
Adding node-D → only traceID-3 is redistributed; traceID-1 and traceID-2 remain on existing nodes.

| routing_key | Primary Use | Default Signal |
|---|---|---|
| traceID | Trace-based tail sampling | traces |
| service | Per-service load isolation | metrics |
| streamID | gRPC stream affinity | - |
streamID is used in scenarios where metrics are collected via gRPC streaming. It is used when the same gRPC stream must always be pinned to the same Collector instance, and unlike traceID and service, it is not applied by default to a specific signal type.
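The minimal-redistribution property is easy to see in code. Below is a toy ring in Python (a single hash point per node, no virtual nodes — a simplification for illustration, not the collector's actual Go implementation): adding a node relocates only the keys that fall into the new node's arc, and every relocated key moves to the new node.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash; Python's built-in hash() is salted per process.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent hash ring: one point per node, no virtual nodes."""
    def __init__(self, nodes):
        self._points = sorted((_hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._points]

    def route(self, trace_id: str) -> str:
        # First node clockwise from the key's position (wrap at the end).
        i = bisect.bisect(self._hashes, _hash(trace_id)) % len(self._points)
        return self._points[i][1]

ring3 = HashRing(["node-A", "node-B", "node-C"])
ring4 = HashRing(["node-A", "node-B", "node-C", "node-D"])

traces = [f"trace-{i:04d}" for i in range(1000)]
moved = [t for t in traces if ring3.route(t) != ring4.route(t)]

# Every relocated trace moved to the new node; all others stayed put.
assert all(ring4.route(t) == "node-D" for t in moved)
print(f"{len(moved)}/1000 traces relocated after adding node-D")
```

With modular hashing (`hash % node_count`) the same experiment relocates the vast majority of keys, which is exactly what the consistent ring avoids.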
Why Standard K8s Services Fall Short
The default behavior of a Kubernetes Service (round-robin or IPVS load balancing) is fatal for tail sampling pipelines. When spans of the same trace are distributed across multiple Collector instances, each instance holds only a portion of the trace, making it impossible to evaluate policies such as "does the total latency of this trace exceed 500ms?"
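A small simulation makes the failure mode concrete. In this sketch (collector names and the byte-sum hash are illustrative stand-ins for the real Service and hash ring), round-robin scatters the four spans of each trace across collectors, while traceID-keyed routing keeps every trace on one collector:

```python
from collections import defaultdict
from itertools import cycle

collectors = ["col-0", "col-1", "col-2"]
# 3 traces × 4 spans each.
spans = [(f"trace-{t}", f"span-{s}") for t in range(3) for s in range(4)]

# Round-robin (default Service behaviour): record which collectors see each trace.
rr_targets = defaultdict(set)
rr = cycle(collectors)
for trace_id, _ in spans:
    rr_targets[trace_id].add(next(rr))

# traceID-keyed routing: a stable hash pins every span of a trace to one collector.
keyed_targets = defaultdict(set)
for trace_id, _ in spans:
    keyed_targets[trace_id].add(collectors[sum(trace_id.encode()) % 3])

print({t: sorted(c) for t, c in rr_targets.items()})     # traces scattered
print({t: sorted(c) for t, c in keyed_targets.items()})  # one collector per trace
```

Under round-robin, no collector ever holds a complete trace, so a "total latency > 500ms" policy has nothing complete to evaluate.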
Overall 2-Tier Architecture
[Application / Agent]
│
[Tier 1 Collector] ← loadbalancingexporter (consistent hash by traceID)
│
[Tier 2 Collector] ← tailsamplingprocessor (same traceID → same instance guaranteed)
│
[Backend (Jaeger / Grafana Tempo, etc.)]

- Tier 1: Hashes received spans by traceID and forwards them to a specific Tier 2 instance. Freely horizontally scalable, holds almost no state.
- Tier 2: All spans for the same traceID are gathered at the same instance, enabling accurate application of tail sampling policies. Spans are held in memory during `decision_wait`.
Key point: Tier 1 and Tier 2 must be separated into distinct Deployments. Combining them in the same Pod will interrupt data reception entirely during Tier 2 redeployment.
Practical Application
Example 1: Dynamically Resolving Pod IPs with DNS Resolver
Using a Kubernetes headless service, you can directly receive each Pod's IP via DNS A records. The dns resolver of loadbalancingexporter periodically re-queries this list of IPs and updates the hash ring.
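Conceptually, the resolver's loop is "resolve all A records, diff against the current backends, rebuild the ring on change". The lookup step can be sketched with the Python standard library (an illustrative stand-in for the collector's Go implementation, not its actual code):

```python
import socket

def resolve_backends(hostname: str, port: int) -> set[str]:
    """Return the set of IPs behind a hostname — one A record per ready Pod
    when the Service is headless (clusterIP: None)."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return {sockaddr[0] for _, _, _, _, sockaddr in infos}

# Against a headless service this returns multiple Pod IPs; against a normal
# Service it returns just the single ClusterIP — the failure mode the
# nslookup check later in this article detects.
print(resolve_backends("localhost", 4317))
```

Re-running this lookup every `interval` and comparing the returned set against the previous one is, in essence, what the dns resolver does before rebuilding the hash ring.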
First, declare the Tier 2 service as headless.
# tier2-headless-service.yaml
apiVersion: v1
kind: Service
metadata:
name: otelcol-tier2-headless
namespace: observability
spec:
clusterIP: None # Key setting for headless service
selector:
app: otelcol-tier2
ports:
- name: otlp-grpc
port: 4317
      targetPort: 4317

Next, configure the dns resolver on the Tier 1 Collector.
# tier1-collector-config.yaml
exporters:
loadbalancing:
routing_key: traceID
protocol:
otlp:
timeout: 1s
tls:
insecure: true
resolver:
dns:
hostname: otelcol-tier2-headless.observability.svc.cluster.local
port: 4317
interval: 5s # DNS re-query interval (time to reflect Pod scaling events)
timeout: 1s
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
      exporters: [loadbalancing] # connect with the exporter named 'loadbalancing'

Note: Without `clusterIP: None`, DNS returns only a single ClusterIP, resulting in no load balancing at all. After applying the configuration, it is recommended to run `kubectl exec -it <pod> -n observability -- nslookup otelcol-tier2-headless.observability.svc.cluster.local` and verify that multiple A records are returned.
| Configuration Item | Role | Recommended Value |
|---|---|---|
| hostname | Headless service FQDN | Full path including namespace recommended |
| interval | DNS re-query interval | 5–30s (adjust based on topology change frequency) |
| timeout | DNS query timeout | 1–3s |
| clusterIP: None | Headless declaration | Must be None to return Pod IPs |
Example 2: Instantly Reflecting Topology Changes with k8s Resolver
The k8s resolver watches Kubernetes EndpointSlices and reflects Pod additions and removals much faster than DNS polling. With the existing Endpoints resource being deprecated in Kubernetes 1.33+, the EndpointSlice-based k8s resolver is now recommended.
# tier1-collector-config.yaml (k8s resolver version)
exporters:
loadbalancing:
routing_key: traceID
protocol:
otlp:
timeout: 1s
tls:
insecure: true
resolver:
k8s:
service: otelcol-tier2.observability
ports:
- 4317
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
      exporters: [loadbalancing]

Using the k8s resolver requires RBAC permissions on the Collector ServiceAccount. A ClusterRole is used because EndpointSlice watching may need to cross namespace boundaries; a namespace-scoped Role cannot watch endpoints in other namespaces.
# rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: otelcol-tier1-role
rules:
- apiGroups: ["discovery.k8s.io"]
resources: ["endpointslices"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: otelcol-tier1-rolebinding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: otelcol-tier1-role
subjects:
- kind: ServiceAccount
name: otelcol-tier1
  namespace: observability

| Comparison | dns resolver | k8s resolver |
|---|---|---|
| Reflection method | Periodic DNS polling | EndpointSlice watch (event-based) |
| Reflection speed | Seconds to tens of seconds depending on interval setting | Nearly immediate (within a few seconds) |
| Additional setup | Headless service required | RBAC required |
| K8s version dependency | None | 1.21+ (EndpointSlice GA) |
Example 3: Integrating 2-Level Resiliency with Health Checks
In the default configuration, retry and queue are disabled. Spans routed while a Tier 2 Pod is restarting are immediately lost. In production environments, 2-level Resiliency must be explicitly enabled.
num_consumers is the number of goroutines that simultaneously consume spans from the queue; it is recommended to start at 2–3x the number of Tier 2 Pods. queue_size can be estimated as the number of incoming batches per second × expected maximum retry time in seconds. For example, in an environment with 100 batches per second and a maximum retry time of 10 seconds, 1000 is a reasonable baseline.
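These sizing rules reduce to two small formulas (the helper names are hypothetical; the numbers mirror the article's own example of 100 batches/s and a 10-second retry window):

```python
def recommended_num_consumers(tier2_pods: int, factor: int = 2) -> int:
    # Start at 2–3x the number of Tier 2 Pods, then tune under real load.
    return tier2_pods * factor

def estimate_queue_size(batches_per_sec: float, max_retry_secs: float) -> int:
    # queue_size ≈ incoming batches per second × worst-case retry window.
    return int(batches_per_sec * max_retry_secs)

print(recommended_num_consumers(4, factor=3))  # 12 consumers for 4 Tier 2 Pods
print(estimate_queue_size(100, 10))            # 1000, the article's baseline
```

Treat both outputs as starting points: if the queue-full log messages described later appear under normal load, the queue is undersized for your actual retry behavior.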
# tier1-collector-config.yaml (with Resiliency settings)
exporters:
loadbalancing:
routing_key: traceID
resolver:
dns:
hostname: otelcol-tier2-headless.observability.svc.cluster.local
interval: 5s
# Resiliency Level 1: Loadbalancer-level retry
queue:
enabled: true
num_consumers: 10 # Start at 2–3x the number of Tier 2 Pods
queue_size: 1000 # Estimate as batches/sec × max retry time (seconds)
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
protocol:
otlp:
timeout: 5s
# Resiliency Level 2: Sub-exporter (per-connection) level retry
retry_on_failure:
enabled: true
initial_interval: 1s
max_interval: 10s
sending_queue:
enabled: true
queue_size: 100
extensions:
health_check:
    endpoint: 0.0.0.0:13133 # Expose /health endpoint

Note: With this configuration alone, the queue is reset on Collector restart. Because `loadbalancingexporter` does not support a Persistent Queue, all sub-exporters share the same in-memory queue. To minimize the span-loss window during Collector restarts, it is recommended to pair this with a fast recovery strategy for Tier 2 (RollingUpdate deployment, preStop hooks, etc.).
2-Level Resiliency Model: Level 2 (sub-exporter) handles connectivity issues with individual Tier 2 instances, while Level 1 (loadbalancer) manages retries from a whole-pipeline perspective. Configuring both levels together maximizes fault isolation.
The following configuration integrates the health check endpoint with Kubernetes liveness/readiness probes.
# tier1-deployment.yaml (probe section)
livenessProbe:
httpGet:
path: /health
port: 13133
initialDelaySeconds: 10
periodSeconds: 15
readinessProbe:
httpGet:
path: /health
port: 13133
initialDelaySeconds: 5
  periodSeconds: 10

Example 4: Completing the Tier 2 Collector Configuration
The Tier 1 configuration alone does not complete the full pipeline. For actual tail sampling to occur in Tier 2, a configuration including tailsamplingprocessor is required.
What is Tail Sampling? A method where the sampling decision is made after all spans of a trace have been collected, by examining the full trace data. Condition-based policies such as latency threshold exceeded or error occurrence are possible, but the prerequisite is that all spans must be gathered on the same instance.
# tier2-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
processors:
tail_sampling:
# decision_wait: time to wait after receiving the last span before making a sampling decision
# All spans of the trace are held in memory during this time
decision_wait: 30s
num_traces: 50000 # Maximum number of traces to hold in memory simultaneously
policies:
- name: errors-policy
type: status_code
status_code:
status_codes: [ERROR]
- name: latency-policy
type: latency
latency:
threshold_ms: 500
- name: probabilistic-policy
type: probabilistic
probabilistic:
sampling_percentage: 10 # Remaining traces sampled at 10%
exporters:
otlp:
endpoint: jaeger-collector:4317
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [tail_sampling]
exporters: [otlp]
`decision_wait` and the memory trade-off: A higher `decision_wait` value improves span collection completeness, but all traces are held in memory for that duration. Memory consumption is roughly traces per second × `decision_wait` (seconds) × average span size, and when the `num_traces` limit is exceeded, the oldest traces are evicted first. Advanced tuning of `tailsamplingprocessor` will be covered in detail in a future article.
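Plugging hypothetical numbers into that estimate shows the order of magnitude involved. All inputs below are assumptions for illustration, and the formula is extended with an explicit spans-per-trace term:

```python
def tail_sampling_buffer_bytes(traces_per_sec: float, decision_wait_s: float,
                               spans_per_trace: float, avg_span_bytes: float) -> float:
    """Rough upper bound: every in-flight trace is buffered for the full window."""
    return traces_per_sec * decision_wait_s * spans_per_trace * avg_span_bytes

# Assumed load: 500 traces/s, 30s decision_wait, 20 spans/trace, 1 KiB/span
in_flight = 500 * 30                 # 15,000 in-flight traces (under a 50,000 num_traces cap)
buffer = tail_sampling_buffer_bytes(500, 30, 20, 1024)
print(f"{in_flight} in-flight traces, ~{buffer / 2**30:.2f} GiB of buffered spans")
```

Running the numbers like this before choosing `decision_wait` and `num_traces` makes it much easier to set sensible container memory requests for Tier 2.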
Example 5: Real-Time Monitoring of Traffic Skew
You can monitor the variance in received span counts across Tier 2 instances using Prometheus metrics.
# Span send rate per Tier 2 instance (over 5 minutes)
sum by (exporter) (
rate(otelcol_exporter_sent_spans_total[5m])
)

How to verify labels: The actual `exporter` label values in your metrics may appear as `loadbalancing/0`, `loadbalancing/1`, or differ by environment. It is recommended to first check the actual label keys and values with `label_values(otelcol_exporter_sent_spans_total, exporter)` and then adjust the query accordingly.
If there is a large variance between instances, it is recommended to first check the following two things:
- Check traceID generation method: Non-standard methods that generate sequential IDs can concentrate traffic on a specific node in the hash ring. A random 128-bit ID following the W3C Trace Context standard is recommended.
- Check routing_key configuration: Review whether a routing_key different from your intent has been configured.
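As a quick triage aid, the per-instance rates returned by the PromQL query can be reduced to a single max/min skew ratio. The label values below are hypothetical sample data, and the 2x threshold matches the rule of thumb from the conclusion of this article:

```python
def skew_ratio(rates_by_exporter: dict) -> float:
    """Max/min span rate across Tier 2 backends; ~1.0 means an even spread."""
    rates = [r for r in rates_by_exporter.values() if r > 0]
    return max(rates) / min(rates)

# Assumed output of: sum by (exporter) (rate(otelcol_exporter_sent_spans_total[5m]))
rates = {
    "loadbalancing/10.0.1.5": 950.0,
    "loadbalancing/10.0.1.6": 1010.0,
    "loadbalancing/10.0.1.7": 980.0,
    "loadbalancing/10.0.1.8": 2100.0,  # suspiciously hot backend
}
print(f"skew ratio: {skew_ratio(rates):.2f}")  # > 2.0 → investigate traceID randomness
```

A persistently high ratio points back at the two checks above: non-random traceIDs or a misconfigured routing_key.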
Pros and Cons Analysis
Advantages
| Item | Detail |
|---|---|
| Tail sampling accuracy | All spans of the same traceID are gathered at the same Collector, enabling accurate sampling policy evaluation |
| Dynamic topology support | dns/k8s/aws_cloud_map resolvers automatically reflect Pod scaling events |
| Deterministic routing | Multiple Tier 1 instances with the same configuration always produce the same routing result |
| Multi-signal support | Can process traces, metrics, and logs; routing_key can be configured differently per signal type |
| Flexible routing keys | Three routing_key options — traceID / service / streamID — can be used selectively |
Disadvantages and Caveats
| Item | Detail | Mitigation |
|---|---|---|
| Unsafe defaults | retry/queue disabled by default; immediate data loss on failure | Explicitly enable 2-level retry/queue |
| No Persistent Queue support | All sub-exporters share the same in-memory queue; per-exporter persistent queues are not possible | Set sufficient queue size; pair with fast Tier 2 recovery strategy |
| DNS TTL delay | dns resolver introduces a delay of seconds to tens of seconds in reflecting topology changes | Optimize interval value or consider switching to k8s resolver |
| k8s resolver RBAC | Without EndpointSlice watch permissions, the backend list cannot be updated | Explicitly specify endpointslices permission in ClusterRole |
| Hash skew risk | Non-standard traceID generation (sequential IDs, etc.) can concentrate traffic on a specific Tier 2 | Use W3C standard random 128-bit traceIDs |
| v0.141.0 dns:/// bug | Regression exists that disables gRPC-level DNS round-robin configuration | Recommended to use v0.145.0 or later |
Most Common Mistakes in Practice
- Failing to diagnose the cause after not configuring a headless service: Using the dns resolver without `clusterIP: None` returns only a single ClusterIP, resulting in no load balancing whatsoever. If running `nslookup otelcol-tier2-headless.observability.svc.cluster.local` from inside a Pod after applying the configuration returns only a single IP, the headless configuration is missing. It is recommended to verify with `kubectl get svc -n observability` that the `CLUSTER-IP` column shows `None`.
- Silent data loss from trusting retry/queue defaults: With default settings, a Tier 2 Pod restart alone causes spans worth seconds to tens of seconds of traffic to be silently dropped with no errors. If you see `"Dropping data because sending_queue is full"` or connection failure messages in Collector logs, that is a signal that Resiliency settings are missing. It is recommended to use a checklist to verify that the 2-level Resiliency configuration is included before deploying to production.
- Difficulty tracing symptoms after not applying k8s resolver RBAC: Without `endpointslices` permission in the ClusterRole, the resolver cannot update the backend list and continues routing only to existing Pod IPs. Symptoms may include traffic still being directed to old IPs even after Pods have been replaced. It is recommended to check resolver-related logs with `kubectl logs <tier1-pod> -n observability | grep -i "endpoint\|resolver\|backend"`.
Conclusion
The loadbalancingexporter 2-tier architecture is a pattern recommended by the official OpenTelemetry documentation and Grafana that simultaneously achieves tail sampling accuracy and horizontal scalability.
Three steps you can take right now:
- Inspect your current configuration: Check the resolver method with `kubectl exec -it <tier1-pod> -n observability -- cat /conf/config.yaml`, and verify with `kubectl get svc -n observability` that the `CLUSTER-IP` column for the Tier 2 service shows `None`. If it is a headless service, also run `nslookup` from inside a Pod to confirm that multiple A records are returned.
- Upgrade version and enable Resiliency: If your current Collector version is around v0.141.0, it may be affected by the dns:/// regression bug, so upgrading to v0.145.0 or later is recommended. Then explicitly add `retry_on_failure` and `queue` settings to the Tier 1 configuration file and deploy. After deployment, you can verify that the resolver is correctly detecting backends by looking for messages of the form `"backend <IP>:4317 added"` in the Collector logs.
- Add traffic skew monitoring: Add the query `sum by (exporter) (rate(otelcol_exporter_sent_spans_total[5m]))` to Prometheus and visualize it on a Grafana dashboard. First confirm the actual label values with `label_values(otelcol_exporter_sent_spans_total, exporter)` and then adjust the query. If the variance between instances persistently exceeds 2x, it is recommended to re-examine the traceID generation method or the routing_key configuration.
Next article: Deep dive into `tailsamplingprocessor` policies — designing composite policies based on latency, error rate, and attributes, and optimizing memory usage by tuning `decision_wait`
Appendix: Cloud Map Resolver for AWS ECS/Fargate Environments
In AWS ECS/Fargate environments outside of Kubernetes, the aws_cloud_map resolver can be used.
resolver:
aws_cloud_map:
namespace: "otel-collectors"
service_name: "tier2-collector"
port: 4317
interval: 30s
    timeout: 5s

References
- loadbalancingexporter README | OpenTelemetry Collector Contrib
- Collector Scaling Guide | OpenTelemetry Official
- Gateway Deployment Pattern | OpenTelemetry Official
- Tail Sampling Concepts and 2-Tier Architecture | OTel Official Blog
- Scale Alloy tail sampling | Grafana Official Documentation
- OpenTelemetry Resiliency Guide | OpenTelemetry Official
- k8s resolver EndpointSlice migration issue #40871 | GitHub
- dns:/// scheme regression issue #14372 | GitHub
- k8s resolver example | OpenTelemetry Collector Contrib GitHub
- OpenTelemetry Collector Reference Architectures | Elastic