How to Automatically Validate Agent Quality in CI with LLM-as-Judge and OpenTelemetry
You've probably had the experience of finding out — through a user complaint — that response quality suddenly dropped after a minor prompt change. I have too. I put an agent into production, assumed it would "just work," and then a few days later got a Slack message from the CS team saying "the responses seem off," sending me scrambling through logs. Unlike ordinary software, LLM-based agents produce probabilistic outputs, depend on external model APIs, and have opaque internal reasoning paths. That's why conventional testing approaches will always hit a wall.
This article is aimed at anyone who has built at least one agent that calls an LLM API. It covers how to automatically score response quality with LLM-as-Judge, trace latency bottlenecks at the span level with OpenTelemetry, and wire both into CI/CD so that every prompt change is automatically gated on quality — including real code and tool selection criteria. We'll also look at what roles RAGAS, DeepEval, Langfuse, and Arize Phoenix each play.
The industry is already moving fast. According to LangChain's 2025 State of AI Agents survey, more than half (53.3%) of teams that have deployed agents to production are already running LLM-as-Judge. By the end of this article, you'll be able to attach automatic quality gating to a RAG (Retrieval-Augmented Generation) pipeline and validate tool call ordering in multi-step agents.
Core Concepts
What is an Agent Eval Pipeline
Verifying that an agent is working "correctly" is a different beast from simple unit testing. You're not just looking at output accuracy — you need to measure multi-turn reasoning paths, tool call ordering, response latency distributions, and even token costs. An agent eval pipeline is the structure that automates all of this and systematically collects and analyzes the results.
The big picture of the pipeline looks like this:
```
User request
  ↓
Agent execution (tool calls, reasoning, response generation)
  ↓
Trace collection (OpenTelemetry → Langfuse / Arize Phoenix)
  ↓
Judge scoring (LLM-as-Judge → quality metrics)
  ↓
CI gating (block merge if below threshold)
  ↓
Production dashboard (Grafana / Datadog)
```

In practice, the most realistic starting point is to run offline eval (against a pre-built golden dataset) and online eval (sampling production traffic) as separate processes. Examples 1–4 below correspond to the stages of this pipeline: Example 1 is judge scoring, Example 2 is trace collection, Example 3 is multi-step agent validation, and Example 4 is CI gating.
LLM-as-Judge: Having a Model Score Another Model
LLM-as-Judge is an approach where a judge LLM evaluates the output of another LLM. Compared to human labeling, it's 500–5,000× cheaper, and while conditions vary by task type and judge model, a well-designed judge is known to achieve around 80% agreement with human preferences. It's a practical response to the real-world dilemma of "we can't afford expensive human evaluators, but we don't want to give up on measuring quality."
Judges fall into four types depending on what's being evaluated:
| Type | Role | Typical Use Case |
|---|---|---|
| Judge for Models | Compare response quality across models | Model swap decisions |
| Judge for Data | Validate training/evaluation dataset quality | Data cleaning before fine-tuning |
| Judge for Agents | Evaluate reasoning path and tool call appropriateness | Agent regression prevention |
| Judge for Reasoning | Verify chain-of-thought (CoT) consistency | Complex reasoning tasks |
LLM-as-Judge: The idea of "AI evaluating AI" feels strange at first, but the key is how clearly you design the judge prompt. Judges carry their own training data and biases, so blindly trusting them without bias correction is dangerous.
But judges have pitfalls too. Three major biases quantified in a 2025 LLM-as-Judge bias study (ScienceDirect) are worth knowing for anyone doing this in practice:
- Position Bias: A tendency to favor answers presented first. Disagreement can reach up to 40%. Honestly, I didn't know about this bias at first and trusted scores at face value — then I was caught off guard when swapping the order caused scores to shift dramatically.
- Verbosity Bias: Longer, more detailed answers receive about 15% over-scoring. This can be mitigated by explicitly stating "evaluate accuracy only, regardless of length" in the judge prompt (rubric).
- Self-Enhancement Bias: A 5–7% preference for responses from the same model family. This is what happens when you score with GPT-4o and GPT-family responses get an edge.
Rubric: The scoring criteria included in the judge prompt. Being specific — like "evaluate accuracy, conciseness, and contextual fit each as 0/1" rather than "choose a score from 1 to 5" — reduces bias.
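To make the rubric concrete, here is a minimal judge sketch. The prompt wording, the gpt-4o-mini judge choice, and the line-based score parsing are illustrative assumptions, not a fixed recipe:

```python
from openai import OpenAI

client = OpenAI()

# Binary rubric: each criterion is scored 0/1, independent of answer length.
JUDGE_PROMPT = """You are a strict evaluator. Score the ANSWER to the QUESTION.
Evaluate accuracy only, regardless of answer length.
Return exactly three lines:
accuracy: 0 or 1
conciseness: 0 or 1
context_fit: 0 or 1

QUESTION: {question}
ANSWER: {answer}"""

def judge(question: str, answer: str) -> dict[str, int]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap judge; cross-check with a second model family
        temperature=0,        # deterministic scoring
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    # Naive parsing: assumes the judge follows the three-line output format
    scores = {}
    for line in response.choices[0].message.content.strip().splitlines():
        key, _, value = line.partition(":")
        scores[key.strip()] = int(value.strip())
    return scores
```

Binary 0/1 criteria like these are easier to calibrate against human labels than a 1–5 scale, which is exactly the point of the rubric advice above.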
Latency Tracing: Dissecting Bottlenecks at the Span Level
Agents that call LLM APIs hide latency bottlenecks in multiple places. Is the retriever slow? Is the LLM itself the bottleneck? Is it the response formatting? Looking only at total elapsed time makes it impossible to pinpoint the cause. Latency tracing breaks the entire request path into spans and records the delay at each stage in a structured way.
Three key metrics worth knowing:
| Metric | Meaning | Practical Target |
|---|---|---|
| TTFT (Time to First Token) | Time until the first token arrives | p95 ≤ 0.6 s for chat UIs |
| TPOT (Time Per Output Token) | Token generation speed | Determines streaming experience quality |
| E2E Latency | Full request-response cycle | Set SLO per service type |
A concept gaining traction since 2025 is Goodput. Rather than simply optimizing tokens per second (throughput), the idea is to use "the fraction of requests that simultaneously satisfy TTFT and TPOT SLOs" as the KPI. Expressed as a formula: Goodput = requests satisfying SLO / total requests. With NVIDIA, Anyscale, and BentoML adopting it as a core metric, it's spreading across the industry.
Goodput: Even if a server generates tokens quickly, it means nothing if it doesn't satisfy users' perceived SLOs. It's a concept that unifies engineering metrics and user experience metrics into one.
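The formula is easy to operationalize. A minimal sketch, assuming per-request TTFT and TPOT measurements are already collected (the SLO values are borrowed from the table above and are assumptions to tune per service):

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # average time per output token, in seconds

# Illustrative SLOs; tune per service type
TTFT_SLO_S = 0.6
TPOT_SLO_S = 0.05  # roughly 20 tokens/s of perceived streaming speed

def goodput(requests: list[RequestMetrics]) -> float:
    """Goodput = requests satisfying all SLOs / total requests."""
    if not requests:
        return 0.0
    ok = sum(1 for r in requests if r.ttft_s <= TTFT_SLO_S and r.tpot_s <= TPOT_SLO_S)
    return ok / len(requests)
```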
The industry standard has settled on OpenTelemetry's GenAI semantic conventions. Proposed in late 2024, this convention has been integrated into major tools including Langfuse (open-source LLM observability platform), Arize Phoenix (agent trajectory visualization tool), and Traceloop. The instrumentation overhead is under 1 ms per call — negligible compared to LLM API latency (100 ms to 30 s).
Practical Application
Example 1: RAG Quality Gating — Automatic Quality Checks on Every PR
RAG systems share a familiar set of failure modes: you tweak the prompt slightly and suddenly get answers that contradict the retrieved content, or irrelevant documents slip into the context. Both can be caught automatically on every PR.
The setup below uses RAGAS (a framework specialized for RAG evaluation) to measure four metrics and blocks deployment if any of them falls below its threshold.
```python
from ragas import evaluate
from ragas.metrics import (
    # These lowercase metric objects follow the ragas 0.1-style API;
    # newer releases also expose class-based metrics, so check your version.
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Sample data — in practice, build this by selecting "good" cases from production logs
questions = ["What is your return policy?", "How long does shipping take?"]
generated_answers = [
    "Returns are accepted within 30 days of purchase.",
    "Standard shipping takes 3–5 business days.",
]
retrieved_contexts = [
    ["Our return policy allows returns within 30 days of purchase..."],
    ["Standard shipping takes 3–5 business days..."],
]
ground_truths = ["Returns accepted within 30 days of purchase", "3–5 business days"]

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": ground_truths,
})

result = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

# Threshold check — CI failure blocks deployment if below threshold
thresholds = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.80,
}
for metric, threshold in thresholds.items():
    score = result[metric]
    if score < threshold:
        raise SystemExit(f"❌ {metric} = {score:.3f} < {threshold} — deployment blocked")
print("✅ All quality metrics passed")
```

What each metric means:
| Metric | What It Measures | Low Score Suggests |
|---|---|---|
| Faithfulness | Is the generated answer faithful to the retrieved context? | Potential hallucination |
| Answer Relevancy | How relevant is the answer to the question? | Answer drifting off-topic |
| Context Precision | What fraction of retrieved documents were actually needed? | Retriever over-fetching |
| Context Recall | Was the necessary information retrieved? | Retriever missing content |
Attaching eval results as trace annotations in Langfuse makes it much easier to later analyze which question types are scoring low.
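A minimal sketch of that annotation, assuming the Langfuse Python SDK's v2-style score API and a trace_id captured during agent execution (both are assumptions to verify against your SDK version):

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

trace_id = "..."  # the trace recorded while the agent ran (placeholder)

# Attach a RAGAS metric to the trace so low scores can be filtered in the UI
langfuse.score(
    trace_id=trace_id,
    name="faithfulness",
    value=0.92,
    comment="RAGAS offline eval, golden set v3",
)
```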
Example 2: LLM Span Instrumentation Setup — Tracing Every Call with OpenTelemetry
Honestly, when I first tried to connect OpenTelemetry to LLMs, I wasn't sure where to start. It's simpler than you'd think. The key is using the exact attribute names defined in the GenAI semantic conventions.
```python
from openai import OpenAI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Tracer initialization
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-pipeline")

llm_client = OpenAI()

def call_llm_with_tracing(prompt: str, model: str = "gpt-4o") -> str:
    with tracer.start_as_current_span("llm-call") as span:
        # GenAI semantic convention attributes
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", 1024)
        response = llm_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # Record token usage
        span.set_attribute(
            "gen_ai.usage.input_tokens",
            response.usage.prompt_tokens,
        )
        span.set_attribute(
            "gen_ai.usage.output_tokens",
            response.usage.completion_tokens,
        )
        return response.choices[0].message.content
```

If writing this manually every time feels tedious, openllmetry (an OpenTelemetry-based LLM auto-instrumentation library maintained by Traceloop) is worth trying. It automatically instruments major libraries including OpenAI, LangChain, and LlamaIndex.
```python
from traceloop.sdk import Traceloop

Traceloop.init(app_name="my-agent", disable_batch=False)
# All subsequent LLM calls will automatically generate spans
```

Seeing each LLM call's token count and latency visualized at the span level in the Langfuse dashboard for the first time is quite impressive. You can see at a glance which stage is consuming the most time.
Example 3: Multi-Step Agent Tool Call Validation — Automatically Verifying Order and Correctness
When an agent must operate in the sequence CRM lookup → ticket creation → email send, the agent metrics in DeepEval (an LLM evaluation framework developed by Confident AI) help automatically verify that the order is correct.
When I used this on a real team, tool call ordering errors turned out to happen more often than expected. There were cases where the call order got reversed after a minor prompt edit — without automatic validation, we would have caught them much later.
```python
from deepeval import evaluate
from deepeval.metrics import TaskCompletionMetric, ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

# Collect the trajectory after agent execution; agent_response holds the
# agent's final answer (produced by your own agent runner).
test_case = LLMTestCase(
    input="Process a refund request for customer ID 1234",
    actual_output=agent_response,
    # Actual tool call sequence (verify parameter and metric names against
    # your DeepEval version; recent releases use ToolCall objects)
    tools_called=[
        ToolCall(name="get_customer_crm"),
        ToolCall(name="create_ticket"),
        ToolCall(name="send_email"),
    ],
    # Expected tool call sequence
    expected_tools=[
        ToolCall(name="get_customer_crm"),
        ToolCall(name="create_ticket"),
        ToolCall(name="send_email"),
    ],
)

task_metric = TaskCompletionMetric(threshold=0.8, model="gpt-4o")
tool_metric = ToolCorrectnessMetric(threshold=0.9)
evaluate([test_case], metrics=[task_metric, tool_metric])
```

Execution Trajectory: The complete record of tool call ordering, arguments, and intermediate results the agent chose while performing a task. It explains "why that result occurred" far better than looking at the final response alone.
Adding logic to route failure cases to a human review queue in Langfuse with a needs_human_review tag lets you strike a balance between full automation and human oversight.
Example 4: Inserting an Eval Step into the CI Pipeline — Automatic Scoring on Every Prompt Change
This is a workflow that automatically runs judge scoring against 100 golden dataset cases every time a prompt-editing PR is opened.
```yaml
# .github/workflows/llm-eval.yml
name: LLM Eval CI

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/agent/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install deepeval ragas
      - name: Run LLM Evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        # NOTE: flag names below are illustrative; check `deepeval test run --help`
        # for your version. Thresholds are usually set on the metrics themselves.
        run: |
          deepeval test run tests/eval/ \
            --min-success-rate 0.85 \
            --model gpt-4o-mini
      - name: Upload eval results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval_results.json
```

If the success rate falls below 85%, CI fails and the merge is blocked. Using gpt-4o-mini as the judge significantly cuts costs — scoring 100 cases runs about $0.01–$0.05 (as of writing; subject to change with model pricing). Using the `paths:` trigger to limit eval runs to only PRs that change prompt files also avoids unnecessary costs.
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Cost efficiency | 500–5,000× cheaper than human labeling, executable immediately |
| Subjective quality evaluation | Can measure quality dimensions hard to capture with rules — tone, fluency, contextual fit |
| Flexible evaluation criteria | Adding a new criterion only requires editing the judge prompt |
| Automation integration | Insert into CI/CD to automatically detect regressions on every prompt or model change |
| Tracing overhead | OpenTelemetry instrumentation overhead under 1 ms — negligible impact on live services |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Position Bias | Favors answers presented first (up to 40% disagreement) | Evaluate twice with swapped order, then average |
| Verbosity Bias | Over-scores longer, more detailed answers (~15%) | Explicitly state "evaluate accuracy only, regardless of length" in the rubric |
| Self-Enhancement Bias | Prefers responses from the same model family (5–7%) | Cross-validate with heterogeneous judges (e.g., GPT-4o + Claude) |
| Preference Leakage | The subject of evaluation may be present in the judge's training data | Conduct periodic data contamination audits |
| Judge Drift | The judge itself may change in performance over time | Regularly conduct meta-evaluation of judge-human agreement rates |
| Sampling costs | 100% sampling in high-load environments creates storage and cost burdens | Use 5–10% sampling in production, expand on anomaly detection |
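The sampling mitigation in the last row is easy to sketch. A minimal version of "sample a fixed rate, but always evaluate anomalies" (the rate and the anomaly thresholds are illustrative assumptions):

```python
import random

SAMPLE_RATE = 0.05  # evaluate roughly 5% of production traffic online

def should_eval(latency_s: float, error: bool) -> bool:
    # Anomalies always go to eval; the 10 s cutoff is an illustrative threshold
    if error or latency_s > 10.0:
        return True
    # Everything else is sampled at SAMPLE_RATE
    return random.random() < SAMPLE_RATE
```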
The Most Common Mistakes in Practice
- Trusting scores at face value without correcting for judge bias. The same content scored with only the response order swapped can differ by up to 40%. At a minimum, applying order-swap averaging is recommended (see the sketch after this list).
- Monitoring only p50 latency. The average may look fine while p99 contains requests taking 30 seconds each. Setting SLOs at the p95/p99 level better reflects actual user experience.
- Building offline eval only and skipping production monitoring. Even if the golden set passes, real traffic brings different types of questions. Running online sampling eval in parallel lets you quickly detect post-deployment drift.
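Order-swap averaging itself fits in a few lines. In this sketch, pairwise_judge is a hypothetical helper that returns a 0–1 preference score for the answer shown in the first position:

```python
def debiased_pairwise_score(question, answer_a, answer_b, pairwise_judge) -> float:
    """Judge A vs. B twice with positions swapped, then average.

    Averaging the two runs cancels position bias to first order.
    """
    score_a_first = pairwise_judge(question, answer_a, answer_b)
    # Swap positions and invert, so both runs express "preference for A"
    score_b_first = 1.0 - pairwise_judge(question, answer_b, answer_a)
    return (score_a_first + score_b_first) / 2
```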
Closing Thoughts
The core of an agent eval pipeline is building a structure that detects when something has gone wrong before your users do. Trying to build a perfect pipeline from day one will actually prevent you from starting anything. A gradual, iterative approach is far more realistic.
Three steps you can take right now:
- Create a free Langfuse cloud account and connect openllmetry to your existing LLM calls. After `pip install traceloop-sdk`, a single line of `Traceloop.init()` is all it takes for traces to start being collected automatically.
- If you have a RAG system or agent, start by building a golden dataset of 20–50 cases. The fastest approach is picking "good" cases from your current production logs. Measuring a baseline score with RAGAS or DeepEval gives you an objective way to compare whether future changes are improvements or regressions.
- Add a `deepeval test run` step to GitHub Actions and configure it to run only on PRs that change prompt files. For the success rate threshold, start low (70–75%) and gradually raise it as your team gets comfortable.
References
Evaluation Methodology
- LLM-as-a-Judge: A Complete Guide | Evidently AI
- A Survey on LLM-as-a-Judge | ScienceDirect (2025)
- LLM-as-a-Judge | Langfuse Official Docs
- LLM-as-a-Judge Done Right: Calibrating & Debiasing | Kinde
- When AIs Judge AIs: Agent-as-a-Judge Evaluation | arXiv
- Evaluating LLM Applications: A Comprehensive Roadmap | Langfuse Blog
Tools
- 8 LLM Observability Tools to Monitor & Evaluate AI Agents | LangChain
- LLM Observability Best Practices for 2025 | Maxim AI
- Best AI Evals Tools for CI/CD in 2025 | Braintrust
CI Integration
- How to Add LLM Evaluations to CI/CD Pipelines | Arize AI
- Implement a CI/CD Pipeline using LangSmith | LangChain Official Docs
Performance Metrics and Observability
- LLM Observability with OpenTelemetry: A Practical Guide | Medium
- An Introduction to Observability for LLM-based Applications using OpenTelemetry | OpenTelemetry Official Blog
- OpenTelemetry for LLMs: Complete SRE Guide | OpenObserve
- TTFT vs Throughput: Which Metric Impacts Users More? | Clarifai
- Key Metrics for LLM Inference | BentoML LLM Inference Handbook