How to Automatically Validate Agent Quality in CI with LLM-as-Judge and OpenTelemetry
You've probably had the experience of finding out — through a user complaint — that response quality suddenly dropped after a minor prompt change. I have too. I put an agent into production, assumed it would "just work," and then a few days later got a Slack message from the CS team saying "the responses seem off," sending me scrambling through logs. Unlike ordinary software, LLM-based agents produce probabilistic outputs, depend on external model APIs, and have opaque internal reasoning paths. That's why conventional testing approaches will always hit a wall.
This article is aimed at anyone who has built at least one agent that calls an LLM API. It covers how to automatically score response quality with LLM-as-Judge, trace latency bottlenecks at the span level with OpenTelemetry, and wire both into CI/CD so that every prompt change is automatically gated on quality — including real code and tool selection criteria. We'll also look at what roles RAGAS, DeepEval, Langfuse, and Arize Phoenix each play.
The industry is already moving fast. According to LangChain's 2025 State of AI Agents survey, more than half (53.3%) of teams that have deployed agents to production are already running LLM-as-Judge. By the end of this article, you'll be able to attach automatic quality gating to a RAG (Retrieval-Augmented Generation) pipeline and validate tool call ordering in multi-step agents.
Core Concepts
What is an Agent Eval Pipeline
Verifying that an agent is working "correctly" is a different beast from simple unit testing. You're not just looking at output accuracy — you need to measure multi-turn reasoning paths, tool call ordering, response latency distributions, and even token costs. An agent eval pipeline is the structure that automates all of this and systematically collects and analyzes the results.
The big picture of the pipeline looks like this:
```
User request
  ↓
Agent execution (tool calls, reasoning, response generation)
  ↓
Trace collection (OpenTelemetry → Langfuse / Arize Phoenix)
  ↓
Judge scoring (LLM-as-Judge → quality metrics)
  ↓
CI gating (block merge if below threshold)
  ↓
Production dashboard (Grafana / Datadog)
```

In practice, the most realistic starting point is to run offline eval (against a pre-built golden dataset) and online eval (sampling production traffic) as separate processes. Examples 1–4 below correspond to the stages of this pipeline: Example 1 is judge scoring, Example 2 is trace collection, Example 3 is multi-step agent validation, and Example 4 is CI gating.
LLM-as-Judge: Having a Model Score Another Model
LLM-as-Judge is an approach where a judge LLM evaluates the output of another LLM. Compared to human labeling, it's 500–5,000× cheaper, and while conditions vary by task type and judge model, a well-designed judge is known to achieve around 80% agreement with human preferences. It's a practical response to the real-world dilemma of "we can't afford expensive human evaluators, but we don't want to give up on measuring quality."
Judges fall into four types depending on what's being evaluated:
| Type | Role | Typical Use Case |
|---|---|---|
| Judge for Models | Compare response quality across models | Model swap decisions |
| Judge for Data | Validate training/evaluation dataset quality | Data cleaning before fine-tuning |
| Judge for Agents | Evaluate reasoning path and tool call appropriateness | Agent regression prevention |
| Judge for Reasoning | Verify chain-of-thought (CoT) consistency | Complex reasoning tasks |
LLM-as-Judge: The idea of "AI evaluating AI" feels strange at first, but the key is how clearly you design the judge prompt. Judges carry their own training data and biases, so blindly trusting them without bias correction is dangerous.
But judges have pitfalls too. Three major biases quantified in a 2025 LLM-as-Judge bias study (ScienceDirect) are worth knowing for anyone doing this in practice:
- Position Bias: A tendency to favor answers presented first. Disagreement can reach up to 40%. Honestly, I didn't know about this bias at first and trusted scores at face value — then I was caught off guard when swapping the order caused scores to shift dramatically.
- Verbosity Bias: Longer, more detailed answers receive about 15% over-scoring. This can be mitigated by explicitly stating "evaluate accuracy only, regardless of length" in the judge prompt (rubric).
- Self-Enhancement Bias: A 5–7% preference for responses from the same model family. This is what happens when you score with GPT-4o and GPT-family responses get an edge.
Rubric: The scoring criteria included in the judge prompt. Being specific — like "evaluate accuracy, conciseness, and contextual fit each as 0/1" rather than "choose a score from 1 to 5" — reduces bias.
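To make the rubric concrete, here is a minimal judge sketch. The prompt wording, the gpt-4o-mini judge choice, and the line-based score parsing are illustrative assumptions, not a fixed recipe:

```python
from openai import OpenAI

client = OpenAI()

# Binary rubric: each criterion is scored 0/1, independent of answer length.
JUDGE_PROMPT = """You are a strict evaluator. Score the ANSWER to the QUESTION.
Evaluate accuracy only, regardless of answer length.
Return exactly three lines:
accuracy: 0 or 1
conciseness: 0 or 1
context_fit: 0 or 1

QUESTION: {question}
ANSWER: {answer}"""

def judge(question: str, answer: str) -> dict[str, int]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap judge; cross-check with a second model family
        temperature=0,        # deterministic scoring
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    # Naive parsing: assumes the judge follows the three-line output format
    scores = {}
    for line in response.choices[0].message.content.strip().splitlines():
        key, _, value = line.partition(":")
        scores[key.strip()] = int(value.strip())
    return scores
```

Binary 0/1 criteria like these are easier to calibrate against human labels than a 1–5 scale, which is exactly the point of the rubric advice above.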
Latency Tracing: Dissecting Bottlenecks at the Span Level
Agents that call LLM APIs hide latency bottlenecks in multiple places. Is the retriever slow? Is the LLM itself the bottleneck? Is it the response formatting? Looking only at total elapsed time makes it impossible to pinpoint the cause. Latency tracing breaks the entire request path into spans and records the delay at each stage in a structured way.
Three key metrics worth knowing:
| Metric | Meaning | Practical Target |
|---|---|---|
| TTFT (Time to First Token) | Time until the first token arrives | p95 ≤ 0.6 s for chat UIs |
| TPOT (Time Per Output Token) | Token generation speed | Determines streaming experience quality |
| E2E Latency | Full request-response cycle | Set SLO per service type |
A concept gaining traction since 2025 is Goodput. Rather than simply optimizing tokens per second (throughput), the idea is to use "the fraction of requests that simultaneously satisfy TTFT and TPOT SLOs" as the KPI. Expressed as a formula: Goodput = requests satisfying SLO / total requests. With NVIDIA, Anyscale, and BentoML adopting it as a core metric, it's spreading across the industry.
Goodput: Even if a server generates tokens quickly, it means nothing if it doesn't satisfy users' perceived SLOs. It's a concept that unifies engineering metrics and user experience metrics into one.
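The formula is easy to operationalize. A minimal sketch, assuming per-request TTFT and TPOT measurements are already collected (the SLO values are borrowed from the table above and are assumptions to tune per service):

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # average time per output token, in seconds

# Illustrative SLOs; tune per service type
TTFT_SLO_S = 0.6
TPOT_SLO_S = 0.05  # roughly 20 tokens/s of perceived streaming speed

def goodput(requests: list[RequestMetrics]) -> float:
    """Goodput = requests satisfying all SLOs / total requests."""
    if not requests:
        return 0.0
    ok = sum(1 for r in requests if r.ttft_s <= TTFT_SLO_S and r.tpot_s <= TPOT_SLO_S)
    return ok / len(requests)
```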
The industry standard has settled on OpenTelemetry's GenAI semantic conventions. Proposed in late 2024, this convention has been integrated into major tools including Langfuse (open-source LLM observability platform), Arize Phoenix (agent trajectory visualization tool), and Traceloop. The instrumentation overhead is under 1 ms per call — negligible compared to LLM API latency (100 ms to 30 s).
Practical Application
Example 1: RAG Quality Gating — Automatic Quality Checks on Every PR
RAG systems share a familiar set of failure modes: you tweak the prompt slightly and suddenly get answers that contradict the retrieved content, or irrelevant documents slip into the context. Both can be caught automatically on every PR.
The setup below uses RAGAS (a framework specialized for RAG evaluation) to measure four metrics and blocks deployment if any of them falls below its threshold.
```python
from ragas import evaluate
from ragas.metrics import (
    # These lowercase metric objects follow the ragas 0.1-style API;
    # newer releases also expose class-based metrics, so check your version.
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Sample data — in practice, build this by selecting "good" cases from production logs
questions = ["What is your return policy?", "How long does shipping take?"]
generated_answers = [
    "Returns are accepted within 30 days of purchase.",
    "Standard shipping takes 3–5 business days.",
]
retrieved_contexts = [
    ["Our return policy allows returns within 30 days of purchase..."],
    ["Standard shipping takes 3–5 business days..."],
]
ground_truths = ["Returns accepted within 30 days of purchase", "3–5 business days"]

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": ground_truths,
})

result = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

# Threshold check — CI failure blocks deployment if below threshold
thresholds = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.80,
}
for metric, threshold in thresholds.items():
    score = result[metric]
    if score < threshold:
        raise SystemExit(f"❌ {metric} = {score:.3f} < {threshold} — deployment blocked")
print("✅ All quality metrics passed")
```

What each metric means:
| Metric | What It Measures | Low Score Suggests |
|---|---|---|
| Faithfulness | Is the generated answer faithful to the retrieved context? | Potential hallucination |
| Answer Relevancy | How relevant is the answer to the question? | Answer drifting off-topic |
| Context Precision | What fraction of retrieved documents were actually needed? | Retriever over-fetching |
| Context Recall | Was the necessary information retrieved? | Retriever missing content |
Attaching eval results as trace annotations in Langfuse makes it much easier to later analyze which question types are scoring low.
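A minimal sketch of that annotation, assuming the Langfuse Python SDK's v2-style score API and a trace_id captured during agent execution (both are assumptions to verify against your SDK version):

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

trace_id = "..."  # the trace recorded while the agent ran (placeholder)

# Attach a RAGAS metric to the trace so low scores can be filtered in the UI
langfuse.score(
    trace_id=trace_id,
    name="faithfulness",
    value=0.92,
    comment="RAGAS offline eval, golden set v3",
)
```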
Example 2: LLM Span Instrumentation Setup — Tracing Every Call with OpenTelemetry
Honestly, when I first tried to connect OpenTelemetry to LLMs, I wasn't sure where to start. It's simpler than you'd think. The key is using the exact attribute names defined in the GenAI semantic conventions.
```python
from openai import OpenAI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Tracer initialization
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-pipeline")

llm_client = OpenAI()

def call_llm_with_tracing(prompt: str, model: str = "gpt-4o") -> str:
    with tracer.start_as_current_span("llm-call") as span:
        # GenAI semantic convention attributes
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", 1024)
        response = llm_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # Record token usage
        span.set_attribute(
            "gen_ai.usage.input_tokens",
            response.usage.prompt_tokens,
        )
        span.set_attribute(
            "gen_ai.usage.output_tokens",
            response.usage.completion_tokens,
        )
        return response.choices[0].message.content
```

If writing this manually every time feels tedious, openllmetry (an OpenTelemetry-based LLM auto-instrumentation library maintained by Traceloop) is worth trying. It automatically instruments major libraries including OpenAI, LangChain, and LlamaIndex.
```python
from traceloop.sdk import Traceloop

Traceloop.init(app_name="my-agent", disable_batch=False)
# All subsequent LLM calls will automatically generate spans
```

Seeing each LLM call's token count and latency visualized at the span level in the Langfuse dashboard for the first time is quite impressive. You can see at a glance which stage is consuming the most time.
Example 3: Multi-Step Agent Tool Call Validation — Automatically Verifying Order and Correctness
When an agent must operate in the sequence CRM lookup → ticket creation → email send, the agent metrics in DeepEval (an LLM evaluation framework developed by Confident AI) help automatically verify that the order is correct.
When I used this on a real team, tool call ordering errors turned out to happen more often than expected. There were cases where the call order got reversed after a minor prompt edit — without automatic validation, we would have caught them much later.
```python
from deepeval import evaluate
from deepeval.metrics import TaskCompletionMetric, ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

# Collect the trajectory after agent execution; agent_response holds the
# agent's final answer (produced by your own agent runner).
test_case = LLMTestCase(
    input="Process a refund request for customer ID 1234",
    actual_output=agent_response,
    # Actual tool call sequence (verify parameter and metric names against
    # your DeepEval version; recent releases use ToolCall objects)
    tools_called=[
        ToolCall(name="get_customer_crm"),
        ToolCall(name="create_ticket"),
        ToolCall(name="send_email"),
    ],
    # Expected tool call sequence
    expected_tools=[
        ToolCall(name="get_customer_crm"),
        ToolCall(name="create_ticket"),
        ToolCall(name="send_email"),
    ],
)

task_metric = TaskCompletionMetric(threshold=0.8, model="gpt-4o")
tool_metric = ToolCorrectnessMetric(threshold=0.9)
evaluate([test_case], metrics=[task_metric, tool_metric])
```

Execution Trajectory: The complete record of tool call ordering, arguments, and intermediate results the agent chose while performing a task. It explains "why that result occurred" far better than looking at the final response alone.
Adding logic to route failure cases to a human review queue in Langfuse with a needs_human_review tag lets you strike a balance between full automation and human oversight.
Example 4: Inserting an Eval Step into the CI Pipeline — Automatic Scoring on Every Prompt Change
This is a workflow that automatically runs judge scoring against 100 golden dataset cases every time a prompt-editing PR is opened.
```yaml
# .github/workflows/llm-eval.yml
name: LLM Eval CI

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/agent/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install deepeval ragas
      - name: Run LLM Evals
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        # NOTE: flag names below are illustrative; check `deepeval test run --help`
        # for your version. Thresholds are usually set on the metrics themselves.
        run: |
          deepeval test run tests/eval/ \
            --min-success-rate 0.85 \
            --model gpt-4o-mini
      - name: Upload eval results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval_results.json
```

If the success rate falls below 85%, CI fails and the merge is blocked. Using gpt-4o-mini as the judge significantly cuts costs — scoring 100 cases runs about $0.01–$0.05 (as of writing; subject to change with model pricing). Using the `paths:` trigger to limit eval runs to only PRs that change prompt files also avoids unnecessary costs.
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Cost efficiency | 500–5,000× cheaper than human labeling, executable immediately |
| Subjective quality evaluation | Can measure quality dimensions hard to capture with rules — tone, fluency, contextual fit |
| Flexible evaluation criteria | Adding a new criterion only requires editing the judge prompt |
| Automation integration | Insert into CI/CD to automatically detect regressions on every prompt or model change |
| Tracing overhead | OpenTelemetry instrumentation overhead under 1 ms — negligible impact on live services |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Position Bias | Favors answers presented first (up to 40% disagreement) | Evaluate twice with swapped order, then average |
| Verbosity Bias | Over-scores longer, more detailed answers (~15%) | Explicitly state "evaluate accuracy only, regardless of length" in the rubric |
| Self-Enhancement Bias | Prefers responses from the same model family (5–7%) | Cross-validate with heterogeneous judges (e.g., GPT-4o + Claude) |
| Preference Leakage | The subject of evaluation may be present in the judge's training data | Conduct periodic data contamination audits |
| Judge Drift | The judge itself may change in performance over time | Regularly conduct meta-evaluation of judge-human agreement rates |
| Sampling costs | 100% sampling in high-load environments creates storage and cost burdens | Use 5–10% sampling in production, expand on anomaly detection |
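The sampling mitigation in the last row is easy to sketch. A minimal version of "sample a fixed rate, but always evaluate anomalies" (the rate and the anomaly thresholds are illustrative assumptions):

```python
import random

SAMPLE_RATE = 0.05  # evaluate roughly 5% of production traffic online

def should_eval(latency_s: float, error: bool) -> bool:
    # Anomalies always go to eval; the 10 s cutoff is an illustrative threshold
    if error or latency_s > 10.0:
        return True
    # Everything else is sampled at SAMPLE_RATE
    return random.random() < SAMPLE_RATE
```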
The Most Common Mistakes in Practice
- Trusting scores at face value without correcting for judge bias. The same content scored with only the response order swapped can differ by up to 40%. At a minimum, applying order-swap averaging is recommended (see the sketch after this list).
- Monitoring only p50 latency. The average may look fine while p99 contains requests taking 30 seconds each. Setting SLOs at the p95/p99 level better reflects actual user experience.
- Building offline eval only and skipping production monitoring. Even if the golden set passes, real traffic brings different types of questions. Running online sampling eval in parallel lets you quickly detect post-deployment drift.
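Order-swap averaging itself fits in a few lines. In this sketch, pairwise_judge is a hypothetical helper that returns a 0–1 preference score for the answer shown in the first position:

```python
def debiased_pairwise_score(question, answer_a, answer_b, pairwise_judge) -> float:
    """Judge A vs. B twice with positions swapped, then average.

    Averaging the two runs cancels position bias to first order.
    """
    score_a_first = pairwise_judge(question, answer_a, answer_b)
    # Swap positions and invert, so both runs express "preference for A"
    score_b_first = 1.0 - pairwise_judge(question, answer_b, answer_a)
    return (score_a_first + score_b_first) / 2
```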
Closing Thoughts
The core of an agent eval pipeline is building a structure that detects when something has gone wrong before your users do. Trying to build a perfect pipeline from day one will actually prevent you from starting anything. A gradual, iterative approach is far more realistic.
Three steps you can take right now:
- Create a free Langfuse cloud account and connect openllmetry to your existing LLM calls. After `pip install traceloop-sdk`, a single line of `Traceloop.init()` is all it takes for traces to start being collected automatically.
- If you have a RAG system or agent, start by building a golden dataset of 20–50 cases. The fastest approach is picking "good" cases from your current production logs. Measuring a baseline score with RAGAS or DeepEval gives you an objective way to compare whether future changes are improvements or regressions.
- Add a `deepeval test run` step to GitHub Actions and configure it to run only on PRs that change prompt files. For the success rate threshold, start low (70–75%) and gradually raise it as your team gets comfortable.
References
Evaluation Methodology
- LLM-as-a-Judge: A Complete Guide | Evidently AI
- A Survey on LLM-as-a-Judge | ScienceDirect (2025)
- LLM-as-a-Judge | Langfuse Official Docs
- LLM-as-a-Judge Done Right: Calibrating & Debiasing | Kinde
- When AIs Judge AIs: Agent-as-a-Judge Evaluation | arXiv
- Evaluating LLM Applications: A Comprehensive Roadmap | Langfuse Blog
Tools
- 8 LLM Observability Tools to Monitor & Evaluate AI Agents | LangChain
- LLM Observability Best Practices for 2025 | Maxim AI
- Best AI Evals Tools for CI/CD in 2025 | Braintrust
CI Integration
- How to Add LLM Evaluations to CI/CD Pipelines | Arize AI
- Implement a CI/CD Pipeline using LangSmith | LangChain Official Docs
Performance Metrics and Observability
- LLM Observability with OpenTelemetry: A Practical Guide | Medium
- An Introduction to Observability for LLM-based Applications using OpenTelemetry | OpenTelemetry Official Blog
- OpenTelemetry for LLMs: Complete SRE Guide | OpenObserve
- TTFT vs Throughput: Which Metric Impacts Users More? | Clarifai
- Key Metrics for LLM Inference | BentoML LLM Inference Handbook