Building an AI Agent Monitoring & Evaluation System: Catching Quality That Silently Breaks in Production with DeepEval and Langfuse
Getting an AI agent to run well locally is not as hard as you might think. But teams that can confidently answer "Is this agent working correctly right now?" after deploying it to production are surprisingly rare. When I deployed my first agent, I remember discovering days later that it had been calling completely wrong tools in certain cases. There were no logs, no tracing.
According to LangChain's 2026 report, 57% of organizations are already deploying agents to production. At the same time, 32% of respondents cited "output quality" as the biggest barrier to deployment. They can build it, but they can't tell if it's trustworthy.
By the end of this article, you will have a concrete method for attaching tracing to an existing agent and setting up a CI pipeline with a quality gate that runs on every PR. This is different from simple LLM response quality checks. Because agents involve multi-step reasoning, external tool calls, long-context retention, and autonomous decision-making, the evaluation system needs to be equally sophisticated. I'll break it down to the level where you can copy and use the code directly.
Core Concepts
Building evaluation infrastructure has a natural order. We'll go through what to measure (evaluation dimensions), who evaluates (LLM-as-a-Judge), where to trace (tracing), and which composite metric to use (the CLEAR framework).
Why Agent Evaluation Differs from General LLM Evaluation
General LLM evaluation is simple: feed input, check output, pass if good. Agents are different. An agent receives a goal, devises its own plan, calls multiple tools in sequence, and decides the next action based on intermediate results. If something goes wrong somewhere along this long execution path, you cannot find the cause by looking at the final output alone.
That's why agent evaluation looks at six major dimensions.
| Evaluation Dimension | Description | Why It Matters |
|---|---|---|
| Task Completion Rate | Whether the given goal was achieved accurately | The most basic pass/fail criterion |
| Reasoning Path Quality | Whether the correct path was taken, not just the correct result | Distinguishes a lucky correct answer from a logically correct one |
| Hallucination Rate | How often content is generated that differs from the facts | Critical in sensitive domains like customer service, finance, and healthcare |
| Tool Call Accuracy | Whether appropriate tools are called with appropriate arguments | Incorrect tool calls create cascading errors |
| Cost & Latency | Token usage, response time, operational cost | Teams that focused only on accuracy have ended up with a 4x infrastructure bill |
| Safety & Compliance | Bias, harmful content, policy adherence | Mandatory pre-deployment check in regulated industries |
Offline vs. Online Evaluation: Offline evaluation runs against a pre-built golden dataset (a reference set of input/output pairs representing expected behavior) before deployment, while online evaluation tracks real production traffic in real time. Running both is ideal, but in practice far more teams establish observability (online) first.
LLM-as-a-Judge: Having an LLM Evaluate Instead of a Human
Honestly, human evaluation is slow and expensive. Having people review hundreds of responses for every deployment isn't realistic. That's why LLM-as-a-Judge, using a powerful LLM (GPT-4o, Claude Opus, etc.) as the evaluator, has become the industry standard.
The basic idea is simple: you ask an evaluation LLM to "score this agent response against the following criteria." Major platforms like DeepEval, Langfuse, and Arize Phoenix all support this, and implementing it yourself isn't difficult either.
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
    ToolCorrectnessMetric,
)
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What's the weather in Seoul?",
    actual_output="It is currently clear in Seoul and the temperature is 22 degrees.",
    context=["weather_api call result: {'city': 'Seoul', 'temp': 22, 'condition': 'clear'}"],
    tools_called=[
        ToolCall(name="weather_api", input_parameters={"city": "Seoul"})
    ],
    expected_tools=[
        ToolCall(name="weather_api", input_parameters={"city": "Seoul"})
    ],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7, model="gpt-4o"),
    HallucinationMetric(threshold=0.3, model="gpt-4o"),
    ToolCorrectnessMetric(),  # evaluated deterministically, no LLM API cost
]

evaluate(test_cases=[test_case], metrics=metrics)

Self-Consistency Bias Warning: Evaluating a GPT-4o-based agent with GPT-4o can inflate scores. Models from the same company tend to rate their own outputs favorably. Whenever possible, use a model from a different company as the evaluator. For a GPT-4o agent, evaluate with Claude Opus; for a Claude agent, evaluate with GPT-4o.
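In DeepEval, switching to a cross-vendor judge means handing the metric a custom evaluator model. The sketch below wraps the Anthropic SDK in DeepEval's DeepEvalBaseLLM interface; it assumes the anthropic package is installed and ANTHROPIC_API_KEY is set, and the ClaudeJudge class and model name are illustrative choices, not anything DeepEval ships.
from anthropic import Anthropic, AsyncAnthropic
from deepeval.models import DeepEvalBaseLLM
from deepeval.metrics import AnswerRelevancyMetric

class ClaudeJudge(DeepEvalBaseLLM):
    """Minimal cross-vendor judge: an Anthropic model evaluating an OpenAI-based agent."""

    def __init__(self, model_name: str = "claude-3-5-sonnet-latest"):
        self.model_name = model_name

    def load_model(self):
        return Anthropic()

    def get_model_name(self) -> str:
        return self.model_name

    def generate(self, prompt: str) -> str:
        resp = self.load_model().messages.create(
            model=self.model_name,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text

    async def a_generate(self, prompt: str) -> str:
        resp = await AsyncAnthropic().messages.create(
            model=self.model_name,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text

# Any LLM-as-a-Judge metric can now use the cross-vendor judge instead of the default model
relevancy = AnswerRelevancyMetric(threshold=0.7, model=ClaudeJudge())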
Tracing Agent Execution Paths with OpenTelemetry
The most important thing in agent debugging is having a trace — the complete execution path a single request travels from start to finish — that shows "what happened, in what order." This is one of the lessons Amazon learned from building production agents: "Debugging is impossible without step-by-step reasoning logs."
OpenTelemetry is the industry standard for collecting traces in distributed systems. In agent tracing, a trace is organized into three layers of spans (span — the individual unit of work that makes up a trace):
- Root span: The entire user request (agent execution start to finish)
- Reasoning span: Individual LLM call units (records prompt, token count, and latency)
- Tool span: External API or function call units (records input parameters, return values, and duration)
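To make that hierarchy concrete before any vendor SDK enters the picture, here is a minimal sketch using the plain OpenTelemetry Python SDK; the span names, attributes, and placeholder LLM/tool calls are all illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print spans to stdout; in production you would export to a collector instead
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

def handle_request(user_query: str) -> str:
    with tracer.start_as_current_span("agent-request"):                      # root span
        with tracer.start_as_current_span("llm.plan") as llm_span:           # reasoning span
            llm_span.set_attribute("llm.model", "gpt-4o")
            plan = "call weather_api"                                         # placeholder LLM call
        with tracer.start_as_current_span("tool.weather_api") as tool_span:  # tool span
            tool_span.set_attribute("tool.input", '{"city": "Seoul"}')
            result = {"temp": 22}                                             # placeholder tool call
        return f"It is currently {result['temp']} degrees in Seoul."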
Langfuse lets you apply this hierarchy with a single Python decorator.
from langfuse.decorators import observe, langfuse_context
from langfuse.openai import openai  # drop-in wrapper around the OpenAI client

@observe()  # tool span: each call to this function is recorded as a nested span
async def search_knowledge_base(query: str) -> str:
    # --- replace with your actual knowledge-base search ---
    return "Refunds are available within 30 days of purchase."

@observe()  # root span: this entire function is recorded as one trace
async def run_agent(user_query: str) -> str:
    langfuse_context.update_current_trace(
        name="customer-support-agent",
        user_id="user-123",
        tags=["production", "v2.1"],
    )
    # the tool call is recorded as its own span via the decorator above
    kb_results = await search_knowledge_base(user_query)
    # reasoning span: token counts and cost are recorded automatically for the LLM call
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are a customer support agent. Knowledge base results: {kb_results}"},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content

With this setup, the Langfuse dashboard gives you a clear view for each user request: which tools the agent called and in what order, how many tokens were used at each step, and where latency occurred.
The CLEAR Framework: Focusing Only on Accuracy Leads to Failure
I used to think "as long as the agent gives the right answer, that's enough" — but when you actually operate one, there's far more to care about beyond accuracy. According to the arXiv paper "Beyond Accuracy: Multi-Dimensional Framework for Evaluating Enterprise Agentic AI," agents optimized for accuracy alone can cost 4.4–10.8x more than cost-aware alternatives.
The CLEAR framework is the multi-dimensional evaluation approach proposed in that paper for enterprise environments.
| Item | Meaning | How to Measure |
|---|---|---|
| Cost | Token costs, infrastructure costs | Average cost per request, total monthly cost |
| Latency | Response time | P50/P95/P99 latency distribution |
| Efficacy | Goal achievement rate | Task completion rate, user satisfaction |
| Assurance | Safety, compliance | Policy violation rate, bias score |
| Reliability | Stability, consistency | Error rate, response consistency for identical inputs |
All five items must be viewed together on one screen. That is the key.
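What that one screen aggregates can be as simple as a per-period summary object. The sketch below assumes you can already export per-request cost, latency, outcome, policy, and error fields from your tracing backend; every field and function name here is illustrative.
from dataclasses import dataclass
from statistics import quantiles, mean

@dataclass
class ClearReport:
    cost_per_request: float       # Cost: average USD per request
    latency_p95_ms: float         # Latency: P95 in milliseconds
    task_completion_rate: float   # Efficacy: share of requests that achieved the goal
    policy_violation_rate: float  # Assurance: share of requests flagged by policy checks
    error_rate: float             # Reliability: share of requests that errored

def build_report(requests: list[dict]) -> ClearReport:
    latencies = [r["latency_ms"] for r in requests]
    return ClearReport(
        cost_per_request=mean(r["cost_usd"] for r in requests),
        latency_p95_ms=quantiles(latencies, n=20)[18],  # 95th percentile cut point
        task_completion_rate=mean(r["completed"] for r in requests),
        policy_violation_rate=mean(r["violation"] for r in requests),
        error_rate=mean(r["error"] for r in requests),
    )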
Practical Application
The three examples below all use the same customer support agent as the basis. The sequence is: first build the CI quality gate with DeepEval, then attach production tracing with Langfuse, and finally add tool call accuracy evaluation. Following this flow in order produces a complete evaluation system for a single agent.
Note: All examples below are written for Python. If you are running a JavaScript/TypeScript agent, it is recommended to consult the official JS SDK documentation for each tool alongside this guide.
Setting Up an Agent Quality Gate in CI/CD with DeepEval
One reason "pre-deployment verification" is harder than it sounds is the question of how to integrate LLM evaluation into your existing test infrastructure. DeepEval bills itself as "pytest for LLMs," and it actually uses pytest syntax directly — making it very easy to attach to an existing CI/CD pipeline.
# test_agent_eval.py
import pytest
from deepeval import assert_test
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
    GEval,
)
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# --- replace with your actual agent module ---
async def run_customer_support_agent(query: str) -> str:
    return "A full refund is available within 30 days of purchase if you bring your receipt."
# ----------------------------------------

# Golden dataset: core scenarios defined as input-context-expected-output triples
TEST_CASES = [
    {
        "input": "What is your refund policy?",
        "context": ["Full refund available within 30 days with a receipt"],
        "expected_output": "Refunds are available within 30 days of purchase",
    },
    {
        "input": "How do I cancel an order?",
        "context": ["Orders can be cancelled directly in the app before shipping; after shipping, the return process applies"],
        "expected_output": "You can cancel in the app before the order ships",
    },
]

@pytest.mark.parametrize("case", TEST_CASES)
@pytest.mark.asyncio
async def test_customer_support_agent(case):
    actual_output = await run_customer_support_agent(case["input"])
    test_case = LLMTestCase(
        input=case["input"],
        actual_output=actual_output,
        context=case["context"],
        expected_output=case["expected_output"],
    )
    # Custom evaluation criteria can be defined in natural language,
    # which surprised me the first time I wrote one
    policy_adherence = GEval(
        name="PolicyAdherence",
        criteria="Does the response accurately reflect the policy in the provided context?",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.CONTEXT],
        threshold=0.8,
    )
    assert_test(
        test_case,
        metrics=[
            AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini"),
            HallucinationMetric(threshold=0.2, model="gpt-4o-mini"),
            policy_adherence,
        ],
    )

Register this file with GitHub Actions and agent quality will be automatically verified on every PR. If a metric score falls below its threshold, the check fails and the merge is blocked.
# .github/workflows/agent-eval.yml
name: Agent Evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install deepeval langfuse openai pytest pytest-asyncio
      - name: Run agent evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
        run: deepeval test run test_agent_eval.py

Real-Time Production Agent Tracing with Langfuse
Passing offline evaluation before deployment is not the end. Real users interact with agents in ways that test sets could never anticipate. Online evaluation's role is to track production traffic in real time and detect quality degradation.
The code below is written for an asynchronous web server environment such as FastAPI or Starlette. asyncio.create_task() requires a running event loop, so in a synchronous environment you will need another way to run the evaluation in the background, such as a worker thread or a task queue.
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from openai import AsyncOpenAI
import asyncio
import json

langfuse = Langfuse()

@observe()
async def run_agent_with_online_eval(user_id: str, query: str) -> str:
    langfuse_context.update_current_trace(user_id=user_id)
    # --- replace with your actual agent call ---
    result = f"Here is the customer support response to '{query}'."
    # ----------------------------------------
    trace_id = langfuse_context.get_current_trace_id()
    # run the quality evaluation in the background so the user response is not delayed
    asyncio.create_task(
        run_online_evaluation(trace_id, query, result)
    )
    return result

async def run_online_evaluation(trace_id: str, query: str, response: str):
    """Run the LLM-as-a-Judge evaluation in the background."""
    client = AsyncOpenAI()
    eval_prompt = f"""
User question: {query}
Agent response: {response}

Score the response quality between 0.0 and 1.0 using these weighted criteria:
- Relevance to the question (0.4)
- Factual accuracy and grounding (0.4)
- Concise, clear explanation (0.2)

Answer in JSON only: {{"score": 0.85, "reason": "brief reason"}}
"""
    eval_response = await client.chat.completions.create(
        model="gpt-4o-mini",  # use a small model to keep evaluation costs down
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
    )
    result = json.loads(eval_response.choices[0].message.content)
    langfuse.score(
        trace_id=trace_id,
        name="online-quality-score",
        value=result["score"],
        comment=result["reason"],
    )
    if result["score"] < 0.6:
        await send_alert(trace_id, result["score"], result["reason"])

async def send_alert(trace_id: str, score: float, reason: str):
    """Placeholder: wire this up to your alerting channel (Slack, PagerDuty, etc.)."""
    print(f"[ALERT] Low quality score {score:.2f} on trace {trace_id}: {reason}")

Behavioral Drift: This is a phenomenon where an agent's response patterns gradually shift over time from how it behaved at initial deployment. It's not a simple error but a subtle change, making it hard to catch. If the average quality score is slowly declining on a week-by-week basis, drift is worth suspecting.
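A simple way to put a number on that suspicion is to bucket the online-quality-score values by week and compare recent weeks against the weeks right after deployment. The sketch below assumes you can export (timestamp, score) pairs from Langfuse, for example via its API or a CSV export; the 5% tolerance and four-week minimum are arbitrary choices.
from collections import defaultdict
from datetime import datetime
from statistics import mean

def weekly_means(scored: list[tuple[datetime, float]]) -> list[tuple[str, float]]:
    # Group scores into ISO-week buckets and average each bucket
    buckets: dict[str, list[float]] = defaultdict(list)
    for ts, score in scored:
        year, week, _ = ts.isocalendar()
        buckets[f"{year}-W{week:02d}"].append(score)
    return sorted((wk, mean(vals)) for wk, vals in buckets.items())

def drift_suspected(scored: list[tuple[datetime, float]], tolerance: float = 0.05) -> bool:
    weeks = weekly_means(scored)
    if len(weeks) < 4:
        return False  # not enough history to judge a trend
    baseline = mean(score for _, score in weeks[:2])   # first two weeks after deployment
    recent = mean(score for _, score in weeks[-2:])    # most recent two weeks
    return recent < baseline * (1 - tolerance)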
Building a Dedicated Test Set for Tool Call Accuracy
The hardest problems to debug in agents are tool-call-related. When an agent calls search_orders instead of search_products, or calls the right tool but passes the wrong parameters, you cannot determine the cause from the final response alone. You can use the ToolCall from the first example to build a dedicated tool call test set.
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

# Tool call scenarios for the customer support agent
tool_test_cases = [
    LLMTestCase(
        input="Show me my orders from last month",
        actual_output="I found 3 orders from last month...",
        tools_called=[
            ToolCall(
                name="get_orders",
                input_parameters={
                    "user_id": "user-123",
                    "date_range": "last_month",
                },
            )
        ],
        expected_tools=[
            ToolCall(
                name="get_orders",
                input_parameters={
                    "user_id": "user-123",
                    "date_range": "last_month",
                },
            )
        ],
    ),
    LLMTestCase(
        input="Is this product in stock?",
        actual_output="There are currently 50 units in stock.",
        tools_called=[
            ToolCall(
                name="check_inventory",
                input_parameters={"product_id": "PROD-456"},
            )
        ],
        expected_tools=[
            ToolCall(
                name="check_inventory",
                input_parameters={"product_id": "PROD-456"},
            )
        ],
    ),
]

# ToolCorrectnessMetric is evaluated deterministically, with no LLM API cost
metric = ToolCorrectnessMetric()
for tc in tool_test_cases:
    metric.measure(tc)
    print(f"Input: {tc.input}")
    print(f"Tool accuracy: {metric.score:.2f} - {metric.reason}\n")

Pros and Cons
Advantages
| Item | Details |
|---|---|
| Early Detection of Quality Bottlenecks | Offline evaluation detects regressions before deployment, preventing production incidents |
| Data for Cost Optimization | Token usage and latency data provide a basis for prompt and model selection decisions |
| Building Trust and Explainability | Behavioral trace logs serve as an audit trail for agent outputs, useful for regulatory compliance |
| Continuous Improvement Loop | Creates a flywheel where production failures are collected and used to strengthen the test set |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Non-deterministic Nature | Agent outputs vary even for identical inputs, making simple comparisons difficult | Design probabilistic evaluations; use averages across multiple runs |
| Evaluation Cost | LLM-as-a-Judge incurs API costs; significant expense at scale | Use gpt-4o-mini for first-pass filtering, then review with a larger model |
| Benchmark-Production Gap | High benchmark scores do not guarantee successful real-world deployment | Continuously add production failure cases to the test set |
| Multi-step Tracing Complexity | Building tracing infrastructure for execution paths involving interleaved sub-agents is complex | Adopt the OpenTelemetry standard; expand tracing scope incrementally |
| Security & Privacy | Traces contain user input, creating personal data concerns | Mask sensitive data; formalize data retention policies |
| Vendor Lock-in | Deep integration with specific frameworks incurs migration costs | Maintain a standard OpenTelemetry-based layer; prefer self-hostable tools |
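For the non-determinism row above, the "averages across multiple runs" mitigation can be as simple as running the agent several times on the same input and reporting the mean metric score instead of a single sample. A minimal sketch, reusing the placeholder run_customer_support_agent from the CI example (assumed to be importable in your test module):
from statistics import mean, stdev
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

async def averaged_relevancy(query: str, context: list[str], n_runs: int = 5) -> float:
    """Run the agent n_runs times and average the relevancy score across runs."""
    metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini")
    scores = []
    for _ in range(n_runs):
        output = await run_customer_support_agent(query)  # placeholder agent from the CI example
        metric.measure(LLMTestCase(input=query, actual_output=output, context=context))
        scores.append(metric.score)
    print(f"mean={mean(scores):.2f}, stdev={stdev(scores):.2f}")
    return mean(scores)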
Survivorship Bias: Online evaluation only collects input from people who actually used the agent. It cannot capture which types of users gave up and left, or which questions were never even entered. Relying solely on online evaluation data means you may permanently miss edge cases.
The Most Common Mistakes in Practice
- Evaluating only the final output while ignoring the reasoning path. Just because an agent happened to give the right answer doesn't mean it reasoned correctly. If you don't also track the quality of intermediate steps, you accumulate successes you can't reproduce.
- Building a golden dataset once and never updating it. Real users find new ways to use agents that generate new failure patterns continuously. Without a loop that adds production failures to the test set, the test set quickly drifts away from reality.
- Treating cost and latency separately from quality. Many teams monitor only accuracy and then get a surprise infrastructure bill. As the CLEAR framework illustrates, it's important to include cost and latency in your evaluation metrics from the start.
Closing Thoughts
Making an agent trustworthy in production is as important an engineering task as building it in the first place. The good news is that Langfuse, DeepEval, and Arize Phoenix are all open source and self-hostable, so you can get started without additional cost. You don't need to upgrade to a paid plan — the basic features are sufficient to begin.
The three steps below have ordering dependencies. Tracing must come first so that online evaluation data can accumulate, and only once that data accumulates can the CI gate's golden dataset be improved against reality. It is recommended to proceed in this order.
- Start by attaching tracing. After pip install langfuse, you can begin by adding a single @observe() decorator to an existing agent function. Simply visualizing the execution path in the dashboard will already reveal problems that were previously invisible.
- Build a golden dataset of 10 cases and connect it to CI. Start with around 10 core scenarios, configure deepeval test run to execute automatically on every PR, and you have a regression-detection pipeline. The effective approach is to add one case each time a production failure occurs, gradually building a more realistic dataset.
- Make a habit of viewing cost and latency dashboards on the same screen as quality metrics. Using Langfuse or Datadog LLM Observability, you can see token costs, P99 latency, and quality scores on a single screen. Once you develop the habit of viewing all three numbers together, you can make data-driven judgments about which prompt changes reduce costs while maintaining quality.
References
- LangChain State of Agent Engineering 2026 — Statistics on agent production adoption and deployment barriers
- Evaluating AI Agents: Real-World Lessons from Amazon | AWS Blog — Practical lessons including the necessity of step-by-step reasoning logs
- Beyond Accuracy: Multi-Dimensional Framework for Evaluating Enterprise Agentic AI | arXiv — Original CLEAR framework paper; source of cost optimization figures
- AI Agent Evaluation Guide | DeepEval by Confident AI — DeepEval official agent evaluation documentation
- Agent Observability: LangSmith, Langfuse, Arize 2026 | Digital Applied — Comparison of major observability platforms
- Top 8 LLM Observability Tools | LangChain Articles — List of LLM observability tools
- How Can We Best Evaluate Agentic AI? | Brookings — Social and policy considerations for agent evaluation