Building an AI Agent Monitoring & Evaluation System: Catching Quality That Silently Breaks in Production with DeepEval and Langfuse
Getting an AI agent to run well locally is not as hard as you might think. But teams that can confidently answer "Is this agent working correctly right now?" after deploying it to production are surprisingly rare. When I deployed my first agent, I remember discovering days later that it had been calling completely wrong tools in certain cases. There were no logs, no tracing.
According to LangChain's 2026 report, 57% of organizations are already deploying agents to production. At the same time, 32% of respondents cited "output quality" as the biggest barrier to deployment. They can build it, but they can't tell if it's trustworthy.
By the end of this article, you will have a concrete method for attaching tracing to an existing agent and setting up a CI pipeline with a quality gate that runs on every PR. This is different from simple LLM response quality checks. Because agents involve multi-step reasoning, external tool calls, long-context retention, and autonomous decision-making, the evaluation system needs to be equally sophisticated. I'll break it down to the level where you can copy and use the code directly.
Core Concepts
Building evaluation infrastructure has a natural order. We'll go through what to measure (evaluation dimensions), who evaluates (LLM-as-a-Judge), where to trace (tracing), and which composite metric to use (the CLEAR framework).
Why Agent Evaluation Differs from General LLM Evaluation
General LLM evaluation is simple: feed input, check output, pass if good. Agents are different. An agent receives a goal, devises its own plan, calls multiple tools in sequence, and decides the next action based on intermediate results. If something goes wrong somewhere along this long execution path, you cannot find the cause by looking at the final output alone.
That's why agent evaluation looks at six major dimensions.
| Evaluation Dimension | Description | Why It Matters |
|---|---|---|
| Task Completion Rate | Whether the given goal was achieved accurately | The most basic pass/fail criterion |
| Reasoning Path Quality | Whether the correct path was taken, not just the correct result | Distinguishes a lucky correct answer from a logically correct one |
| Hallucination Rate | How often content is generated that differs from the facts | Critical in sensitive domains like customer service, finance, and healthcare |
| Tool Call Accuracy | Whether appropriate tools are called with appropriate arguments | Incorrect tool calls create cascading errors |
| Cost & Latency | Token usage, response time, operational cost | Teams that focused only on accuracy have ended up with a 4x infrastructure bill |
| Safety & Compliance | Bias, harmful content, policy adherence | Mandatory pre-deployment check in regulated industries |
Offline vs. Online Evaluation: Offline evaluation runs against a pre-built golden dataset (a reference set of input/output pairs representing expected behavior) before deployment, while online evaluation tracks real production traffic in real time. Running both is ideal, but in practice far more teams establish observability (online) first.
LLM-as-a-Judge: Having an LLM Evaluate Instead of a Human
Honestly, human evaluation is slow and expensive. Having people review hundreds of responses for every deployment isn't realistic. That's why LLM-as-a-Judge, using a powerful LLM (GPT-4o, Claude Opus, etc.) as the evaluator, has become the industry standard.
The basic idea is simple: you ask an evaluation LLM to "score this agent response against the following criteria." Major platforms like DeepEval, Langfuse, and Arize Phoenix all support this, and implementing it yourself isn't difficult either.
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
    ToolCorrectnessMetric,
)
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What's the weather in Seoul?",
    actual_output="It is currently clear in Seoul and the temperature is 22 degrees.",
    context=["weather_api call result: {'city': 'Seoul', 'temp': 22, 'condition': 'clear'}"],
    tools_called=[
        ToolCall(name="weather_api", input_parameters={"city": "Seoul"})
    ],
    expected_tools=[
        ToolCall(name="weather_api", input_parameters={"city": "Seoul"})
    ],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7, model="gpt-4o"),
    HallucinationMetric(threshold=0.3, model="gpt-4o"),
    ToolCorrectnessMetric(),  # evaluated deterministically, no LLM API cost
]

evaluate(test_cases=[test_case], metrics=metrics)

Self-Consistency Bias Warning: Evaluating a GPT-4o-based agent with GPT-4o can inflate scores. Models from the same company tend to rate their own outputs favorably. Whenever possible, use a model from a different company as the evaluator. For a GPT-4o agent, evaluate with Claude Opus; for a Claude agent, evaluate with GPT-4o.
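In DeepEval, switching to a cross-vendor judge means handing the metric a custom evaluator model. The sketch below wraps the Anthropic SDK in DeepEval's DeepEvalBaseLLM interface; it assumes the anthropic package is installed and ANTHROPIC_API_KEY is set, and the ClaudeJudge class and model name are illustrative choices, not anything DeepEval ships.
from anthropic import Anthropic, AsyncAnthropic
from deepeval.models import DeepEvalBaseLLM
from deepeval.metrics import AnswerRelevancyMetric

class ClaudeJudge(DeepEvalBaseLLM):
    """Minimal cross-vendor judge: an Anthropic model evaluating an OpenAI-based agent."""

    def __init__(self, model_name: str = "claude-3-5-sonnet-latest"):
        self.model_name = model_name

    def load_model(self):
        return Anthropic()

    def get_model_name(self) -> str:
        return self.model_name

    def generate(self, prompt: str) -> str:
        resp = self.load_model().messages.create(
            model=self.model_name,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text

    async def a_generate(self, prompt: str) -> str:
        resp = await AsyncAnthropic().messages.create(
            model=self.model_name,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text

# Any LLM-as-a-Judge metric can now use the cross-vendor judge instead of the default model
relevancy = AnswerRelevancyMetric(threshold=0.7, model=ClaudeJudge())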
Tracing Agent Execution Paths with OpenTelemetry
The most important thing in agent debugging is having a trace — the complete execution path a single request travels from start to finish — that shows "what happened, in what order." This is one of the lessons Amazon learned from building production agents: "Debugging is impossible without step-by-step reasoning logs."
OpenTelemetry is the industry standard for collecting traces in distributed systems. In agent tracing, a trace is organized into three layers of spans (span — the individual unit of work that makes up a trace):
- Root span: The entire user request (agent execution start to finish)
- Reasoning span: Individual LLM call units (records prompt, token count, and latency)
- Tool span: External API or function call units (records input parameters, return values, and duration)
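To make that hierarchy concrete before any vendor SDK enters the picture, here is a minimal sketch using the plain OpenTelemetry Python SDK; the span names, attributes, and placeholder LLM/tool calls are all illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print spans to stdout; in production you would export to a collector instead
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

def handle_request(user_query: str) -> str:
    with tracer.start_as_current_span("agent-request"):                      # root span
        with tracer.start_as_current_span("llm.plan") as llm_span:           # reasoning span
            llm_span.set_attribute("llm.model", "gpt-4o")
            plan = "call weather_api"                                         # placeholder LLM call
        with tracer.start_as_current_span("tool.weather_api") as tool_span:  # tool span
            tool_span.set_attribute("tool.input", '{"city": "Seoul"}')
            result = {"temp": 22}                                             # placeholder tool call
        return f"It is currently {result['temp']} degrees in Seoul."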
Langfuse lets you apply this hierarchy with a single Python decorator.
from langfuse.decorators import observe, langfuse_context
from langfuse.openai import openai  # drop-in wrapper around the OpenAI client

@observe()  # tool span: each call to this function is recorded as a nested span
async def search_knowledge_base(query: str) -> str:
    # --- replace with your actual knowledge-base search ---
    return "Refunds are available within 30 days of purchase."

@observe()  # root span: this entire function is recorded as one trace
async def run_agent(user_query: str) -> str:
    langfuse_context.update_current_trace(
        name="customer-support-agent",
        user_id="user-123",
        tags=["production", "v2.1"],
    )
    # the tool call is recorded as its own span via the decorator above
    kb_results = await search_knowledge_base(user_query)
    # reasoning span: token counts and cost are recorded automatically for the LLM call
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"You are a customer support agent. Knowledge base results: {kb_results}"},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content

With this setup, the Langfuse dashboard gives you a clear view for each user request: which tools the agent called and in what order, how many tokens were used at each step, and where latency occurred.
The CLEAR Framework: Focusing Only on Accuracy Leads to Failure
I used to think "as long as the agent gives the right answer, that's enough" — but when you actually operate one, there's far more to care about beyond accuracy. According to the arXiv paper "Beyond Accuracy: Multi-Dimensional Framework for Evaluating Enterprise Agentic AI," agents optimized for accuracy alone can cost 4.4–10.8x more than cost-aware alternatives.
The CLEAR framework is the multi-dimensional evaluation approach proposed in that paper for enterprise environments.
| Item | Meaning | How to Measure |
|---|---|---|
| Cost | Token costs, infrastructure costs | Average cost per request, total monthly cost |
| Latency | Response time | P50/P95/P99 latency distribution |
| Efficacy | Goal achievement rate | Task completion rate, user satisfaction |
| Assurance | Safety, compliance | Policy violation rate, bias score |
| Reliability | Stability, consistency | Error rate, response consistency for identical inputs |
All five items must be viewed together on one screen. That is the key.
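What that one screen aggregates can be as simple as a per-period summary object. The sketch below assumes you can already export per-request cost, latency, outcome, policy, and error fields from your tracing backend; every field and function name here is illustrative.
from dataclasses import dataclass
from statistics import quantiles, mean

@dataclass
class ClearReport:
    cost_per_request: float       # Cost: average USD per request
    latency_p95_ms: float         # Latency: P95 in milliseconds
    task_completion_rate: float   # Efficacy: share of requests that achieved the goal
    policy_violation_rate: float  # Assurance: share of requests flagged by policy checks
    error_rate: float             # Reliability: share of requests that errored

def build_report(requests: list[dict]) -> ClearReport:
    latencies = [r["latency_ms"] for r in requests]
    return ClearReport(
        cost_per_request=mean(r["cost_usd"] for r in requests),
        latency_p95_ms=quantiles(latencies, n=20)[18],  # 95th percentile cut point
        task_completion_rate=mean(r["completed"] for r in requests),
        policy_violation_rate=mean(r["violation"] for r in requests),
        error_rate=mean(r["error"] for r in requests),
    )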
Practical Application
The three examples below all use the same customer support agent as the basis. The sequence is: first build the CI quality gate with DeepEval, then attach production tracing with Langfuse, and finally add tool call accuracy evaluation. Following this flow in order produces a complete evaluation system for a single agent.
Note: All examples below are written for Python. If you are running a JavaScript/TypeScript agent, it is recommended to consult the official JS SDK documentation for each tool alongside this guide.
Setting Up an Agent Quality Gate in CI/CD with DeepEval
One reason "pre-deployment verification" is harder than it sounds is the question of how to integrate LLM evaluation into your existing test infrastructure. DeepEval bills itself as "pytest for LLMs," and it actually uses pytest syntax directly — making it very easy to attach to an existing CI/CD pipeline.
# test_agent_eval.py
import pytest
from deepeval import assert_test
from deepeval.metrics import (
    AnswerRelevancyMetric,
    HallucinationMetric,
    GEval,
)
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# --- replace with your actual agent module ---
async def run_customer_support_agent(query: str) -> str:
    return "A full refund is available within 30 days of purchase if you bring your receipt."
# ----------------------------------------

# Golden dataset: core scenarios defined as input-context-expected-output triples
TEST_CASES = [
    {
        "input": "What is your refund policy?",
        "context": ["Full refund available within 30 days with a receipt"],
        "expected_output": "Refunds are available within 30 days of purchase",
    },
    {
        "input": "How do I cancel an order?",
        "context": ["Orders can be cancelled directly in the app before shipping; after shipping, the return process applies"],
        "expected_output": "You can cancel in the app before the order ships",
    },
]

@pytest.mark.parametrize("case", TEST_CASES)
@pytest.mark.asyncio
async def test_customer_support_agent(case):
    actual_output = await run_customer_support_agent(case["input"])
    test_case = LLMTestCase(
        input=case["input"],
        actual_output=actual_output,
        context=case["context"],
        expected_output=case["expected_output"],
    )
    # Custom evaluation criteria can be defined in natural language,
    # which surprised me the first time I wrote one
    policy_adherence = GEval(
        name="PolicyAdherence",
        criteria="Does the response accurately reflect the policy in the provided context?",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.CONTEXT],
        threshold=0.8,
    )
    assert_test(
        test_case,
        metrics=[
            AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini"),
            HallucinationMetric(threshold=0.2, model="gpt-4o-mini"),
            policy_adherence,
        ],
    )

Register this file with GitHub Actions and agent quality will be automatically verified on every PR. If a metric score falls below its threshold, the check fails and the merge is blocked.
# .github/workflows/agent-eval.yml
name: Agent Evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install deepeval langfuse openai pytest pytest-asyncio
      - name: Run agent evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DEEPEVAL_API_KEY: ${{ secrets.DEEPEVAL_API_KEY }}
        run: deepeval test run test_agent_eval.py

Real-Time Production Agent Tracing with Langfuse
Passing offline evaluation before deployment is not the end. Real users interact with agents in ways that test sets could never anticipate. Online evaluation's role is to track production traffic in real time and detect quality degradation.
The code below is written for an asynchronous web server environment such as FastAPI or Starlette. asyncio.create_task() requires a running event loop, so in a synchronous environment you will need another way to run the evaluation in the background, such as a worker thread or a task queue.
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from openai import AsyncOpenAI
import asyncio
import json

langfuse = Langfuse()

@observe()
async def run_agent_with_online_eval(user_id: str, query: str) -> str:
    langfuse_context.update_current_trace(user_id=user_id)
    # --- replace with your actual agent call ---
    result = f"Here is the customer support response to '{query}'."
    # ----------------------------------------
    trace_id = langfuse_context.get_current_trace_id()
    # run the quality evaluation in the background so the user response is not delayed
    asyncio.create_task(
        run_online_evaluation(trace_id, query, result)
    )
    return result

async def run_online_evaluation(trace_id: str, query: str, response: str):
    """Run the LLM-as-a-Judge evaluation in the background."""
    client = AsyncOpenAI()
    eval_prompt = f"""
User question: {query}
Agent response: {response}

Score the response quality between 0.0 and 1.0 using these weighted criteria:
- Relevance to the question (0.4)
- Factual accuracy and grounding (0.4)
- Concise, clear explanation (0.2)

Answer in JSON only: {{"score": 0.85, "reason": "brief reason"}}
"""
    eval_response = await client.chat.completions.create(
        model="gpt-4o-mini",  # use a small model to keep evaluation costs down
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"},
    )
    result = json.loads(eval_response.choices[0].message.content)
    langfuse.score(
        trace_id=trace_id,
        name="online-quality-score",
        value=result["score"],
        comment=result["reason"],
    )
    if result["score"] < 0.6:
        await send_alert(trace_id, result["score"], result["reason"])

async def send_alert(trace_id: str, score: float, reason: str):
    """Placeholder: wire this up to your alerting channel (Slack, PagerDuty, etc.)."""
    print(f"[ALERT] Low quality score {score:.2f} on trace {trace_id}: {reason}")

Behavioral Drift: This is a phenomenon where an agent's response patterns gradually shift over time from how it behaved at initial deployment. It's not a simple error but a subtle change, making it hard to catch. If the average quality score is slowly declining on a week-by-week basis, drift is worth suspecting.
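A simple way to put a number on that suspicion is to bucket the online-quality-score values by week and compare recent weeks against the weeks right after deployment. The sketch below assumes you can export (timestamp, score) pairs from Langfuse, for example via its API or a CSV export; the 5% tolerance and four-week minimum are arbitrary choices.
from collections import defaultdict
from datetime import datetime
from statistics import mean

def weekly_means(scored: list[tuple[datetime, float]]) -> list[tuple[str, float]]:
    # Group scores into ISO-week buckets and average each bucket
    buckets: dict[str, list[float]] = defaultdict(list)
    for ts, score in scored:
        year, week, _ = ts.isocalendar()
        buckets[f"{year}-W{week:02d}"].append(score)
    return sorted((wk, mean(vals)) for wk, vals in buckets.items())

def drift_suspected(scored: list[tuple[datetime, float]], tolerance: float = 0.05) -> bool:
    weeks = weekly_means(scored)
    if len(weeks) < 4:
        return False  # not enough history to judge a trend
    baseline = mean(score for _, score in weeks[:2])   # first two weeks after deployment
    recent = mean(score for _, score in weeks[-2:])    # most recent two weeks
    return recent < baseline * (1 - tolerance)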
Building a Dedicated Test Set for Tool Call Accuracy
The hardest problems to debug in agents are tool-call-related. When an agent calls search_orders instead of search_products, or calls the right tool but passes the wrong parameters, you cannot determine the cause from the final response alone. You can use the ToolCall from the first example to build a dedicated tool call test set.
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

# Tool call scenarios for the customer support agent
tool_test_cases = [
    LLMTestCase(
        input="Show me my orders from last month",
        actual_output="I found 3 orders from last month...",
        tools_called=[
            ToolCall(
                name="get_orders",
                input_parameters={
                    "user_id": "user-123",
                    "date_range": "last_month",
                },
            )
        ],
        expected_tools=[
            ToolCall(
                name="get_orders",
                input_parameters={
                    "user_id": "user-123",
                    "date_range": "last_month",
                },
            )
        ],
    ),
    LLMTestCase(
        input="Is this product in stock?",
        actual_output="There are currently 50 units in stock.",
        tools_called=[
            ToolCall(
                name="check_inventory",
                input_parameters={"product_id": "PROD-456"},
            )
        ],
        expected_tools=[
            ToolCall(
                name="check_inventory",
                input_parameters={"product_id": "PROD-456"},
            )
        ],
    ),
]

# ToolCorrectnessMetric is evaluated deterministically, with no LLM API cost
metric = ToolCorrectnessMetric()
for tc in tool_test_cases:
    metric.measure(tc)
    print(f"Input: {tc.input}")
    print(f"Tool accuracy: {metric.score:.2f} - {metric.reason}\n")

Pros and Cons
Advantages
| Item | Details |
|---|---|
| Early Detection of Quality Bottlenecks | Offline evaluation detects regressions before deployment, preventing production incidents |
| Data for Cost Optimization | Token usage and latency data provide a basis for prompt and model selection decisions |
| Building Trust and Explainability | Behavioral trace logs serve as an audit trail for agent outputs, useful for regulatory compliance |
| Continuous Improvement Loop | Creates a flywheel where production failures are collected and used to strengthen the test set |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Non-deterministic Nature | Agent outputs vary even for identical inputs, making simple comparisons difficult | Design probabilistic evaluations; use averages across multiple runs |
| Evaluation Cost | LLM-as-a-Judge incurs API costs; significant expense at scale | Use gpt-4o-mini for first-pass filtering, then review with a larger model |
| Benchmark-Production Gap | High benchmark scores do not guarantee successful real-world deployment | Continuously add production failure cases to the test set |
| Multi-step Tracing Complexity | Building tracing infrastructure for execution paths involving interleaved sub-agents is complex | Adopt the OpenTelemetry standard; expand tracing scope incrementally |
| Security & Privacy | Traces contain user input, creating personal data concerns | Mask sensitive data; formalize data retention policies |
| Vendor Lock-in | Deep integration with specific frameworks incurs migration costs | Maintain a standard OpenTelemetry-based layer; prefer self-hostable tools |
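For the non-determinism row above, the "averages across multiple runs" mitigation can be as simple as running the agent several times on the same input and reporting the mean metric score instead of a single sample. A minimal sketch, reusing the placeholder run_customer_support_agent from the CI example (assumed to be importable in your test module):
from statistics import mean, stdev
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

async def averaged_relevancy(query: str, context: list[str], n_runs: int = 5) -> float:
    """Run the agent n_runs times and average the relevancy score across runs."""
    metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini")
    scores = []
    for _ in range(n_runs):
        output = await run_customer_support_agent(query)  # placeholder agent from the CI example
        metric.measure(LLMTestCase(input=query, actual_output=output, context=context))
        scores.append(metric.score)
    print(f"mean={mean(scores):.2f}, stdev={stdev(scores):.2f}")
    return mean(scores)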
Survivorship Bias: Online evaluation only collects input from people who actually used the agent. It cannot capture which types of users gave up and left, or which questions were never even entered. Relying solely on online evaluation data means you may permanently miss edge cases.
The Most Common Mistakes in Practice
- Evaluating only the final output while ignoring the reasoning path. Just because an agent happened to give the right answer doesn't mean it reasoned correctly. If you don't also track the quality of intermediate steps, you accumulate successes you can't reproduce.
- Building a golden dataset once and never updating it. Real users find new ways to use agents that generate new failure patterns continuously. Without a loop that adds production failures to the test set, the test set quickly drifts away from reality.
- Treating cost and latency separately from quality. Many teams monitor only accuracy and then get a surprise infrastructure bill. As the CLEAR framework illustrates, it's important to include cost and latency in your evaluation metrics from the start.
Closing Thoughts
Making an agent trustworthy in production is as important an engineering task as building it in the first place. The good news is that Langfuse, DeepEval, and Arize Phoenix are all open source and self-hostable, so you can get started without additional cost. You don't need to upgrade to a paid plan — the basic features are sufficient to begin.
The three steps below have ordering dependencies. Tracing must come first so that online evaluation data can accumulate, and only once that data accumulates can the CI gate's golden dataset be improved against reality. It is recommended to proceed in this order.
- Start by attaching tracing. After pip install langfuse, you can begin by adding a single @observe() decorator to an existing agent function. Simply visualizing the execution path in the dashboard will already reveal problems that were previously invisible.
- Build a golden dataset of 10 cases and connect it to CI. Start with around 10 core scenarios, configure deepeval test run to execute automatically on every PR, and you have a regression-detection pipeline. The effective approach is to add one case each time a production failure occurs, gradually building a more realistic dataset.
- Make a habit of viewing cost and latency dashboards on the same screen as quality metrics. Using Langfuse or Datadog LLM Observability, you can see token costs, P99 latency, and quality scores on a single screen. Once you develop the habit of viewing all three numbers together, you can make data-driven judgments about which prompt changes reduce costs while maintaining quality.
References
- LangChain State of Agent Engineering 2026 — Statistics on agent production adoption and deployment barriers
- Evaluating AI Agents: Real-World Lessons from Amazon | AWS Blog — Practical lessons including the necessity of step-by-step reasoning logs
- Beyond Accuracy: Multi-Dimensional Framework for Evaluating Enterprise Agentic AI | arXiv — Original CLEAR framework paper; source of cost optimization figures
- AI Agent Evaluation Guide | DeepEval by Confident AI — DeepEval official agent evaluation documentation
- Agent Observability: LangSmith, Langfuse, Arize 2026 | Digital Applied — Comparison of major observability platforms
- Top 8 LLM Observability Tools | LangChain Articles — List of LLM observability tools
- How Can We Best Evaluate Agentic AI? | Brookings — Social and policy considerations for agent evaluation