LLM Agent Backend Design Patterns: Surviving Production with ReAct, LangGraph, and Temporal
Honestly, when I first designed an AI agent backend, I approached it the same way I would a REST API — and paid for it. I thought, "Can't I just make one endpoint that calls the LLM?" But once I pushed it to production, timeouts, state management errors, and unpredictable execution flows all blew up at once. Concretely, here's what happened: a 30-second agent loop got caught by the API gateway timeout, so the client received an error while the Worker kept running; the session context stored in Redis was wiped on a single Worker restart, making it look to users as if the agent had completely forgotten the previous conversation. Tracking down the cause was its own problem — there was no way to trace which path the LLM had taken, making debugging extremely painful.
In this post, we'll look at how the four core components of an agent backend (LLM Core, Tool Calling, Memory Layer, and Orchestration) must fit together to build a system that survives production, and when to choose which tools — with real code. By the end, you'll be able to set up a structure that runs agents without timeouts, state management that recovers from mid-execution failures, and an environment where you can visually observe the execution flow.
If you have existing backend experience, you can follow along just fine. In fact, that experience serves as a great anchor for understanding "where the traditional approach breaks down."
Core Concepts
Why an Agent Backend Is Different from a Traditional Backend
A typical backend executes fixed business logic when a request arrives and returns a response. The execution path is predetermined, and the same input produces the same output. This makes unit testing easy and monitoring simple.
An agent backend is fundamentally different. The LLM decides at runtime "what to do next." Which tools to call, how many times to iterate, and when to stop all depend on the LLM's reasoning. The moment you accept this, you realize the design philosophy must change. For example, the question "what happens if this function fails?" becomes "how do we recover when this execution path goes in an unexpected direction?"
| Component | Role |
|---|---|
| LLM Core | Reasoning, planning, deciding the next action |
| Tool Calling | Connecting to the real world via external APIs, databases, code execution, etc. |
| Memory Layer | Short-term (context window) and long-term (vector DB, KV Store) memory |
| Orchestration | Multi-agent routing, state machines, failure recovery |
The ReAct Pattern: The De Facto Standard Execution Loop
No discussion of agent execution flow is complete without the ReAct pattern. It's a structure where the LLM cycles through Thought → Action → Observation, and most production systems today are built on this pattern.
When I first implemented this pattern in code, I fell into a trap: I only checked finish_reason == "stop" and didn't set an iteration limit, and ended up in a situation where the LLM couldn't break out of the loop. In practice, always set max_iterations.
```python
# agent/core.py — simplified conceptual structure of a ReAct loop
class MaxIterationsExceeded(Exception):
    pass

async def react_loop(
    user_input: str,
    tools: list[Tool],
    max_iterations: int = 10
) -> str:
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_iterations):
        # Thought: the LLM reasons about the next action
        response = await llm.invoke(messages, tools=tools)
        if response.finish_reason == "stop":
            return response.content
        # The assistant message carrying the tool calls must be appended
        # before the tool-role messages that answer it
        messages.append(response)
        # Action: invoke the requested tools
        for call in response.tool_calls:
            # Observation: append each tool's result to the context
            result = await execute_tool(call.name, call.args)
            messages.append({"role": "tool", "content": result})
    raise MaxIterationsExceeded(f"Exceeded {max_iterations} iterations; forcing termination")
```

ReAct (Reasoning + Acting): An agent execution pattern where the LLM doesn't merely generate text, but instead reaches its goal by repeatedly cycling through "think → act → observe results." Proposed by a Google research team in 2022, most modern agent frameworks are built on this foundation.
Memory Layer: Design Short-Term and Long-Term Separately
I was confused about this at first too, but the industry standard is to design agent memory in two distinct layers.
- Short-term memory: The current conversation history inside the LLM's context window. Managed per session with Redis for sub-millisecond read speeds.
- Long-term memory: Domain knowledge, user preferences, and past decisions that must persist beyond a session. Embedded into a vector DB like pgvector or Pinecone and retrieved via similarity search. In practice, you need to co-design the choice of embedding model (e.g., `text-embedding-3-small`), the document chunking strategy (a 512-token sliding window is common), and the position where retrieved results are injected into the context window — typically at the front, before the user message.
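To make the chunking strategy concrete, here's a minimal sliding-window chunker. It's a sketch: whitespace-split words stand in for tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the 512/64 sizes are just the illustrative defaults mentioned above.

```python
def chunk_sliding_window(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size tokens.

    Words approximate tokens here; swap in the embedding model's
    tokenizer before using this for real indexing.
    """
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # each window starts `overlap` words before the previous one ended
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk, at the cost of slightly more storage.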
```python
# agent/memory.py
class AgentMemory:
    def __init__(self):
        # Real deployments need connection settings, e.g.:
        # Redis(host="localhost", port=6379, decode_responses=True)
        self.short_term = Redis()      # session context, fast reads
        self.long_term = PGVector()    # embedding-based long-term memory
        self.checkpoint = Postgres()   # durable store for failure recovery

    async def retrieve_context(self, query: str, session_id: str) -> dict:
        # Short-term: recent messages from the current session
        recent = await self.short_term.get(f"session:{session_id}")
        # Long-term: retrieve the k=5 past memories semantically closest to the query.
        # Results are injected right after the system prompt, before the user message.
        relevant = await self.long_term.similarity_search(query, k=5)
        return {"recent": recent, "relevant": relevant}
```

MCP: The New Standard for Tool Integration
The Model Context Protocol (MCP), published by Anthropic in November 2024, has rapidly become the standard for agent tool integration starting in 2025. It has since been donated to the Linux Foundation's AAIF, where it is evolving as an open standard.
The key difference from traditional function calling is that tool definitions are not hardcoded inside the LLM host code. An MCP server runs as an independent process and communicates with the LLM host via standard endpoints like tools/list and tools/call. This means the same MCP server can be connected to Claude, GPT-4o, Gemini, or any other model, and the tool implementation can be managed completely separately from the LLM logic.
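To make the wire format concrete, here's a sketch of the JSON-RPC 2.0 messages behind those two endpoints. The field names reflect my reading of the MCP spec (`inputSchema` as JSON Schema, `tools/call` taking `name` plus `arguments`); verify against the official schema before relying on them.

```python
import json

# What the host sends to discover tools, and a typical reply
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [{
            "name": "web_search",
            "description": "Search the web for up-to-date information.",
            "inputSchema": {  # JSON Schema describing the tool's arguments
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        }]
    },
}

# Invoking a tool: the host passes the tool name plus its arguments
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "web_search", "arguments": {"query": "MCP spec"}},
}

# Everything on the wire is plain JSON, which is exactly why any LLM host can speak it
print(json.dumps(call_request)[:40])
```

Because the host discovers tools at runtime via `tools/list`, adding a tool to the server requires no change to the LLM-side code.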
```python
# mcp_server/search_server.py — a simple MCP server built with FastMCP
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("search-server")

@mcp.tool()
async def web_search(query: str) -> str:
    """Search the web for up-to-date information."""
    results = await search_api.search(query)  # search_api: your search client of choice
    return results.format_as_text()

if __name__ == "__main__":
    mcp.run()  # serves tools/list and tools/call (stdio transport by default)
```

MCP (Model Context Protocol): A protocol that standardizes how LLMs interact with external tools and data sources. Just as REST APIs standardized communication between web services, MCP standardizes the connection between LLMs and tools.
Practical Application
Before diving into three examples, it helps to first discuss which tool to choose in which situation.
| Situation | Recommended Choice |
|---|---|
| Simple tasks, quick prototype | FastAPI + asyncio.Queue (or BullMQ) |
| Conditional branching, retries, multi-agent | LangGraph |
| Dozens of steps, multi-day execution, zero tolerance for data loss | Temporal |
Escaping HTTP Timeouts with an Async Queue
Agent tasks can take anywhere from tens of seconds to several minutes. Handling this inside a synchronous HTTP handler will inevitably cause API gateway timeouts and client disconnection issues. The pattern that has proven itself in production is queue-based async decoupling: the HTTP endpoint immediately returns only a task_id, and the actual agent loop runs in a separate Worker process.
```python
# agent/api.py — FastAPI endpoints
import json
import uuid

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from redis.asyncio import Redis

app = FastAPI()
redis = Redis(host="localhost", port=6379, decode_responses=True)

class TaskRequest(BaseModel):
    input: str
    tools: list[str] = []

def generate_task_id() -> str:
    return uuid.uuid4().hex

@app.post("/agent/task")
async def submit_task(request: TaskRequest) -> dict:
    task_id = generate_task_id()
    # Push to the queue and return the task_id immediately — the Worker runs the agent
    await redis.lpush("agent:queue", json.dumps({
        "task_id": task_id,
        "input": request.input,
        "tools": request.tools
    }))
    return {"task_id": task_id, "status": "queued"}

@app.get("/agent/task/{task_id}/stream")
async def stream_result(task_id: str):
    # Stream agent progress to the client over SSE.
    # Production also needs a timeout (e.g., 30s) and disconnect handling.
    async def event_generator():
        pubsub = redis.pubsub()
        await pubsub.subscribe(f"agent:result:{task_id}")
        try:
            async for message in pubsub.listen():
                if message["type"] == "message":
                    yield f"data: {message['data']}\n\n"
                    if json.loads(message["data"]).get("done"):
                        break
        finally:
            await pubsub.unsubscribe(f"agent:result:{task_id}")
    return StreamingResponse(event_generator(), media_type="text/event-stream")
```

```python
# agent/worker.py — the process that runs the actual agent loop
async def agent_worker():
    while True:
        # brpop blocks while the queue is empty and pops as soon as a task arrives
        _, task_data = await redis.brpop("agent:queue")
        task = json.loads(task_data)
        async for step_result in run_agent_loop(task):
            await redis.publish(
                f"agent:result:{task['task_id']}",
                json.dumps(step_result)
            )
```

| Component | Role | Technology Choice |
|---|---|---|
| HTTP Endpoint | Accepts task, returns task_id | FastAPI / NestJS |
| Job Queue | Async task buffering | Redis / SQS |
| Worker | Executes the actual agent loop | Separate process/container |
| Streaming | Delivers progress to the client | SSE / WebSocket |
Implementing Conditional Branching Multi-Agent Orchestration with LangGraph
Once the simple queue pattern is in place, the next wall you hit is complex branching logic. You often need flows that don't just "start B when A finishes," but instead route to different agents based on conditions and retry when quality falls below a threshold. LangGraph's graph-based state machine solves this problem cleanly.
```python
# agent/graph.py — LangGraph multi-agent pipeline
import operator
from typing import TypedDict, Annotated

from langgraph.graph import StateGraph, END

# TypedDict declares the state schema; Annotated[list, operator.add] is a reducer
# declaration that appends each node's return value to the list
class AgentState(TypedDict):
    query: str
    retrieved_docs: list[str]
    draft_answer: str
    quality_score: float
    messages: Annotated[list, operator.add]

async def retriever_node(state: AgentState) -> dict:
    docs = await vector_db.similarity_search(state["query"])
    return {"retrieved_docs": docs}

async def summarizer_node(state: AgentState) -> dict:
    draft = await llm.invoke(
        f"Summarize the following documents into an answer:\n{state['retrieved_docs']}"
    )
    return {"draft_answer": draft}

async def critic_node(state: AgentState) -> dict:
    # LLM-as-judge pattern: a separate LLM call scores answer quality from 0 to 1.
    # The prompt spells out the criteria (relevance, completeness, factual accuracy)
    # and the response is parsed as JSON.
    score = await evaluate_quality(state["draft_answer"], state["query"])
    return {"quality_score": score}

def route_after_critic(state: AgentState) -> str:
    if state["quality_score"] < 0.7:
        return "retriever"  # quality below threshold → re-query
    return END

workflow = StateGraph(AgentState)
workflow.add_node("retriever", retriever_node)
workflow.add_node("summarizer", summarizer_node)
workflow.add_node("critic", critic_node)
workflow.set_entry_point("retriever")
workflow.add_edge("retriever", "summarizer")
workflow.add_edge("summarizer", "critic")
workflow.add_conditional_edges("critic", route_after_critic)
# redis_checkpointer: persists each node's output to Redis so a run can resume
# after a mid-execution failure
agent = workflow.compile(checkpointer=redis_checkpointer)
```

Checkpointing: Recording each step of agent execution to a persistent store. If a failure occurs mid-run, execution can resume from the last checkpoint, avoiding the cost of restarting a long task from scratch.
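It can help to see what the graph abstraction buys you by hand-rolling the same retriever → summarizer → critic → conditional-retry flow in plain Python. The stubs below are mine (each retry pretends to fetch one more document so quality improves); note how a retry bound is needed here for the same reason `max_iterations` was in the ReAct loop:

```python
END = "__end__"

def retriever(state: dict) -> dict:
    # Stub: pretend each retry fetches one additional document
    return {**state, "retrieved_docs": state.get("retrieved_docs", []) + ["doc"]}

def summarizer(state: dict) -> dict:
    return {**state, "draft_answer": f"summary of {len(state['retrieved_docs'])} docs"}

def critic(state: dict) -> dict:
    # Stub scoring: quality rises as more documents accumulate
    return {**state, "quality_score": 0.4 * len(state["retrieved_docs"])}

def route_after_critic(state: dict) -> str:
    return "retriever" if state["quality_score"] < 0.7 else END

def run_pipeline(query: str, max_retries: int = 3) -> dict:
    state = {"query": query}
    for _ in range(max_retries):
        for node in (retriever, summarizer, critic):
            state = node(state)
        if route_after_critic(state) == END:
            return state
    return state  # best effort after exhausting retries
```

Everything LangGraph adds on top of this loop — typed state with reducers, per-node checkpointing, visualizable topology — is exactly what this hand-rolled version lacks once the graph grows past a handful of nodes.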
Guaranteeing Zero Data Loss with Temporal
Simple queues like BullMQ can lose in-progress tasks if the Worker process dies. For agent tasks that consist of dozens of steps, run over multiple days, or where losing intermediate results is absolutely unacceptable, a durable execution engine like Temporal is the right fit. If LangGraph solves "complexity of logic," Temporal solves "reliability of execution."
```python
# agent/temporal_workflow.py
from dataclasses import dataclass
from datetime import timedelta

from temporalio import workflow, activity
from temporalio.common import RetryPolicy

@dataclass
class LLMResponse:
    content: str
    is_final: bool
    tool_name: str | None = None
    tool_args: dict | None = None

@activity.defn
async def call_llm(prompt: str) -> LLMResponse:
    raw = await llm.invoke(prompt)
    # Parse the raw LLM output into a structured LLMResponse
    return parse_llm_response(raw)

@activity.defn
async def execute_tool(tool_name: str, args: dict) -> str:
    return await tool_registry.execute(tool_name, args)

@workflow.defn
class AgentWorkflow:
    @workflow.run
    async def run(self, user_input: str) -> str:
        messages = [user_input]
        for _ in range(10):  # guard against infinite loops
            # Each Activity retries automatically on failure,
            # and its result is persisted durably
            response: LLMResponse = await workflow.execute_activity(
                call_llm,
                args=[str(messages)],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=3)
            )
            if response.is_final:
                return response.content
            tool_result = await workflow.execute_activity(
                execute_tool,
                args=[response.tool_name, response.tool_args],
                start_to_close_timeout=timedelta(seconds=60)
            )
            messages.append(tool_result)
        return "Maximum iterations exceeded"
```

Pros and Cons
Advantages
| Item | Description |
|---|---|
| Flexibility | Without fixed logic, the LLM dynamically forms tool combinations and plans at runtime |
| Scalability | Multi-agent parallel processing of subtasks handles complex workflows |
| Knowledge utilization | RAG + long-term memory continuously accumulates domain expertise across sessions |
| Complex workflows | Capable of autonomously handling hundreds of steps with dependencies |
Disadvantages and Caveats
| Item | Description | Mitigation |
|---|---|---|
| Non-deterministic behavior | The same input can generate different execution paths; traditional unit testing is not applicable | Build a separate Eval pipeline with tools like LangSmith or Arize Phoenix |
| Lack of observability | Per LangChain surveys, fewer than 1 in 3 implementers are satisfied | Combine OpenTelemetry with a dedicated platform (LangSmith, Maxim AI) |
| Orchestration complexity | Multi-agent coordination overhead can become a bigger bottleneck than LLM calls | Measure per-step performance, minimize agent count, split only where necessary |
| Autonomous completion rate limits | Per Carnegie Mellon benchmarks, even top-performing agents autonomously complete only 30–35% of multi-step tasks | Design with Human-in-the-loop as a baseline assumption |
| State management complexity | Cross-session memory consistency, checkpoint recovery on failure | Redis (speed) + Postgres (durability) hybrid design |
| Security | Risks of prompt injection and tool misuse | ABAC access control, Zero Retention techniques, add a guardrail layer |
The table alone can feel abstract, but the issue I personally suffered through most was "lack of observability." Debugging with no way to see why the agent called the wrong tool, or what context led to a bad decision, was like wrestling with a black box. There are moments when setting a single environment variable, `LANGCHAIN_TRACING_V2=true`, buys you more than any amount of debugging code.
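Even before adopting a tracing platform, a hand-rolled trace log answers the basic question "which tool did the agent call, with what arguments, and how long did it take?" This is a sketch, not a library API — `trace_log`, `traced`, and the sample tool are all illustrative names:

```python
import functools
import time
from typing import Any, Callable

trace_log: list[dict[str, Any]] = []

def traced(tool_fn: Callable) -> Callable:
    """Record every tool invocation: name, args, result, and latency."""
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = tool_fn(*args, **kwargs)
        trace_log.append({
            "tool": tool_fn.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": result,
            "latency_ms": (time.perf_counter() - start) * 1000,
        })
        return result
    return wrapper

@traced
def lookup_order(order_id: str) -> str:
    return f"order {order_id}: shipped"  # stand-in for a real tool call

lookup_order("A-123")
print(trace_log[0]["tool"])  # lookup_order
```

In a real system you'd emit these records as OpenTelemetry spans rather than appending to a list, but the shape of the data — tool name, inputs, output, latency — is the same.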
ABAC (Attribute-Based Access Control): A method of determining access permissions by combining attributes of the user, resource, and environment. It allows fine-grained control over which tools an agent can access and with what scope, making it a strong fit as a security layer for agent backends.
Zero Retention: A technique for configuring LLM API calls so that input/output data is not stored on the provider's servers. This is an essential consideration whenever sensitive enterprise data passes through an agent.
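As a taste of what a guardrail layer does, here's a deliberately naive input filter that blocks obvious prompt-injection phrasing before it reaches the LLM. The pattern list is illustrative only; production guardrails use trained classifiers and policy engines, since regexes are trivially bypassed.

```python
import re

# Illustrative patterns only — real injection attempts are far more varied
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"reveal .*system prompt",
    r"you are now .*unrestricted",
]

def guard_input(user_input: str) -> str:
    """Raise on inputs matching known injection patterns; pass the rest through."""
    lowered = user_input.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            raise ValueError(f"blocked: input matches injection pattern {pattern!r}")
    return user_input

guard_input("What's the weather in Seoul?")  # passes through unchanged
```

The same checkpoint is also where ABAC enforcement naturally lives: before a tool call executes, verify that this agent, acting for this user, is allowed to touch this resource.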
The Most Common Production Mistakes
- Handling the agent loop with synchronous HTTP — Agent tasks can take several minutes. Without queue-based async decoupling, you will inevitably run into API gateway timeouts.
- Deferring memory design for later — The short-term/long-term memory architecture must be decided at the initial design stage. Adding it later makes consistency problems between session state and the vector index extremely complex.
- Spinning up too many agents — Contrary to the intuition that "more agents means better results," coordination overhead between agents frequently becomes the bottleneck. It's better to start with a single agent and split only at points where parallel processing is genuinely needed.
Closing Thoughts
The core of agent backend design is accepting the LLM's non-deterministic nature and building observable, recoverable infrastructure on top of it.
Trying to build a perfect system from the start leads easily to over-engineering. The following incremental approach is the production-validated path.
- Start by building a single agent + async queue — After `pip install langgraph redis`, you can start by connecting a simple ReAct loop to Python's `asyncio.Queue` and decoupling it from the HTTP endpoint. Thinking about complex orchestration once this foundation is stable is never too late.
- Set up observability before writing code — You can enable LangSmith tracing with just the `LANGCHAIN_TRACING_V2=true` environment variable. Without an environment where you can visually confirm which path the agent is taking, debugging becomes extremely difficult. This is not optional — it's essential.
- Design the memory layer as a Redis + pgvector combination — Spin up a local environment with `docker compose up redis postgres`, then implement a structure where session context is stored in Redis and embedded long-term memory in pgvector. Walking through this yourself is a great way to build intuition for memory design.
References
- AI Agent Architecture: Build Systems That Work in 2026 | Redis
- The Architectural Shift: AI Agents Become Execution Engines | InfoQ
- Architecture overview — Model Context Protocol Official Documentation
- AI Engineering Trends in 2025: Agents, MCP and Vibe Coding | The New Stack
- Designing AI-Native Backends: Architecture Patterns for Production LLMs | Medium
- Event-Driven Architecture for AI Agents: Production Patterns | Sandipan Haldar
- State Management Patterns for AI Agents | AI Fluens
- 5 Production Scaling Challenges for Agentic AI in 2026 | MachineLearningMastery
- Design Patterns for Long-Term Memory in LLM-Powered Architectures | Serokell
- State of Agent Engineering | LangChain
- AI Agent Orchestration Patterns | Azure Architecture Center
- Spring AI Agentic Patterns Part 6: AutoMemoryTools | Spring.io
- Top 5 AI Agent Observability Platforms in 2026 | Maxim AI
- Best AI Agent Frameworks 2025: LangGraph, CrewAI, OpenAI, LlamaIndex, AutoGen | Maxim AI
Next post: From LLM-as-Judge to latency tracing — a practical guide to building an agent Eval pipeline (covering how to test non-deterministic systems and establish quality standards)