LLM Agent Backend Design Patterns: Surviving Production with ReAct, LangGraph, and Temporal
Honestly, when I first designed an AI agent backend, I approached it the same way I would a REST API — and paid for it. I thought, "Can't I just make one endpoint that calls the LLM?" But once I pushed it to production, timeouts, state management errors, and unpredictable execution flows all blew up at once. Concretely, here's what happened: a 30-second agent loop got caught by the API gateway timeout, so the client received an error while the Worker kept running; the session context stored in Redis was wiped on a single Worker restart, making it look to users as if the agent had completely forgotten the previous conversation. Tracking down the cause was its own problem — there was no way to trace which path the LLM had taken, making debugging extremely painful.
In this post, we'll look at how the four core components of an agent backend (LLM Core, Tool Calling, Memory Layer, and Orchestration) must fit together to build a system that survives production, and when to choose which tools — with real code. By the end, you'll be able to set up a structure that runs agents without timeouts, state management that recovers from mid-execution failures, and an environment where you can visually observe the execution flow.
If you have existing backend experience, you can follow along just fine. In fact, that experience serves as a great anchor for understanding "where the traditional approach breaks down."
Core Concepts
Why an Agent Backend Is Different from a Traditional Backend
A typical backend executes fixed business logic when a request arrives and returns a response. The execution path is predetermined, and the same input produces the same output. This makes unit testing easy and monitoring simple.
An agent backend is fundamentally different. The LLM decides at runtime "what to do next." Which tools to call, how many times to iterate, and when to stop all depend on the LLM's reasoning. The moment you accept this, you realize the design philosophy must change. For example, the question "what happens if this function fails?" becomes "how do we recover when this execution path goes in an unexpected direction?"
| Component | Role |
|---|---|
| LLM Core | Reasoning, planning, deciding the next action |
| Tool Calling | Connecting to the real world via external APIs, databases, code execution, etc. |
| Memory Layer | Short-term (context window) and long-term (vector DB, KV Store) memory |
| Orchestration | Multi-agent routing, state machines, failure recovery |
The ReAct Pattern: The De Facto Standard Execution Loop
No discussion of agent execution flow is complete without the ReAct pattern. It's a structure where the LLM cycles through Thought → Action → Observation, and most production systems today are built on this pattern.
When I first implemented this pattern in code, I fell into a trap: I only checked finish_reason == "stop" and didn't set an iteration limit, and ended up in a situation where the LLM couldn't break out of the loop. In practice, always set max_iterations.
```python
# agent/core.py — simplified conceptual structure of a ReAct loop
class MaxIterationsExceeded(Exception):
    pass

async def react_loop(
    user_input: str,
    tools: list[Tool],
    max_iterations: int = 10
) -> str:
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_iterations):
        # Thought: the LLM reasons about the next action
        response = await llm.invoke(messages, tools=tools)
        if response.finish_reason == "stop":
            return response.content
        # The assistant message carrying the tool calls must be appended
        # before the tool-role messages that answer it
        messages.append(response)
        # Action: invoke the requested tools
        for call in response.tool_calls:
            # Observation: append each tool's result to the context
            result = await execute_tool(call.name, call.args)
            messages.append({"role": "tool", "content": result})
    raise MaxIterationsExceeded(f"Exceeded {max_iterations} iterations; forcing termination")
```

ReAct (Reasoning + Acting): An agent execution pattern where the LLM doesn't merely generate text, but instead reaches its goal by repeatedly cycling through "think → act → observe results." Proposed by a Google research team in 2022, most modern agent frameworks are built on this foundation.
Memory Layer: Design Short-Term and Long-Term Separately
I was confused about this at first too, but the industry standard is to design agent memory in two distinct layers.
- Short-term memory: The current conversation history inside the LLM's context window. Managed per session with Redis for sub-millisecond read speeds.
- Long-term memory: Domain knowledge, user preferences, and past decisions that must persist beyond a session. Embedded into a vector DB like pgvector or Pinecone and retrieved via similarity search. In practice, you need to co-design the choice of embedding model (e.g., `text-embedding-3-small`), the document chunking strategy (a 512-token sliding window is common), and the position where retrieved results are injected into the context window — typically at the front, before the user message.
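To make the chunking strategy concrete, here's a minimal sliding-window chunker. It's a sketch: whitespace-split words stand in for tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the 512/64 sizes are just the illustrative defaults mentioned above.

```python
def chunk_sliding_window(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size tokens.

    Words approximate tokens here; swap in the embedding model's
    tokenizer before using this for real indexing.
    """
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap  # each window starts `overlap` words before the previous one ended
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk, at the cost of slightly more storage.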
```python
# agent/memory.py
class AgentMemory:
    def __init__(self):
        # Real deployments need connection settings, e.g.:
        # Redis(host="localhost", port=6379, decode_responses=True)
        self.short_term = Redis()      # session context, fast reads
        self.long_term = PGVector()    # embedding-based long-term memory
        self.checkpoint = Postgres()   # durable store for failure recovery

    async def retrieve_context(self, query: str, session_id: str) -> dict:
        # Short-term: recent messages from the current session
        recent = await self.short_term.get(f"session:{session_id}")
        # Long-term: retrieve the k=5 past memories semantically closest to the query.
        # Results are injected right after the system prompt, before the user message.
        relevant = await self.long_term.similarity_search(query, k=5)
        return {"recent": recent, "relevant": relevant}
```

MCP: The New Standard for Tool Integration
The Model Context Protocol (MCP), published by Anthropic in November 2024, has rapidly become the standard for agent tool integration starting in 2025. It has since been donated to the Linux Foundation's AAIF, where it is evolving as an open standard.
The key difference from traditional function calling is that tool definitions are not hardcoded inside the LLM host code. An MCP server runs as an independent process and communicates with the LLM host via standard endpoints like tools/list and tools/call. This means the same MCP server can be connected to Claude, GPT-4o, Gemini, or any other model, and the tool implementation can be managed completely separately from the LLM logic.
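To make the wire format concrete, here's a sketch of the JSON-RPC 2.0 messages behind those two endpoints. The field names reflect my reading of the MCP spec (`inputSchema` as JSON Schema, `tools/call` taking `name` plus `arguments`); verify against the official schema before relying on them.

```python
import json

# What the host sends to discover tools, and a typical reply
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [{
            "name": "web_search",
            "description": "Search the web for up-to-date information.",
            "inputSchema": {  # JSON Schema describing the tool's arguments
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        }]
    },
}

# Invoking a tool: the host passes the tool name plus its arguments
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "web_search", "arguments": {"query": "MCP spec"}},
}

# Everything on the wire is plain JSON, which is exactly why any LLM host can speak it
print(json.dumps(call_request)[:40])
```

Because the host discovers tools at runtime via `tools/list`, adding a tool to the server requires no change to the LLM-side code.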
```python
# mcp_server/search_server.py — a simple MCP server built with FastMCP
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("search-server")

@mcp.tool()
async def web_search(query: str) -> str:
    """Search the web for up-to-date information."""
    results = await search_api.search(query)  # search_api: your search client of choice
    return results.format_as_text()

if __name__ == "__main__":
    mcp.run()  # serves tools/list and tools/call (stdio transport by default)
```

MCP (Model Context Protocol): A protocol that standardizes how LLMs interact with external tools and data sources. Just as REST APIs standardized communication between web services, MCP standardizes the connection between LLMs and tools.
Practical Application
Before diving into three examples, it helps to first discuss which tool to choose in which situation.
| Situation | Recommended Choice |
|---|---|
| Simple tasks, quick prototype | FastAPI + asyncio.Queue (or BullMQ) |
| Conditional branching, retries, multi-agent | LangGraph |
| Dozens of steps, multi-day execution, zero tolerance for data loss | Temporal |
Escaping HTTP Timeouts with an Async Queue
Agent tasks can take anywhere from tens of seconds to several minutes. Handling this inside a synchronous HTTP handler will inevitably cause API gateway timeouts and client disconnection issues. The pattern that has proven itself in production is queue-based async decoupling: the HTTP endpoint immediately returns only a task_id, and the actual agent loop runs in a separate Worker process.
```python
# agent/api.py — FastAPI endpoints
import json
import uuid

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from redis.asyncio import Redis

app = FastAPI()
redis = Redis(host="localhost", port=6379, decode_responses=True)

class TaskRequest(BaseModel):
    input: str
    tools: list[str] = []

def generate_task_id() -> str:
    return uuid.uuid4().hex

@app.post("/agent/task")
async def submit_task(request: TaskRequest) -> dict:
    task_id = generate_task_id()
    # Push to the queue and return the task_id immediately — the Worker runs the agent
    await redis.lpush("agent:queue", json.dumps({
        "task_id": task_id,
        "input": request.input,
        "tools": request.tools
    }))
    return {"task_id": task_id, "status": "queued"}

@app.get("/agent/task/{task_id}/stream")
async def stream_result(task_id: str):
    # Stream agent progress to the client over SSE.
    # Production also needs a timeout (e.g., 30s) and disconnect handling.
    async def event_generator():
        pubsub = redis.pubsub()
        await pubsub.subscribe(f"agent:result:{task_id}")
        try:
            async for message in pubsub.listen():
                if message["type"] == "message":
                    yield f"data: {message['data']}\n\n"
                    if json.loads(message["data"]).get("done"):
                        break
        finally:
            await pubsub.unsubscribe(f"agent:result:{task_id}")
    return StreamingResponse(event_generator(), media_type="text/event-stream")
```

```python
# agent/worker.py — the process that runs the actual agent loop
async def agent_worker():
    while True:
        # brpop blocks while the queue is empty and pops as soon as a task arrives
        _, task_data = await redis.brpop("agent:queue")
        task = json.loads(task_data)
        async for step_result in run_agent_loop(task):
            await redis.publish(
                f"agent:result:{task['task_id']}",
                json.dumps(step_result)
            )
```

| Component | Role | Technology Choice |
|---|---|---|
| HTTP Endpoint | Accepts task, returns task_id | FastAPI / NestJS |
| Job Queue | Async task buffering | Redis / SQS |
| Worker | Executes the actual agent loop | Separate process/container |
| Streaming | Delivers progress to the client | SSE / WebSocket |
Implementing Conditional Branching Multi-Agent Orchestration with LangGraph
Once the simple queue pattern is in place, the next wall you hit is complex branching logic. You often need flows that don't just "start B when A finishes," but instead route to different agents based on conditions and retry when quality falls below a threshold. LangGraph's graph-based state machine solves this problem cleanly.
```python
# agent/graph.py — LangGraph multi-agent pipeline
import operator
from typing import TypedDict, Annotated

from langgraph.graph import StateGraph, END

# TypedDict declares the state schema; Annotated[list, operator.add] is a reducer
# declaration that appends each node's return value to the list
class AgentState(TypedDict):
    query: str
    retrieved_docs: list[str]
    draft_answer: str
    quality_score: float
    messages: Annotated[list, operator.add]

async def retriever_node(state: AgentState) -> dict:
    docs = await vector_db.similarity_search(state["query"])
    return {"retrieved_docs": docs}

async def summarizer_node(state: AgentState) -> dict:
    draft = await llm.invoke(
        f"Summarize the following documents into an answer:\n{state['retrieved_docs']}"
    )
    return {"draft_answer": draft}

async def critic_node(state: AgentState) -> dict:
    # LLM-as-judge pattern: a separate LLM call scores answer quality from 0 to 1.
    # The prompt spells out the criteria (relevance, completeness, factual accuracy)
    # and the response is parsed as JSON.
    score = await evaluate_quality(state["draft_answer"], state["query"])
    return {"quality_score": score}

def route_after_critic(state: AgentState) -> str:
    if state["quality_score"] < 0.7:
        return "retriever"  # quality below threshold → re-query
    return END

workflow = StateGraph(AgentState)
workflow.add_node("retriever", retriever_node)
workflow.add_node("summarizer", summarizer_node)
workflow.add_node("critic", critic_node)
workflow.set_entry_point("retriever")
workflow.add_edge("retriever", "summarizer")
workflow.add_edge("summarizer", "critic")
workflow.add_conditional_edges("critic", route_after_critic)
# redis_checkpointer: persists each node's output to Redis so a run can resume
# after a mid-execution failure
agent = workflow.compile(checkpointer=redis_checkpointer)
```

Checkpointing: Recording each step of agent execution to a persistent store. If a failure occurs mid-run, execution can resume from the last checkpoint, avoiding the cost of restarting a long task from scratch.
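It can help to see what the graph abstraction buys you by hand-rolling the same retriever → summarizer → critic → conditional-retry flow in plain Python. The stubs below are mine (each retry pretends to fetch one more document so quality improves); note how a retry bound is needed here for the same reason `max_iterations` was in the ReAct loop:

```python
END = "__end__"

def retriever(state: dict) -> dict:
    # Stub: pretend each retry fetches one additional document
    return {**state, "retrieved_docs": state.get("retrieved_docs", []) + ["doc"]}

def summarizer(state: dict) -> dict:
    return {**state, "draft_answer": f"summary of {len(state['retrieved_docs'])} docs"}

def critic(state: dict) -> dict:
    # Stub scoring: quality rises as more documents accumulate
    return {**state, "quality_score": 0.4 * len(state["retrieved_docs"])}

def route_after_critic(state: dict) -> str:
    return "retriever" if state["quality_score"] < 0.7 else END

def run_pipeline(query: str, max_retries: int = 3) -> dict:
    state = {"query": query}
    for _ in range(max_retries):
        for node in (retriever, summarizer, critic):
            state = node(state)
        if route_after_critic(state) == END:
            return state
    return state  # best effort after exhausting retries
```

Everything LangGraph adds on top of this loop — typed state with reducers, per-node checkpointing, visualizable topology — is exactly what this hand-rolled version lacks once the graph grows past a handful of nodes.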
Guaranteeing Zero Data Loss with Temporal
Simple queues like BullMQ can lose in-progress tasks if the Worker process dies. For agent tasks that consist of dozens of steps, run over multiple days, or where losing intermediate results is absolutely unacceptable, a durable execution engine like Temporal is the right fit. If LangGraph solves "complexity of logic," Temporal solves "reliability of execution."
```python
# agent/temporal_workflow.py
from dataclasses import dataclass
from datetime import timedelta

from temporalio import workflow, activity
from temporalio.common import RetryPolicy

@dataclass
class LLMResponse:
    content: str
    is_final: bool
    tool_name: str | None = None
    tool_args: dict | None = None

@activity.defn
async def call_llm(prompt: str) -> LLMResponse:
    raw = await llm.invoke(prompt)
    # Parse the raw LLM output into a structured LLMResponse
    return parse_llm_response(raw)

@activity.defn
async def execute_tool(tool_name: str, args: dict) -> str:
    return await tool_registry.execute(tool_name, args)

@workflow.defn
class AgentWorkflow:
    @workflow.run
    async def run(self, user_input: str) -> str:
        messages = [user_input]
        for _ in range(10):  # guard against infinite loops
            # Each Activity retries automatically on failure,
            # and its result is persisted durably
            response: LLMResponse = await workflow.execute_activity(
                call_llm,
                args=[str(messages)],
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(maximum_attempts=3)
            )
            if response.is_final:
                return response.content
            tool_result = await workflow.execute_activity(
                execute_tool,
                args=[response.tool_name, response.tool_args],
                start_to_close_timeout=timedelta(seconds=60)
            )
            messages.append(tool_result)
        return "Maximum iterations exceeded"
```

Pros and Cons
Advantages
| Item | Description |
|---|---|
| Flexibility | Without fixed logic, the LLM dynamically forms tool combinations and plans at runtime |
| Scalability | Multi-agent parallel processing of subtasks handles complex workflows |
| Knowledge utilization | RAG + long-term memory continuously accumulates domain expertise across sessions |
| Complex workflows | Capable of autonomously handling hundreds of steps with dependencies |
Disadvantages and Caveats
| Item | Description | Mitigation |
|---|---|---|
| Non-deterministic behavior | The same input can generate different execution paths; traditional unit testing is not applicable | Build a separate Eval pipeline with tools like LangSmith or Arize Phoenix |
| Lack of observability | Per LangChain surveys, fewer than 1 in 3 implementers are satisfied | Combine OpenTelemetry with a dedicated platform (LangSmith, Maxim AI) |
| Orchestration complexity | Multi-agent coordination overhead can become a bigger bottleneck than LLM calls | Measure per-step performance, minimize agent count, split only where necessary |
| Autonomous completion rate limits | Per Carnegie Mellon benchmarks, even top-performing agents autonomously complete only 30–35% of multi-step tasks | Design with Human-in-the-loop as a baseline assumption |
| State management complexity | Cross-session memory consistency, checkpoint recovery on failure | Redis (speed) + Postgres (durability) hybrid design |
| Security | Risks of prompt injection and tool misuse | ABAC access control, Zero Retention techniques, add a guardrail layer |
The table alone can feel abstract, but the issue I personally suffered through most was "lack of observability." Debugging with no way to see why the agent called the wrong tool, or what context led to a bad decision, was like wrestling with a black box. There are moments when setting a single environment variable, `LANGCHAIN_TRACING_V2=true`, buys you more than any amount of debugging code.
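Even before adopting a tracing platform, a hand-rolled trace log answers the basic question "which tool did the agent call, with what arguments, and how long did it take?" This is a sketch, not a library API — `trace_log`, `traced`, and the sample tool are all illustrative names:

```python
import functools
import time
from typing import Any, Callable

trace_log: list[dict[str, Any]] = []

def traced(tool_fn: Callable) -> Callable:
    """Record every tool invocation: name, args, result, and latency."""
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = tool_fn(*args, **kwargs)
        trace_log.append({
            "tool": tool_fn.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": result,
            "latency_ms": (time.perf_counter() - start) * 1000,
        })
        return result
    return wrapper

@traced
def lookup_order(order_id: str) -> str:
    return f"order {order_id}: shipped"  # stand-in for a real tool call

lookup_order("A-123")
print(trace_log[0]["tool"])  # lookup_order
```

In a real system you'd emit these records as OpenTelemetry spans rather than appending to a list, but the shape of the data — tool name, inputs, output, latency — is the same.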
ABAC (Attribute-Based Access Control): A method of determining access permissions by combining attributes of the user, resource, and environment. It allows fine-grained control over which tools an agent can access and with what scope, making it a strong fit as a security layer for agent backends.
Zero Retention: A technique for configuring LLM API calls so that input/output data is not stored on the provider's servers. This is an essential consideration whenever sensitive enterprise data passes through an agent.
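As a taste of what a guardrail layer does, here's a deliberately naive input filter that blocks obvious prompt-injection phrasing before it reaches the LLM. The pattern list is illustrative only; production guardrails use trained classifiers and policy engines, since regexes are trivially bypassed.

```python
import re

# Illustrative patterns only — real injection attempts are far more varied
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"reveal .*system prompt",
    r"you are now .*unrestricted",
]

def guard_input(user_input: str) -> str:
    """Raise on inputs matching known injection patterns; pass the rest through."""
    lowered = user_input.lower()
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, lowered):
            raise ValueError(f"blocked: input matches injection pattern {pattern!r}")
    return user_input

guard_input("What's the weather in Seoul?")  # passes through unchanged
```

The same checkpoint is also where ABAC enforcement naturally lives: before a tool call executes, verify that this agent, acting for this user, is allowed to touch this resource.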
The Most Common Production Mistakes
- Handling the agent loop with synchronous HTTP — Agent tasks can take several minutes. Without queue-based async decoupling, you will inevitably run into API gateway timeouts.
- Deferring memory design for later — The short-term/long-term memory architecture must be decided at the initial design stage. Adding it later makes consistency problems between session state and the vector index extremely complex.
- Spinning up too many agents — Contrary to the intuition that "more agents means better results," coordination overhead between agents frequently becomes the bottleneck. It's better to start with a single agent and split only at points where parallel processing is genuinely needed.
Closing Thoughts
The core of agent backend design is accepting the LLM's non-deterministic nature and building observable, recoverable infrastructure on top of it.
Trying to build a perfect system from the start leads easily to over-engineering. The following incremental approach is the production-validated path.
- Start by building a single agent + async queue — After `pip install langgraph redis`, you can start by connecting a simple ReAct loop to Python's `asyncio.Queue` and decoupling it from the HTTP endpoint. Thinking about complex orchestration once this foundation is stable is never too late.
- Set up observability before writing code — You can enable LangSmith tracing with just the `LANGCHAIN_TRACING_V2=true` environment variable. Without an environment where you can visually confirm which path the agent is taking, debugging becomes extremely difficult. This is not optional — it's essential.
- Design the memory layer as a Redis + pgvector combination — Spin up a local environment with `docker compose up redis postgres`, then implement a structure where session context is stored in Redis and embedded long-term memory in pgvector. Walking through this yourself is a great way to build intuition for memory design.
References
- AI Agent Architecture: Build Systems That Work in 2026 | Redis
- The Architectural Shift: AI Agents Become Execution Engines | InfoQ
- Architecture overview — Model Context Protocol Official Documentation
- AI Engineering Trends in 2025: Agents, MCP and Vibe Coding | The New Stack
- Designing AI-Native Backends: Architecture Patterns for Production LLMs | Medium
- Event-Driven Architecture for AI Agents: Production Patterns | Sandipan Haldar
- State Management Patterns for AI Agents | AI Fluens
- 5 Production Scaling Challenges for Agentic AI in 2026 | MachineLearningMastery
- Design Patterns for Long-Term Memory in LLM-Powered Architectures | Serokell
- State of Agent Engineering | LangChain
- AI Agent Orchestration Patterns | Azure Architecture Center
- Spring AI Agentic Patterns Part 6: AutoMemoryTools | Spring.io
- Top 5 AI Agent Observability Platforms in 2026 | Maxim AI
- Best AI Agent Frameworks 2025: LangGraph, CrewAI, OpenAI, LlamaIndex, AutoGen | Maxim AI
Next post: From LLM-as-Judge to latency tracing — a practical guide to building an agent Eval pipeline (covering how to test non-deterministic systems and establish quality standards)