Building Type-Safe AI Agents with PydanticAI — How We Caught 23 Bugs Before Production
Looking back on the first time I deployed an AI agent to production still gives me chills. I plastered try/except blocks all around the JSON-parsing code for LLM responses, and I can't count how many times the entire pipeline collapsed the moment a response came back in a slightly different format. Once, a confidence field came back as "high" instead of 0.87, and it blew up the entire downstream numerical calculation. That's when I thought, "What if we could enforce types on LLM output from the start?" — and PydanticAI hits exactly that point.
PydanticAI is an agent framework from the team behind Pydantic, the validation library that FastAPI is built on, and its stated goal is to do for AI agent development what FastAPI did for web development. The V1 stable release came out in September 2025, and since then it has surpassed 15 million cumulative downloads and 16,000 GitHub stars. In a comparative measurement of equivalent functionality (an analytics agent with structured output, dependency injection, and tool registration), the PydanticAI implementation needed 43% less code than the LangGraph one and caught 23 more type errors during development. If you're still parsing LLM responses as text while reading this, you're likely repeating the same class of bugs over and over.
In this article, we'll walk through PydanticAI's three core mechanisms — type-safe output, dependency injection, and tool registration — with real code examples, and give an honest breakdown of when to choose PydanticAI versus when to reach for another framework.
Core Concepts
Type-Safe Output: Enforcing LLM Responses into Pydantic Models
The biggest differentiator of PydanticAI is that it automatically maps LLM responses to a Pydantic BaseModel. It doesn't just parse JSON — it guarantees data integrity by retrying or raising type exceptions when the format is wrong.

    from pydantic import BaseModel
    from pydantic_ai import Agent


    class ResponseModel(BaseModel):
        answer: str
        confidence: float


    agent = Agent(
        model="openai:gpt-4o",
        output_type=ResponseModel,  # named result_type in older releases
        system_prompt="You are a helpful assistant.",
    )

    result = agent.run_sync("What is the capital of France?")
    print(result.output.answer)      # e.g. "Paris"
    print(result.output.confidence)  # e.g. 0.99

In async environments, you can use await agent.run() instead of agent.run_sync(). In this article, we use run_sync() in synchronous contexts and await agent.run() inside async functions. If Python async patterns are unfamiliar to you, run_sync() is more than enough to get started.
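All you need to do is pass ResponseModel as the agent's output type (the output_type argument, named result_type in older releases). If the LLM returns confidence as a numeric string such as "0.87", Pydantic coerces it to a float automatically. If coercion is not possible, as with "high", the validation error is sent back to the model so it can try again; the number of retries is set by the retries argument on Agent and defaults to 1, and once it is exhausted the run raises an exception instead of passing bad data downstream. I also initially thought, "Can't I just write 'respond in JSON format' in the prompt?" The key difference is that prompt instructions can be ignored by the LLM, whereas type validation is enforced at the code level.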
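The coercion and rejection behavior described above comes from Pydantic itself, so it can be tried without calling a model. The following is a minimal sketch, assuming Pydantic v2 is installed; the JSON strings stand in for raw LLM output and are made up for illustration.

```python
from pydantic import BaseModel, ValidationError


class ResponseModel(BaseModel):
    answer: str
    confidence: float


# A numeric string is coerced to float in Pydantic's default (lax) mode.
ok = ResponseModel.model_validate_json('{"answer": "Paris", "confidence": "0.87"}')
print(type(ok.confidence).__name__, ok.confidence)

# A non-numeric string cannot be coerced, so validation fails.
try:
    ResponseModel.model_validate_json('{"answer": "Paris", "confidence": "high"}')
    rejected = False
except ValidationError as exc:
    rejected = True
    print(exc.error_count(), "validation error(s)")
```

This is the same failure mode as the "high" instead of 0.87 incident from the introduction, except that it surfaces at the validation boundary rather than in a downstream calculation.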
Internally, structured output is handled through three paths:
| Path | How It Works | Best For |
|---|---|---|
| Tool call-based extraction | LLM returns data as a function call | Models with Function Calling support like OpenAI, Anthropic |
| Provider-managed JSON schema | Uses the model API's response_format | Modern models with JSON Mode support |
| Prompt-injected formatting | Injects schema into the system prompt | Fallback for models without Function Calling support |
Pydantic BaseModel is the base class of Pydantic, a Python data validation library. Declare fields with type hints on the class, and type validation and coercion happen automatically at instance creation time.
Dependency Injection: Cleanly Separating Test and Production Code
One reason AI agents are hard to test is that external dependencies like DB connections and API clients get tangled up inside agent logic. Dependency Injection is a pattern where these external objects aren't hardcoded inside the code but are instead passed in from outside at runtime. PydanticAI solves this the same way FastAPI does.

    from dataclasses import dataclass

    from pydantic_ai import Agent, RunContext


    @dataclass
    class Deps:
        db_conn: DatabaseConnection  # Replace with your actual DB client
        api_key: str


    agent = Agent(
        model="anthropic:claude-3-5-sonnet-latest",
        deps_type=Deps,
    )


    @agent.tool
    async def get_user_data(ctx: RunContext[Deps], user_id: str) -> dict:
        return await ctx.deps.db_conn.fetch(
            "SELECT * FROM users WHERE id = $1", user_id
        )


    # Production run (inside an async function)
    result = await agent.run(
        "Get data for user 123",
        deps=Deps(db_conn=real_db, api_key="prod-key"),
    )

    # Test run: just swap in a mock object
    result = await agent.run(
        "Get data for user 123",
        deps=Deps(db_conn=mock_db, api_key="test-key"),
    )

Those who've used it know how convenient it is in practice to run unit tests without a real DB connection. Declare the dependency type with deps_type, and inject the actual object at agent.run(deps=...) time, which means that in a test environment you can simply swap in a mock object.
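Because the tool only reaches the database through the injected dependency object, the same pattern can be exercised without a database or an LLM. The sketch below uses only the standard library. FakeDB, the Database protocol, fetch_user, and the canned row are hypothetical stand-ins, and fetch_user takes the deps object directly rather than a RunContext.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Protocol


class Database(Protocol):
    """Hypothetical interface: anything with an async fetch() method."""

    async def fetch(self, query: str, *args: Any) -> list[dict]: ...


class FakeDB:
    """In-memory stand-in that records queries and returns canned rows."""

    def __init__(self, rows: list[dict]) -> None:
        self.rows = rows
        self.calls: list[tuple] = []

    async def fetch(self, query: str, *args: Any) -> list[dict]:
        self.calls.append((query, args))
        return self.rows


@dataclass
class Deps:
    db_conn: Database
    api_key: str


async def fetch_user(deps: Deps, user_id: str) -> list[dict]:
    # Same query as the get_user_data tool, minus the RunContext wrapper.
    return await deps.db_conn.fetch("SELECT * FROM users WHERE id = $1", user_id)


fake = FakeDB(rows=[{"id": "123", "name": "Test User"}])
rows = asyncio.run(fetch_user(Deps(db_conn=fake, api_key="test-key"), "123"))
print(rows)
print(fake.calls)
```

The fake records every query it receives, so a test can assert both on the returned rows and on the SQL and parameters the tool actually sent.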
Tool Registration: Turning Python Functions into the LLM's Hands and Feet
With just the @agent.tool decorator, you can register an ordinary Python function as a tool the LLM can call. The LLM reads the function's docstring and type hints to decide when to use that tool.

    @agent.tool
    async def query_sales(ctx: RunContext[Deps], region: str) -> list[dict]:
        """Retrieves sales data for the specified region.

        Args:
            region: The name of the region to query (e.g., 'seoul', 'busan')
        """
        return await ctx.deps.db_conn.fetch(
            "SELECT * FROM sales WHERE region = $1", region
        )


    @agent.tool
    async def calculate_growth_rate(
        ctx: RunContext[Deps],
        current: float,
        previous: float,
    ) -> float:
        """Calculates the growth rate compared to the previous period."""
        # Note: raises ZeroDivisionError if previous is 0
        return (current - previous) / previous * 100

Function Calling (tool calling): A feature that lets the LLM request a call to a specific function instead of answering directly in text. If the LLM determines that "answering this question requires a DB query," it calls the query_sales function, receives the result, and then generates the final response.
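To make concrete what a framework can learn from a plain function, here is a rough sketch using only the standard library. It is not PydanticAI's internal implementation. describe_tool is a hypothetical helper, and the query_sales stub drops the RunContext parameter and the database call and adds a made-up limit parameter for illustration.

```python
import inspect
from typing import get_type_hints


def query_sales(region: str, limit: int = 10) -> list:
    """Retrieves sales data for the specified region."""
    return []


def describe_tool(func) -> dict:
    """Collect the name, docstring, and parameter types of a function."""
    hints = get_type_hints(func)
    params = {
        name: hints[name].__name__
        for name in inspect.signature(func).parameters
        if name in hints
    }
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func),
        "parameters": params,
    }


spec = describe_tool(query_sales)
print(spec)
```

Everything in that dictionary comes from the function definition itself, which is why a precise docstring and accurate type hints matter: they are the only description of the tool the model gets.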
Let's look at how these three mechanisms combine in practice using financial domain code.
Real-World Application
Example 1: Financial Data Analysis Agent
In the financial domain, type safety is not optional — it's required. If revenue comes in as a string or growth_rate is missing as it flows into downstream systems, the consequences are serious. PydanticAI's Pydantic validation acts as that firewall.

    from dataclasses import dataclass
    from typing import Optional

    from pydantic import BaseModel, Field
    from pydantic_ai import Agent, RunContext


    class SalesReport(BaseModel):
        region: str
        total_revenue: float = Field(ge=0, description="Total revenue (KRW)")
        growth_rate: float = Field(description="Growth rate vs. previous period (%)")
        top_product: str
        risk_level: str = Field(pattern="^(low|medium|high)$")
        summary: str
        action_items: list[str]


    @dataclass
    class AnalyticsDeps:
        db_conn: DatabaseConnection  # Replace with your actual DB client
        report_date: str


    agent = Agent(
        model="openai:gpt-4o",
        output_type=SalesReport,  # named result_type in older releases
        deps_type=AnalyticsDeps,
        system_prompt="""
        You are a financial analyst. Analyze sales data and provide
        structured reports with actionable insights.
        """,
    )


    @agent.tool
    async def get_regional_sales(
        ctx: RunContext[AnalyticsDeps],
        region: str,
    ) -> dict:
        """Retrieves sales data for a specific region."""
        return await ctx.deps.db_conn.fetch(
            "SELECT * FROM sales WHERE region = $1 AND date = $2",
            region, ctx.deps.report_date
        )


    @agent.tool
    async def get_previous_period(
        ctx: RunContext[AnalyticsDeps],
        region: str,
    ) -> Optional[dict]:
        """Retrieves comparison data from the previous period."""
        return await ctx.deps.db_conn.fetch_one(
            "SELECT total_revenue FROM sales WHERE region = $1 AND date < $2 ORDER BY date DESC LIMIT 1",
            region, ctx.deps.report_date
        )


    # Run (inside an async function)
    deps = AnalyticsDeps(db_conn=db, report_date="2026-05-01")
    result = await agent.run("Analyze May sales performance for the Seoul region", deps=deps)
    print(result.output.risk_level)    # Guaranteed to be one of "low", "medium", "high"
    print(result.output.growth_rate)   # Guaranteed to be float type
    print(result.output.action_items)  # Guaranteed to be list[str] type

The validation options on each field are what make this work:
| Code Element | Role |
|---|---|
| Field(ge=0) | Blocks negative revenue at the source |
| Field(pattern="^(low\|medium\|high)$") | Only passes allowed values (low, medium, high) |
| output_type=SalesReport (result_type in older releases) | Validates the entire LLM output through Pydantic |
| deps_type=AnalyticsDeps | Decouples DB connection from agent logic |
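The field constraints can be checked in isolation, without an agent. This is a small sketch assuming Pydantic v2, using a trimmed-down version of the model (SalesReportLite and failing_fields are hypothetical names) and hand-written payloads in place of LLM output.

```python
from pydantic import BaseModel, Field, ValidationError


class SalesReportLite(BaseModel):
    region: str
    total_revenue: float = Field(ge=0)
    risk_level: str = Field(pattern="^(low|medium|high)$")


def failing_fields(payload: dict) -> list[str]:
    """Return the names of fields that fail validation (empty if valid)."""
    try:
        SalesReportLite.model_validate(payload)
        return []
    except ValidationError as exc:
        return sorted(str(err["loc"][0]) for err in exc.errors())


valid = {"region": "seoul", "total_revenue": 1_200_000, "risk_level": "low"}
bad = {"region": "seoul", "total_revenue": -5, "risk_level": "severe"}

print(failing_fields(valid))
print(failing_fields(bad))
```

Declaring the field as Literal["low", "medium", "high"] from the typing module would express the same constraint as a type rather than a regular expression, which also lets static type checkers see the allowed values.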
Example 2: Human-in-the-Loop Approval Workflow
For sensitive operations (transfers, deletions, deployments, etc.), when you want to clearly distinguish between "situations where the agent can execute autonomously" and "situations that require human intervention," the structured output pattern provides a clean solution. The approach is to have the model fill in a requires_approval field based on business logic, and handle the branching at the code level.

    from pydantic import BaseModel
    from pydantic_ai import Agent


    class TransferRequest(BaseModel):
        from_account: str
        to_account: str
        amount: float
        reason: str
        requires_approval: bool


    agent = Agent(
        model="anthropic:claude-3-5-sonnet-latest",
        output_type=TransferRequest,  # named result_type in older releases
        system_prompt="""
        Process transfer requests. For amounts over 1,000,000 KRW,
        set requires_approval to True.
        """,
    )

    # Inside an async function; request_human_approval and execute_transfer
    # are your own application functions
    result = await agent.run(
        "Process a transfer of 1.5 million KRW from account A to account B"
    )
    transfer = result.output

    if transfer.requires_approval:
        # The point of human intervention is clearly expressed through types
        await request_human_approval(transfer)
    else:
        await execute_transfer(transfer)

A single requires_approval field lets you clearly distinguish "automatic processing" from "cases requiring human review" at the code level. Back when we parsed LLM response text for phrases like "approval required" to branch logic, we'd constantly struggle with ambiguous expressions like "it seems approval may be needed." In this structure, the bool type eliminates that ambiguity. Keep in mind that the value itself is still chosen by the model: the type removes ambiguity in how the decision is expressed, not the need to verify the decision.
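Since the threshold itself is plain arithmetic, it can also be re-checked in code so that the approval decision does not depend solely on the model following the prompt. This is a minimal sketch with the standard library. Transfer, APPROVAL_THRESHOLD_KRW, and needs_approval are hypothetical names, and the dataclass stands in for the validated TransferRequest.

```python
from dataclasses import dataclass

APPROVAL_THRESHOLD_KRW = 1_000_000  # assumed business rule, taken from the prompt


@dataclass
class Transfer:
    amount: float
    requires_approval: bool  # value filled in by the model


def needs_approval(transfer: Transfer) -> bool:
    """Require approval if the model asked for it or the amount exceeds the limit."""
    return transfer.requires_approval or transfer.amount > APPROVAL_THRESHOLD_KRW


# The model forgot to set the flag on a 1.5 million KRW transfer.
forgot_flag = needs_approval(Transfer(amount=1_500_000, requires_approval=False))
# A small transfer that the model flagged anyway.
flagged_small = needs_approval(Transfer(amount=50_000, requires_approval=True))
# A small, unflagged transfer.
plain_small = needs_approval(Transfer(amount=50_000, requires_approval=False))

print(forgot_flag, flagged_small, plain_small)
```

Combining the two signals with "or" means the model can escalate a transfer for reasons the rule does not cover, but it cannot wave through one that the rule says needs review.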
Pros and Cons Analysis
Advantages
| Item | Details |
|---|---|
| Type safety | 23 additional type errors caught during development (compared to equivalent implementations in LangGraph and CrewAI) |
| Code conciseness | PydanticAI 160 lines vs. LangGraph 280 lines vs. CrewAI 420 lines for equivalent functionality |
| Testability | Dependency injection structure enables unit testing without real APIs |
| Model agnostic | Supports all major models including OpenAI, Anthropic, Gemini, DeepSeek |
| Durability | Durable Agent support that preserves execution state through API failures and restarts |
| Free open source | MIT license, no additional cost beyond LLM API fees |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Multi-agent limitations | No built-in support for role-based multi-agent systems | Mix with CrewAI or LangGraph |
| Ecosystem size | Relatively smaller third-party ecosystem compared to LangChain's 300+ integrations | Expand external tool connectivity via MCP integration |
| Complex state management | LangGraph-level checkpointing and workflow state management requires manual implementation | Combine with LangGraph for complex workflows |
| Community resources | Fewer references and examples compared to LangChain | High-quality official documentation compensates |
The multi-agent limitation hit me hardest when I tried to connect and orchestrate three agents with different roles. In the end, I handled just that part with LangGraph and kept the individual agent logic in PydanticAI — a hybrid structure. The two frameworks work well together, so the combination itself isn't difficult.
Durable Agent: A feature that saves the in-progress state to a checkpoint during agent execution, so if an API failure or server restart occurs, execution can resume from where it left off. Particularly useful for long-running batch jobs and complex multi-step agents.
The Most Common Mistakes in Production
- Creating an agent without an output type: Type safety is the core of PydanticAI, but omitting output_type (result_type in older releases) means you're just getting a string back, no different from a regular LLM call. It's best to design the BaseModel first.
- Writing vague tool docstrings: The docstring is the basis on which the LLM decides which tool to use in which situation. Rather than vague descriptions like "retrieves data," writing something specific like "retrieves sales volume for the Seoul region in Q4 2024 within a given date range" improves tool selection accuracy.
- Passing DB connections through global variables instead of using dependency injection: It seems convenient at first, but it makes it impossible to swap out the real DB for a mock in test code. Using deps_type and RunContext from the start saves a lot of pain later.
Closing Thoughts
Stop "trusting and parsing" LLM responses and start "proving and using" them: that is the shift PydanticAI proposes. If you're in a domain where data integrity matters (finance, healthcare, security) or you're a team that wants to build agents quickly with a testable structure, it's worth trying.
Three steps to get started right now:
- pip install pydantic-ai: installation is a single line.
- Define one BaseModel, create an Agent(output_type=YourModel) (result_type in older releases), and run it to immediately get a feel for LLM responses being bound to types.
- If you have existing LangChain code, re-implement the simplest chain and directly compare the difference in code volume and how early type errors are caught.