Apache Kafka as an AI Agent Event Broker — Scaling MCP·A2A Multi-Agent Systems
Most people start with HTTP when first setting up a multi-agent system. Agent A calls B directly, B calls C, and so on. With the first two or three agents, things work pretty well. But once you go past ten, things start to feel off. Timeouts cascade, failures become impossible to trace, and every time you add a new agent you have to touch existing code. I went through that myself, and that's when it clicked — inter-agent communication carries exactly the same problems as microservices. If you're dealing with ten or more agents, or a scale that demands thousands of events per second, this post should be useful.
When Google open-sourced the A2A protocol in April 2025, this stack began to genuinely converge. Over 50 partners — Confluent, LangChain, CrewAI, SAP, Salesforce, and more — immediately announced support, and alongside Anthropic's MCP, Apache Kafka is establishing itself as the event broker for the agent ecosystem. This post explores why Apache Kafka is well-suited for that role and what architectural patterns it enables together with MCP and A2A, grounded in real-world examples.
Core Concepts
Separation of Concerns Across Four Layers
When you first encounter this stack, MCP and A2A can look similar enough to be confusing. I had the same reaction at first — "aren't they both just agent communication?" — but their roles are clearly distinct.
| Layer | Protocol/Technology | Role | Defines |
|---|---|---|---|
| Tool Access | MCP (Anthropic) | Agent ↔ External World | How agents access APIs, DBs, and files |
| Agent Collaboration | A2A (Google) | Agent ↔ Agent | How agents discover each other and delegate tasks |
| Message Delivery | Apache Kafka | Async message flow | How messages are delivered reliably |
| Stream Processing | Apache Flink | Real-time data transformation | How event streams are aggregated and transformed into agent inputs |
In one sentence: A2A defines the language agents speak, MCP standardizes how agents access tools, Kafka guarantees message flow, and Flink processes that flow.
MCP (Model Context Protocol): A protocol that standardizes how agents interact with external tools and data sources — think USB-C for agents. It solves the problem of frameworks like LangChain and CrewAI each connecting tools in their own incompatible ways.
A2A (Agent-to-Agent Protocol): An open protocol published by Google that enables agents built with different vendors or frameworks to collaborate. Agents advertise their capabilities via an Agent Card and delegate tasks through the Task API.
Apache Flink: An engine that processes, transforms, and aggregates Kafka streams in real time. LLM-based agents can't directly handle thousands of raw sensor or transaction events per second — the token cost and latency don't add up. Flink takes that burden so each agent only needs to receive a single, refined event.
Why HTTP Breaks Down at Agent Scale
The question I hear most often is: "Why Kafka at all? Isn't HTTP/REST enough?" Honestly, for small scale, HTTP is much simpler. But as the number of agents grows, HTTP-based architecture reproduces every spaghetti problem from microservices.
# HTTP-based (point-to-point)
AgentA → AgentB
AgentA → AgentC
AgentB → AgentC
AgentB → AgentD
...
# n agents = up to n*(n-1)/2 direct connections
# One agent failure = propagates to all connected agents
# Kafka-based (publish-subscribe)
AgentA → [Kafka topic: price-events] → AgentB, C, D (subscribers)
AgentB → [Kafka topic: analysis-results] → AgentE, F (subscribers)
# Agents only need to know the topic. They don't need to know each other.
# One agent failure = only that agent is affected

Publish-Subscribe (pub/sub): A messaging pattern where senders (publishers) and receivers (subscribers) don't need to know each other directly. Publishers drop messages onto a topic; subscribers pick up only the topics they care about.
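The connection-count arithmetic in the comments above is easy to verify. A minimal sketch (the function names are mine, not from any library):

```python
def p2p_max_connections(n: int) -> int:
    """Point-to-point: every agent may need a link to every other agent."""
    return n * (n - 1) // 2

def pubsub_connections(n: int) -> int:
    """Pub/sub: each agent holds a single connection to the broker."""
    return n

# At 3 agents the difference is negligible; at 50 it is not.
print(p2p_max_connections(3), pubsub_connections(3))    # 3 vs 3
print(p2p_max_connections(50), pubsub_connections(50))  # 1225 vs 50
```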
Kafka changes three things. First, agents don't need to hold direct references to each other (loose coupling). Second, every message is durably recorded in the Kafka log, enabling auditing and replay. Third, agents and partitions can be scaled out independently and horizontally.
The Orchestrator-Worker Pattern on Kafka
The most representative pattern for composing agents on top of Kafka is orchestrator-worker.
┌──────────────────────────────────────────────────────┐
│ Kafka Topics │
│ │
│ [task-requests] [task-results] [dead-letter] │
└──────────┬──────────────┬─────────────────────────────┘
│ │
┌──────┴──────┐ │
│ Orchestrator│◄──────┘
│ Agent │ Collects results and
│ (A2A deleg.)│ decides next step
└──────┬──────┘
│ Publishes tasks
┌──────┴──────────────────────┐
│ │
┌───┴────┐ ┌────────┐ ┌────────┴┐
│Worker A│ │Worker B│ │Worker C │
│(analyze│ │(pricing│ │(notify) │
└────────┘ └────────┘ └─────────┘

The orchestrator publishes tasks to the task-requests topic, and workers subscribe to the task types they can handle. Task assignment happens naturally through Kafka consumer group partition assignment, so the orchestrator never needs to address individual workers by name. Results are published back to the task-results topic, and the orchestrator aggregates them to decide the next step.
The most common question I get in practice is "what happens if some workers fail?" — and Kafka's offset management solves this. If a worker fails to process a message, it doesn't commit the offset, so the message becomes eligible for reprocessing. Messages that fail repeatedly can be routed to the dead-letter topic to quarantine them without blocking the overall flow.
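The retry-then-quarantine decision can be captured in a small routing function. This is a sketch of the decision logic only; the function name and the retry budget are assumptions, not part of Kafka itself:

```python
MAX_ATTEMPTS = 3  # assumed retry budget before quarantine

def route_failed_message(topic: str, attempts: int) -> str:
    """Return the topic a failed message should be re-published to."""
    if attempts < MAX_ATTEMPTS:
        return topic          # retry on the original topic
    return "dead-letter"      # quarantine after repeated failures

# Early failures are retried; the third failure is quarantined.
print(route_failed_message("task-requests", 1))  # task-requests
print(route_failed_message("task-requests", 3))  # dead-letter
```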
A2A Agent Card: How Agents Discover Each Other
One of the central concepts in A2A is the Agent Card — a JSON document in which an agent declares its capabilities, served at /.well-known/agent.json.
{
  "name": "PriceAnalysisAgent",
  "description": "Analyzes competitor pricing data and generates price adjustment recommendations",
  "url": "http://price-analysis-agent/",
  "capabilities": {
    "streaming": false,
    "pushNotifications": true
  },
  "skills": [
    {
      "id": "analyze-price-gap",
      "name": "Price Gap Analysis",
      "description": "Accepts a product ID and competitor price, returns adjustment recommendations",
      "inputModes": ["text"],
      "outputModes": ["text"]
    }
  ]
}

At runtime, the orchestrator fetches this Agent Card to learn "which agent can handle which tasks" and delegates via A2A Tasks. The key advantage is that agents can be added or replaced without any changes to orchestrator code.
Real-World Applications
Example 1: E-Commerce Real-Time Dynamic Pricing
This is a real pattern I found compelling. It's a system that automatically responds when a competitor's price changes — and it's a great demonstration of how naturally the Kafka + A2A combination fits together.
# 1. Price Collector Agent (Kafka Producer)
# Scrapes competitor website via MCP, then publishes event
from confluent_kafka import Producer
import json
import uuid

class PriceCollectorAgent:
    def __init__(self):
        self.producer = Producer({'bootstrap.servers': 'kafka:9092'})
        # MCPClient is a pseudocode stand-in for an MCP client session
        self.mcp_client = MCPClient("scraping-server")

    async def collect_and_publish(self, competitor_url: str):
        # MCP tool call: scrape competitor price
        price_data = await self.mcp_client.call_tool(
            "scrape_price",
            {"url": competitor_url}
        )
        event = {
            "event_type": "competitor_price_changed",
            "competitor": competitor_url,
            "new_price": price_data["price"],
            "product_id": price_data["product_id"],
            "timestamp": price_data["timestamp"],
            "correlation_id": str(uuid.uuid4())  # for distributed tracing — always include from the start
        }
        self.producer.produce(
            topic="ecommerce.price-changed.v1",
            key=price_data["product_id"],
            value=json.dumps(event)
        )
        self.producer.flush()

# 2. Price Analysis Agent — A2A Task Delegation
# Note: the following is pseudocode expressing the A2A protocol flow.
# Actual implementation requires the google-a2a reference SDK or direct use of httpx.
import asyncio
import json
import httpx
from confluent_kafka import Consumer

class PriceAnalysisAgent:
    def __init__(self):
        self.consumer = Consumer({
            'bootstrap.servers': 'kafka:9092',
            'group.id': 'price-analysis-group',
            'enable.auto.commit': False,  # commit manually, only after successful processing
            'auto.offset.reset': 'earliest'
        })
        self.consumer.subscribe(['ecommerce.price-changed.v1'])
        # Also accesses internal inventory DB via MCP
        self.mcp_client = MCPClient("inventory-db-server")

    async def process(self):
        loop = asyncio.get_event_loop()
        while True:
            # confluent_kafka poll() is synchronous,
            # so wrap in executor to avoid blocking the asyncio event loop
            msg = await loop.run_in_executor(None, self.consumer.poll, 1.0)
            if msg is None or msg.error():
                continue
            event = json.loads(msg.value())
            # Check internal inventory status via MCP
            inventory = await self.mcp_client.call_tool(
                "get_inventory", {"product_id": event["product_id"]}
            )
            should_adjust = self.analyze_price_gap(event, inventory)
            if should_adjust:
                # A2A: fetch Agent Card, then delegate to order update agent
                async with httpx.AsyncClient() as client:
                    await client.post(
                        "http://order-update-agent/tasks/send",
                        json={
                            "message": {
                                "parts": [{
                                    "text": f"Update price for {event['product_id']} to {event['new_price'] * 0.99}"
                                }]
                            },
                            "metadata": {"correlation_id": event["correlation_id"]}
                        }
                    )
            # Commit offset after successful processing
            self.consumer.commit()

| Step | Agent | Role | Communication |
|---|---|---|---|
| 1 | PriceCollector | Collect competitor prices | MCP (scraping tool) → Kafka publish |
| 2 | PriceAnalysis | Check inventory + analyze price gap | MCP (DB query) + Kafka subscribe → A2A delegate |
| 3 | OrderUpdate | Update order price | A2A receive → Kafka publish |
| 4 | CustomerNotify | Send customer notification | Kafka subscribe → MCP (notification tool) |
Example 2: Manufacturing IoT Predictive Maintenance
This is the clearest example of why Flink belongs in this stack. Looking at a single sensor reading tells you nothing about whether something is wrong. You need to aggregate values across multiple sensors over a time window before a pattern becomes visible.
The code below uses the Flink Java API. Flink's Java/Scala APIs are more mature and widely used in production; if you prefer Python, the same logic can be expressed with PyFlink.
// Flink Java: detect anomaly patterns from IoT sensor stream
// Publishes an event when average vibration over a 10-minute window exceeds threshold
DataStream<SensorReading> sensorStream = env
    .addSource(new FlinkKafkaConsumer<>(
        "manufacturing.sensor-reading.v1",
        new SensorReadingDeserializer(),
        kafkaProps
    ));

DataStream<MaintenanceAlert> alerts = sensorStream
    .keyBy(SensorReading::getMachineId)
    // 10-minute sliding window, updated every 1 minute
    .window(SlidingEventTimeWindows.of(
        Time.minutes(10), Time.minutes(1)
    ))
    .aggregate(new VibrationAnomalyDetector())
    .filter(alert -> alert.getSeverity() > THRESHOLD);

// Publish anomaly detection events to Kafka topic
// Diagnostics agent subscribes and delegates to maintenance agent via A2A
alerts.addSink(new FlinkKafkaProducer<>(
    "agent.task-requests.maintenance",
    new AlertSerializer(),
    kafkaProps
));

The full flow looks like this:
IoT Sensors
  → Kafka(manufacturing.sensor-reading.v1)
  → Flink(10-minute window anomaly detection)
  → Kafka(agent.task-requests.maintenance)
  → Diagnostics Agent
  → A2A
  → Maintenance Scheduling Agent

Without Flink, the diagnostics agent would have to process the raw sensor stream directly. Feeding thousands of raw data points per second to an LLM-based agent is untenable in both token cost and latency. With Flink handling the filtering first, the agent only needs to process one refined event: "vibration anomaly detected within the last 10 minutes."
Kafka Topic Design — The "Contract" of the Agent Ecosystem
Topic structure design is the part practitioners think hardest about in practice. It's difficult to change later, so it pays to get it right from the start. On my first project, I kept topic names too simple and ran into trouble when event types multiplied.
# Recommended topic naming pattern: {domain}.{event-type}.{version}
topics:
  # Raw events (published by agents)
  - ecommerce.price-changed.v1
  - ecommerce.order-updated.v1
  - manufacturing.sensor-reading.v1
  # Inter-agent task delegation
  - agent.task-requests.pricing
  - agent.task-results.pricing
  - agent.task-requests.maintenance
  - agent.dead-letter.v1   # isolates repeatedly failing messages
  # Notifications and actions
  - notification.customer-alert.v1
  - action.price-adjustment.v1

# Register Avro schemas in Schema Registry
# Ensures backward compatibility when agent message formats change
schemas:
  price-changed:
    type: record
    fields:
      - name: product_id
        type: string
      - name: old_price
        type: double
      - name: new_price
        type: double
      - name: source_agent
        type: string
      - name: correlation_id   # for distributed tracing — always include from the start
        type: string

Schema Registry: A component that centrally manages versioning of the message formats (schemas) exchanged between agents. It prevents situations where Agent B suddenly breaks when Agent A changes its message structure. Confluent Schema Registry is the most widely used implementation.
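The {domain}.{event-type}.{version} convention is easy to enforce in CI with a small check. A sketch, applied only to versioned event topics; the regex is my own reading of the convention, not part of any spec:

```python
import re

# lowercase hyphenated words: domain.event-type.vN
EVENT_TOPIC = re.compile(r"^[a-z]+(?:-[a-z]+)*\.[a-z]+(?:-[a-z]+)*\.v\d+$")

def is_valid_event_topic(name: str) -> bool:
    """Check a topic name against the {domain}.{event-type}.{version} pattern."""
    return EVENT_TOPIC.fullmatch(name) is not None

print(is_valid_event_topic("ecommerce.price-changed.v1"))  # True
print(is_valid_event_topic("PriceChanged"))                # False
```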
Trade-off Analysis
The most common question I receive is "isn't the operational overhead too high?" — and honestly, that's not wrong. Here's a breakdown of when and how much of that overhead is worth accepting.
Advantages
| Item | Details |
|---|---|
| Async decoupling | Agents don't call each other directly, so a failure in one agent doesn't propagate to the whole system |
| Horizontal scaling | Agents, Kafka partitions, and Flink jobs can each be scaled out independently |
| Event reproducibility | All agent actions are recorded in the Kafka log, enabling auditing, debugging, and replay |
| Real-time responsiveness | Flink processes event streams at millisecond granularity, supporting immediate agent reactions |
| Vendor neutrality | A2A standard enables collaboration across heterogeneous frameworks — LangChain, CrewAI, SAP, Salesforce, and more |
| Long-running workflows | Flink stateful processing supports agent workflows spanning hours to days |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Operational complexity | Requires running a Kafka cluster, Schema Registry, and Flink cluster simultaneously | Offload cluster management with Confluent Cloud. Small-scale starts are fine with a single broker (replication.factor=1) |
| Protocol maturity | A2A launched in April 2025; native MCP/A2A support in Flink is still maturing | Adopt only on critical paths first; add one abstraction layer to buffer against Agent Card spec changes |
| Latency trade-off | The async nature makes it unsuitable for sub-millisecond requirements like HFT | Keep external API boundaries that strictly need synchronous responses on gRPC/HTTP direct calls |
| Debugging difficulty | Tracing root causes across multiple asynchronous agents is complex | Mandatory adoption of OpenTelemetry + correlation_id. Visualize topic flow with Kafka UI or Confluent Control Center |
| Over-engineering risk | Introducing Kafka into a small agent system only adds complexity | Evaluate only when you have 10+ agents or thousands of events per second. HTTP is sufficient below that threshold |
| Security governance | Requires trust verification between agents, message source authentication, and permission scope control | Use Kafka ACLs + mTLS to control publish/subscribe permissions. Validate A2A Agent Cards to ensure delegation trust |
mTLS (Mutual TLS): An approach where both client and server prove their identity with certificates. Used to guarantee "did this message actually come from Agent A?" when Agent A publishes to Kafka.
The Most Common Mistakes in Practice
- Trying to replace all agent communication with Kafka: On my first project, I made API endpoints that needed to respond to user requests in real time async, and it broke the UX. Kafka is well-suited for internal inter-agent communication; the boundary with user interfaces is often more naturally synchronous.
- Starting without a correlation_id: Retrofitting distributed tracing later means modifying every existing message schema. Without it, tracing "which agent processed this message in what order" becomes nearly impossible — strongly recommended to include in every event from day one.
- Making topics too granular or too coarse: Putting all agent messages into a single agent-communication topic forces every subscriber to receive irrelevant data. Conversely, creating a topic per message type becomes unmanageable. The balanced approach is to segment by domain and use Schema Registry to distinguish message types.
Closing Thoughts
Kafka is the infrastructure layer that lets AI agents collaborate without needing to know each other, and together with A2A and MCP, it completes a structure in which tens to hundreds of agents can operate at scale.
There are three steps you can take right now to get started.
- Start by setting up a local environment. Spinning up Kafka + Schema Registry + Kafka UI with docker-compose is the fastest entry point. Using Confluent's official docker-compose example as a base, docker-compose up -d gets you a working environment in under five minutes.
- Attach an MCP server to one of your existing agents. Using Anthropic's official MCP Python SDK (pip install mcp), wrap an existing API call as an MCP tool and invoke it from a LangChain or CrewAI agent. That hands-on experience is the fastest way to make the concept concrete.
- Insert a Kafka topic between two agents. Change a direct A→B call to A→Kafka→B, then force-kill B and watch the message survive — picked up again when B restarts. The moment you see that in action, you feel in your bones why this architecture matters.
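Step 1's local stack can be sketched as a single-broker compose file. The image names are real, but versions and environment values here are illustrative only; Confluent's official example is more complete and should be preferred:

```yaml
# Single-broker KRaft setup for local experiments only (replication.factor=1)
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports: ["9092:9092"]
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:29093
      KAFKA_LISTENERS: PLAINTEXT://kafka:29092,CONTROLLER://kafka:29093,PLAINTEXT_HOST://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk   # any valid base64 cluster id
  schema-registry:
    image: confluentinc/cp-schema-registry:7.6.0
    depends_on: [kafka]
    ports: ["8081:8081"]
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:29092
  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    depends_on: [kafka]
    ports: ["8080:8080"]
    environment:
      KAFKA_CLUSTERS_0_BOOTSTRAP_SERVERS: kafka:29092
```

With this running, the A→Kafka→B experiment from step 3 needs nothing beyond pip install confluent-kafka.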
References
- Agentic AI with the Agent2Agent Protocol (A2A) and MCP using Apache Kafka as Event Broker | Kai Waehner
- Why Google's Agent2Agent Protocol Needs Apache Kafka | Confluent Blog
- A2A, MCP, Kafka and Flink: The New Stack for AI Agents | The New Stack
- How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time | Kai Waehner
- The Future of AI Agents Is Event-Driven | Confluent Blog
- How Kafka improves agentic AI | Red Hat Developer