Apache Kafka as an AI Agent Event Broker — Scaling MCP·A2A Multi-Agent Systems
Most people start with HTTP when first setting up a multi-agent system. Agent A calls B directly, B calls C, and so on. With the first two or three agents, things work pretty well. But once you go past ten, things start to feel off. Timeouts cascade, failures become impossible to trace, and every time you add a new agent you have to touch existing code. I went through that myself, and that's when it clicked — inter-agent communication carries exactly the same problems as microservices. If you're dealing with ten or more agents, or a scale that demands thousands of events per second, this post should be useful.
When Google open-sourced the A2A protocol in April 2025, this stack began to genuinely converge. Over 50 partners — Confluent, LangChain, CrewAI, SAP, Salesforce, and more — immediately announced support, and alongside Anthropic's MCP, Apache Kafka is establishing itself as the event broker for the agent ecosystem. This post explores why Apache Kafka is well-suited for that role and what architectural patterns it enables together with MCP and A2A, grounded in real-world examples.
Core Concepts
Separation of Concerns Across Four Layers
When you first encounter this stack, MCP and A2A can look similar enough to be confusing. I had the same reaction at first — "aren't they both just agent communication?" — but their roles are clearly distinct.
| Layer | Protocol/Technology | Role | Defines |
|---|---|---|---|
| Tool Access | MCP (Anthropic) | Agent ↔ External World | How agents access APIs, DBs, and files |
| Agent Collaboration | A2A (Google) | Agent ↔ Agent | How agents discover each other and delegate tasks |
| Message Delivery | Apache Kafka | Async message flow | How messages are delivered reliably |
| Stream Processing | Apache Flink | Real-time data transformation | How event streams are aggregated and transformed into agent inputs |
In one sentence: A2A defines the language agents speak, MCP standardizes how agents access tools, Kafka guarantees message flow, and Flink processes that flow.
MCP (Model Context Protocol): A protocol that standardizes how agents interact with external tools and data sources — think USB-C for agents. It solves the problem of frameworks like LangChain and CrewAI each connecting tools in their own incompatible ways.
A2A (Agent-to-Agent Protocol): An open protocol published by Google that enables agents built with different vendors or frameworks to collaborate. Agents advertise their capabilities via an Agent Card and delegate tasks through the Task API.
Apache Flink: An engine that processes, transforms, and aggregates Kafka streams in real time. LLM-based agents can't directly handle thousands of raw sensor or transaction events per second — the token cost and latency don't add up. Flink takes that burden so each agent only needs to receive a single, refined event.
Why HTTP Breaks Down at Agent Scale
The question I hear most often is: "Why Kafka at all? Isn't HTTP/REST enough?" Honestly, for small scale, HTTP is much simpler. But as the number of agents grows, HTTP-based architecture reproduces every spaghetti problem from microservices.
# HTTP-based (point-to-point)
AgentA → AgentB
AgentA → AgentC
AgentB → AgentC
AgentB → AgentD
...
# n agents = up to n*(n-1)/2 direct connections
# One agent failure = propagates to all connected agents
# Kafka-based (publish-subscribe)
AgentA → [Kafka topic: price-events] → AgentB, C, D (subscribers)
AgentB → [Kafka topic: analysis-results] → AgentE, F (subscribers)
# Agents only need to know the topic. They don't need to know each other.
# One agent failure = only that agent is affected

Publish-Subscribe (pub/sub): A messaging pattern where senders (publishers) and receivers (subscribers) don't need to know each other directly. Publishers drop messages onto a topic; subscribers pick up only the topics they care about.
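The connection-count arithmetic in the comments above is easy to verify. A minimal sketch (the function names are mine, not from any library):

```python
def p2p_max_connections(n: int) -> int:
    """Point-to-point: every agent may need a link to every other agent."""
    return n * (n - 1) // 2

def pubsub_connections(n: int) -> int:
    """Pub/sub: each agent holds a single connection to the broker."""
    return n

# At 3 agents the difference is negligible; at 50 it is not.
print(p2p_max_connections(3), pubsub_connections(3))    # 3 vs 3
print(p2p_max_connections(50), pubsub_connections(50))  # 1225 vs 50
```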
Kafka changes three things. First, agents don't need to hold direct references to each other (loose coupling). Second, every message is durably recorded in the Kafka log, enabling auditing and replay. Third, agents and partitions can be scaled out independently and horizontally.
The Orchestrator-Worker Pattern on Kafka
The most representative pattern for composing agents on top of Kafka is orchestrator-worker.
┌──────────────────────────────────────────────────────┐
│ Kafka Topics │
│ │
│ [task-requests] [task-results] [dead-letter] │
└──────────┬──────────────┬─────────────────────────────┘
│ │
┌──────┴──────┐ │
│ Orchestrator│◄──────┘
│ Agent │ Collects results and
│ (A2A deleg.)│ decides next step
└──────┬──────┘
│ Publishes tasks
┌──────┴──────────────────────┐
│ │
┌───┴────┐ ┌────────┐ ┌────────┴┐
│Worker A│ │Worker B│ │Worker C │
│(analyze│ │(pricing│ │(notify) │
└────────┘ └────────┘ └─────────┘

The orchestrator publishes tasks to the task-requests topic, and workers subscribe to the task types they can handle. Task assignment happens naturally through Kafka consumer group partition assignment, so the orchestrator never needs to address individual workers by name. Results are published back to the task-results topic, and the orchestrator aggregates them to decide the next step.
The most common question I get in practice is "what happens if some workers fail?" — and Kafka's offset management solves this. If a worker fails to process a message, it doesn't commit the offset, so the message becomes eligible for reprocessing. Messages that fail repeatedly can be routed to the dead-letter topic to quarantine them without blocking the overall flow.
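The retry-then-quarantine decision can be captured in a small routing function. This is a sketch of the decision logic only; the function name and the retry budget are assumptions, not part of Kafka itself:

```python
MAX_ATTEMPTS = 3  # assumed retry budget before quarantine

def route_failed_message(topic: str, attempts: int) -> str:
    """Return the topic a failed message should be re-published to."""
    if attempts < MAX_ATTEMPTS:
        return topic          # retry on the original topic
    return "dead-letter"      # quarantine after repeated failures

# Early failures are retried; the third failure is quarantined.
print(route_failed_message("task-requests", 1))  # task-requests
print(route_failed_message("task-requests", 3))  # dead-letter
```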
A2A Agent Card: How Agents Discover Each Other
One of the central concepts in A2A is the Agent Card — a JSON document in which an agent declares its capabilities, served at /.well-known/agent.json.
{
  "name": "PriceAnalysisAgent",
  "description": "Analyzes competitor pricing data and generates price adjustment recommendations",
  "url": "http://price-analysis-agent/",
  "capabilities": {
    "streaming": false,
    "pushNotifications": true
  },
  "skills": [
    {
      "id": "analyze-price-gap",
      "name": "Price Gap Analysis",
      "description": "Accepts a product ID and competitor price, returns adjustment recommendations",
      "inputModes": ["text"],
      "outputModes": ["text"]
    }
  ]
}

At runtime, the orchestrator fetches this Agent Card to learn "which agent can handle which tasks" and delegates via A2A Tasks. The key advantage is that agents can be added or replaced without any changes to orchestrator code.
Real-World Applications
Example 1: E-Commerce Real-Time Dynamic Pricing
This is a real pattern I found compelling. It's a system that automatically responds when a competitor's price changes — and it's a great demonstration of how naturally the Kafka + A2A combination fits together.
# 1. Price Collector Agent (Kafka Producer)
# Scrapes competitor website via MCP, then publishes event
from confluent_kafka import Producer
import json
import uuid

class PriceCollectorAgent:
    def __init__(self):
        self.producer = Producer({'bootstrap.servers': 'kafka:9092'})
        # MCPClient is a pseudocode stand-in for an MCP client session
        self.mcp_client = MCPClient("scraping-server")

    async def collect_and_publish(self, competitor_url: str):
        # MCP tool call: scrape competitor price
        price_data = await self.mcp_client.call_tool(
            "scrape_price",
            {"url": competitor_url}
        )
        event = {
            "event_type": "competitor_price_changed",
            "competitor": competitor_url,
            "new_price": price_data["price"],
            "product_id": price_data["product_id"],
            "timestamp": price_data["timestamp"],
            "correlation_id": str(uuid.uuid4())  # for distributed tracing — always include from the start
        }
        self.producer.produce(
            topic="ecommerce.price-changed.v1",
            key=price_data["product_id"],
            value=json.dumps(event)
        )
        self.producer.flush()

# 2. Price Analysis Agent — A2A Task Delegation
# Note: the following is pseudocode expressing the A2A protocol flow.
# Actual implementation requires the google-a2a reference SDK or direct use of httpx.
import asyncio
import json
import httpx
from confluent_kafka import Consumer

class PriceAnalysisAgent:
    def __init__(self):
        self.consumer = Consumer({
            'bootstrap.servers': 'kafka:9092',
            'group.id': 'price-analysis-group',
            'enable.auto.commit': False,  # commit manually, only after successful processing
            'auto.offset.reset': 'earliest'
        })
        self.consumer.subscribe(['ecommerce.price-changed.v1'])
        # Also accesses internal inventory DB via MCP
        self.mcp_client = MCPClient("inventory-db-server")

    async def process(self):
        loop = asyncio.get_event_loop()
        while True:
            # confluent_kafka poll() is synchronous,
            # so wrap in executor to avoid blocking the asyncio event loop
            msg = await loop.run_in_executor(None, self.consumer.poll, 1.0)
            if msg is None or msg.error():
                continue
            event = json.loads(msg.value())
            # Check internal inventory status via MCP
            inventory = await self.mcp_client.call_tool(
                "get_inventory", {"product_id": event["product_id"]}
            )
            should_adjust = self.analyze_price_gap(event, inventory)
            if should_adjust:
                # A2A: fetch Agent Card, then delegate to order update agent
                async with httpx.AsyncClient() as client:
                    await client.post(
                        "http://order-update-agent/tasks/send",
                        json={
                            "message": {
                                "parts": [{
                                    "text": f"Update price for {event['product_id']} to {event['new_price'] * 0.99}"
                                }]
                            },
                            "metadata": {"correlation_id": event["correlation_id"]}
                        }
                    )
            # Commit offset after successful processing
            self.consumer.commit()

| Step | Agent | Role | Communication |
|---|---|---|---|
| 1 | PriceCollector | Collect competitor prices | MCP (scraping tool) → Kafka publish |
| 2 | PriceAnalysis | Check inventory + analyze price gap | MCP (DB query) + Kafka subscribe → A2A delegate |
| 3 | OrderUpdate | Update order price | A2A receive → Kafka publish |
| 4 | CustomerNotify | Send customer notification | Kafka subscribe → MCP (notification tool) |
Example 2: Manufacturing IoT Predictive Maintenance
This is the clearest example of why Flink belongs in this stack. Looking at a single sensor reading tells you nothing about whether something is wrong. You need to aggregate values across multiple sensors over a time window before a pattern becomes visible.
The code below uses the Flink Java API. Flink's Java/Scala APIs are more mature and widely used in production; if you prefer Python, the same logic can be expressed with PyFlink.
// Flink Java: detect anomaly patterns from IoT sensor stream
// Publishes an event when average vibration over a 10-minute window exceeds threshold
DataStream<SensorReading> sensorStream = env
    .addSource(new FlinkKafkaConsumer<>(
        "manufacturing.sensor-reading.v1",
        new SensorReadingDeserializer(),
        kafkaProps
    ));

DataStream<MaintenanceAlert> alerts = sensorStream
    .keyBy(SensorReading::getMachineId)
    // 10-minute sliding window, updated every 1 minute
    .window(SlidingEventTimeWindows.of(
        Time.minutes(10), Time.minutes(1)
    ))
    .aggregate(new VibrationAnomalyDetector())
    .filter(alert -> alert.getSeverity() > THRESHOLD);

// Publish anomaly detection events to Kafka topic
// Diagnostics agent subscribes and delegates to maintenance agent via A2A
alerts.addSink(new FlinkKafkaProducer<>(
    "agent.task-requests.maintenance",
    new AlertSerializer(),
    kafkaProps
));

The full flow looks like this:
IoT Sensors
  → Kafka(manufacturing.sensor-reading.v1)
  → Flink(10-minute window anomaly detection)
  → Kafka(agent.task-requests.maintenance)
  → Diagnostics Agent
  → A2A
  → Maintenance Scheduling Agent

Without Flink, the diagnostics agent would have to process the raw sensor stream directly. Feeding thousands of raw data points per second to an LLM-based agent is untenable in both token cost and latency. With Flink handling the filtering first, the agent only needs to process one refined event: "vibration anomaly detected within the last 10 minutes."
Kafka Topic Design — The "Contract" of the Agent Ecosystem
Topic structure design is the part practitioners think hardest about in practice. It's difficult to change later, so it pays to get it right from the start. On my first project, I kept topic names too simple and ran into trouble when event types multiplied.
# Recommended topic naming pattern: {domain}.{event-type}.{version}
topics:
  # Raw events (published by agents)
  - ecommerce.price-changed.v1
  - ecommerce.order-updated.v1
  - manufacturing.sensor-reading.v1
  # Inter-agent task delegation
  - agent.task-requests.pricing
  - agent.task-results.pricing
  - agent.task-requests.maintenance
  - agent.dead-letter.v1   # isolates repeatedly failing messages
  # Notifications and actions
  - notification.customer-alert.v1
  - action.price-adjustment.v1

# Register Avro schemas in Schema Registry
# Ensures backward compatibility when agent message formats change
schemas:
  price-changed:
    type: record
    fields:
      - name: product_id
        type: string
      - name: old_price
        type: double
      - name: new_price
        type: double
      - name: source_agent
        type: string
      - name: correlation_id   # for distributed tracing — always include from the start
        type: string

Schema Registry: A component that centrally manages versioning of the message formats (schemas) exchanged between agents. It prevents situations where Agent B suddenly breaks when Agent A changes its message structure. Confluent Schema Registry is the most widely used implementation.
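The {domain}.{event-type}.{version} convention is easy to enforce in CI with a small check. A sketch, applied only to versioned event topics; the regex is my own reading of the convention, not part of any spec:

```python
import re

# lowercase hyphenated words: domain.event-type.vN
EVENT_TOPIC = re.compile(r"^[a-z]+(?:-[a-z]+)*\.[a-z]+(?:-[a-z]+)*\.v\d+$")

def is_valid_event_topic(name: str) -> bool:
    """Check a topic name against the {domain}.{event-type}.{version} pattern."""
    return EVENT_TOPIC.fullmatch(name) is not None

print(is_valid_event_topic("ecommerce.price-changed.v1"))  # True
print(is_valid_event_topic("PriceChanged"))                # False
```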
Trade-off Analysis
The most common question I receive is "isn't the operational overhead too high?" — and honestly, that's not wrong. Here's a breakdown of when and how much of that overhead is worth accepting.
Advantages
| Item | Details |
|---|---|
| Async decoupling | Agents don't call each other directly, so a failure in one agent doesn't propagate to the whole system |
| Horizontal scaling | Agents, Kafka partitions, and Flink jobs can each be scaled out independently |
| Event reproducibility | All agent actions are recorded in the Kafka log, enabling auditing, debugging, and replay |
| Real-time responsiveness | Flink processes event streams at millisecond granularity, supporting immediate agent reactions |
| Vendor neutrality | A2A standard enables collaboration across heterogeneous frameworks — LangChain, CrewAI, SAP, Salesforce, and more |
| Long-running workflows | Flink stateful processing supports agent workflows spanning hours to days |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Operational complexity | Requires running a Kafka cluster, Schema Registry, and Flink cluster simultaneously | Offload cluster management with Confluent Cloud. Small-scale starts are fine with a single broker (replication.factor=1) |
| Protocol maturity | A2A launched in April 2025; native MCP/A2A support in Flink is still maturing | Adopt only on critical paths first; add one abstraction layer to buffer against Agent Card spec changes |
| Latency trade-off | The async nature makes it unsuitable for sub-millisecond requirements like HFT | Keep external API boundaries that strictly need synchronous responses on gRPC/HTTP direct calls |
| Debugging difficulty | Tracing root causes across multiple asynchronous agents is complex | Mandatory adoption of OpenTelemetry + correlation_id. Visualize topic flow with Kafka UI or Confluent Control Center |
| Over-engineering risk | Introducing Kafka into a small agent system only adds complexity | Evaluate only when you have 10+ agents or thousands of events per second. HTTP is sufficient below that threshold |
| Security governance | Requires trust verification between agents, message source authentication, and permission scope control | Use Kafka ACLs + mTLS to control publish/subscribe permissions. Validate A2A Agent Cards to ensure delegation trust |
mTLS (Mutual TLS): An approach where both client and server prove their identity with certificates. Used to guarantee "did this message actually come from Agent A?" when Agent A publishes to Kafka.
The Most Common Mistakes in Practice
- Trying to replace all agent communication with Kafka: On my first project, I made API endpoints that needed to respond to user requests in real time async, and it broke the UX. Kafka is well-suited for internal inter-agent communication; the boundary with user interfaces is often more naturally synchronous.
- Starting without a correlation_id: Retrofitting distributed tracing later means modifying every existing message schema. Without it, tracing "which agent processed this message in what order" becomes nearly impossible — strongly recommended to include in every event from day one.
- Making topics too granular or too coarse: Putting all agent messages into a single agent-communication topic forces every subscriber to receive irrelevant data. Conversely, creating a topic per message type becomes unmanageable. The balanced approach is to segment by domain and use Schema Registry to distinguish message types.
Closing Thoughts
Kafka is the infrastructure layer that lets AI agents collaborate without needing to know each other, and together with A2A and MCP, it completes a structure in which tens to hundreds of agents can operate at scale.
There are three steps you can take right now to get started.
- Start by setting up a local environment. Spinning up Kafka + Schema Registry + Kafka UI with docker-compose is the fastest entry point. Using Confluent's official docker-compose example as a base, docker-compose up -d gets you a working environment in under five minutes.
- Attach an MCP server to one of your existing agents. Using Anthropic's official MCP Python SDK (pip install mcp), wrap an existing API call as an MCP tool and invoke it from a LangChain or CrewAI agent. That hands-on experience is the fastest way to make the concept concrete.
- Insert a Kafka topic between two agents. Change a direct A→B call to A→Kafka→B, then force-kill B and watch the message survive — picked up again when B restarts. The moment you see that in action, you feel in your bones why this architecture matters.
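Step 1's local stack can be sketched as a single-broker compose file. The image names are real, but versions and environment values here are illustrative only; Confluent's official example is more complete and should be preferred:

```yaml
# Single-broker KRaft setup for local experiments only (replication.factor=1)
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports: ["9092:9092"]
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:29093
      KAFKA_LISTENERS: PLAINTEXT://kafka:29092,CONTROLLER://kafka:29093,PLAINTEXT_HOST://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk   # any valid base64 cluster id
  schema-registry:
    image: confluentinc/cp-schema-registry:7.6.0
    depends_on: [kafka]
    ports: ["8081:8081"]
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:29092
  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    depends_on: [kafka]
    ports: ["8080:8080"]
    environment:
      KAFKA_CLUSTERS_0_BOOTSTRAP_SERVERS: kafka:29092
```

With this running, the A→Kafka→B experiment from step 3 needs nothing beyond pip install confluent-kafka.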
References
- Agentic AI with the Agent2Agent Protocol (A2A) and MCP using Apache Kafka as Event Broker | Kai Waehner
- Why Google's Agent2Agent Protocol Needs Apache Kafka | Confluent Blog
- A2A, MCP, Kafka and Flink: The New Stack for AI Agents | The New Stack
- How Apache Kafka and Flink Power Event-Driven Agentic AI in Real Time | Kai Waehner
- The Future of AI Agents Is Event-Driven | Confluent Blog
- How Kafka improves agentic AI | Red Hat Developer