To Keep EDA from Becoming a Distributed Monolith — Outbox, Saga, CQRS Core Patterns and 3 Production Pitfalls
When I first introduced Kafka (a high-throughput distributed event streaming platform), I thought "just send the event once and you're done." Then duplicate events caused an incident where inventory was deducted twice; I still vividly remember the embarrassment of finding two items deducted for a single completed order. EDA (Event-Driven Architecture) is simple in concept, but there are patterns you absolutely must know to use it safely in production.
When splitting a monolithic service into microservices, at some point you hit this dilemma: "It's convenient for the order service to call the payment service directly, but won't the order service go down if the payment service slows down?" Coupling services together with synchronous HTTP calls ultimately creates a distributed monolith — a structure that collapses like dominoes when traffic spikes or one service fails. EDA solves this problem with the idea of "making services unaware of each other." The order service simply throws the fact "an order was created" at the broker and that's it.
Applying the four patterns covered in this article (Outbox, Saga, CQRS + Event Sourcing, Event-Carried State Transfer) can prevent incidents like double inventory deduction before they happen. I've compiled what I learned through direct experience, along with three pitfalls commonly encountered in production.
Core Concepts
Events Are 'Facts from the Past,' Not Commands
The most confusing part when first learning EDA is event naming. There's a reason to use OrderCreated instead of CreateOrder.
An event is an immutable fact that has already occurred. A command is a request that may be rejected or fail to execute; an event has already happened. Consumers cannot reject it — they can only react.
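A minimal sketch of the naming distinction in TypeScript (the type shapes below are my own illustration, not from any particular library):

```typescript
// Imperative, present tense: a request the handler is free to reject
interface CreateOrder {
  type: 'CreateOrder';
  userId: string;
  items: Array<{ productId: string; quantity: number }>;
}

// Past tense: a fact that has already been recorded; consumers can only react
interface OrderCreated {
  type: 'OrderCreated';
  orderId: string;
  userId: string;
  occurredAt: string; // ISO-8601 timestamp
}
```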
EDA consists of three components.
| Component | Role | Example |
|---|---|---|
| Event Producer | Publishes events to the broker when a state change occurs | Order service, payment service |
| Event Broker | Central hub that receives, stores, and delivers events | Kafka, RabbitMQ, EventBridge |
| Event Consumer | Subscribes to events of interest and executes business logic | Notification service, settlement service |
Pub/Sub vs. Event Streaming — What's the Difference?
These two models are often confused, but the core difference is "when you read the event."
| Model | How It Works | Best Suited For |
|---|---|---|
| Pub/Sub | Broker delivers immediately to subscribers. Disappears after being read | Real-time notifications, one-way fan-out |
| Event Streaming | Consumers read directly from any desired offset in the log | Reprocessing, auditing, Event Sourcing |
This is why Kafka is so powerful. Because events are retained on disk for as long as the retention policy allows (indefinitely, if configured that way), new services can be attached later and past events can be reprocessed from the beginning. This is impossible with the classic Pub/Sub model.
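For example, with kafkajs (broker address, topic, and group ID are placeholder assumptions), a brand-new consumer group subscribed with `fromBeginning: true` starts at offset 0 and replays the topic's entire history:

```typescript
// Replaying a topic from the beginning with kafkajs. This works because the
// broker retains the log; a classic pub/sub broker would have nothing left to read.
import { Kafka } from 'kafkajs';

async function backfill(): Promise<void> {
  const kafka = new Kafka({ clientId: 'audit-backfill', brokers: ['localhost:9092'] });
  // A new group ID has no committed offsets, so fromBeginning takes effect
  const consumer = kafka.consumer({ groupId: 'audit-backfill-v1' });
  await consumer.connect();
  await consumer.subscribe({ topics: ['OrderCreated'], fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      console.log(partition, message.offset, message.value?.toString());
    },
  });
}

backfill().catch(console.error);
```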
Practical Application
Now that we've looked at the concepts, let's see how they're actually implemented in code across four patterns.
Example 1: Transactional Outbox — Ensuring Atomicity Between DB Storage and Event Publishing
The first hard problem you encounter in practice is: "I need to save data to the DB and publish an event to Kafka — how do I guarantee both succeed?" What if the DB save succeeds but Kafka publishing fails? Or Kafka succeeds but the DB rolls back?
The Transactional Outbox pattern solves this dual write problem.
```
[Service]
│
└──── Single DB Transaction ────┬──► [orders table] (domain data)
└──► [outbox table] (events to publish)
│
[Debezium CDC] ← Detects DB changes in real-time and forwards to Kafka
│
[Kafka Broker]
│
[Notification/Settlement/Inventory Services]
```

The key is binding "business logic execution" and "event recording" within the same DB transaction. Debezium (a CDC tool that captures changes from PostgreSQL, MySQL, etc. in real time) detects outbox table changes and forwards them to Kafka.
```sql
-- outbox table schema
CREATE TABLE outbox (
id UUID PRIMARY KEY,
event_type VARCHAR(255) NOT NULL,
aggregate_id VARCHAR(255) NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
published BOOLEAN DEFAULT FALSE -- tracks publish completion in polling mode
);
```

```typescript
// Order creation service — handled in a single transaction (Knex)
// Assumes `db` is a configured Knex instance and `uuid` is v4 from the 'uuid' package
async function createOrder(orderData: CreateOrderDto): Promise<void> {
await db.transaction(async (trx) => {
const [order] = await trx('orders').insert(orderData).returning('*');
await trx('outbox').insert({
id: uuid(),
event_type: 'OrderCreated',
aggregate_id: order.id,
payload: JSON.stringify({
orderId: order.id,
userId: orderData.userId,
items: orderData.items,
totalAmount: orderData.totalAmount,
createdAt: new Date().toISOString(),
}),
});
});
// After transaction commit, Debezium detects outbox changes and publishes to Kafka
}
```

If you're starting without Debezium, the polling approach below is enough to experience the pattern. The `published` column is only meaningful in this approach (Debezium-based CDC doesn't use it).
```typescript
// Polling-based publisher (Knex) — when starting without Debezium
async function pollAndPublish(): Promise<void> {
const pending = await db('outbox')
.where({ published: false })
.orderBy('created_at', 'asc')
.limit(100);
for (const event of pending) {
await kafkaProducer.send({
topic: event.event_type,
messages: [{ key: event.aggregate_id, value: event.payload }],
});
    // Mark as complete only after a successful publish. If this update fails,
    // the event is re-sent on the next poll (at-least-once), so consumers must be idempotent
await db('outbox').where({ id: event.id }).update({ published: true });
}
}
// Check for unpublished events every second
setInterval(pollAndPublish, 1_000);
```

This is why fintech companies like Stripe and PayPal favor this pattern for payment transactions: inconsistency between payment data and events translates directly into financial loss.
When to use: Any situation where atomicity between the DB and event broker is absolutely required. When to avoid: If event ordering or atomicity isn't critical for simple notification events, direct publishing without Outbox is simpler.
Example 2: Saga Pattern — Chaining Distributed Transactions with Events
In a flow like "order → payment → inventory deduction → shipment preparation," what happens if payment fails in the middle? In a single DB, one rollback handles it, but when each step lives in a different service, it's a different story. The Saga pattern solves this problem with Compensating Transactions.
There are two ways to implement a Saga.
| | Choreography | Orchestration |
|---|---|---|
| Control | Each service decides for itself based on events | Central Orchestrator dictates the order |
| Coupling | Low — only shares event contracts | Medium — depends on the Orchestrator |
| Debugging | Difficult to trace flow | Easier — single entry point |
| Compensation Logic | Each service implements individually | Orchestrator manages centrally |
| Recommended For | Simple linear flows | Complex conditional branches and timeouts |
Honestly speaking, Choreography is clean with 3–4 services, but beyond that, understanding "which events are published where, and who subscribes to them" becomes increasingly difficult. Our team also started with Choreography, but switched to Orchestration once we passed 6 services and flow tracing became too hard. That's why many production systems opt for orchestration engines like Temporal or AWS Step Functions.
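For contrast, here is a minimal choreography sketch: the payment service simply subscribes to OrderCreated and publishes its own outcome events, with no coordinator anywhere. Topic names, the chargeCard call, and the broker address are illustrative assumptions:

```typescript
// Choreography sketch: the payment service reacts to OrderCreated on its own.
// Names and addresses are illustrative, not from a specific system.
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'payment-service', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'payment-service' });
const producer = kafka.producer();

// Hypothetical payment gateway call
async function chargeCard(order: { orderId: string }): Promise<void> {
  /* call the payment gateway here */
}

async function run(): Promise<void> {
  await Promise.all([consumer.connect(), producer.connect()]);
  await consumer.subscribe({ topics: ['OrderCreated'] });
  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;
      const order = JSON.parse(message.value.toString());
      try {
        await chargeCard(order);
        await producer.send({
          topic: 'PaymentConfirmed', // the inventory service subscribes to this
          messages: [{ key: order.orderId, value: JSON.stringify(order) }],
        });
      } catch {
        await producer.send({
          topic: 'PaymentFailed', // upstream services run their own compensation
          messages: [{ key: order.orderId, value: JSON.stringify(order) }],
        });
      }
    },
  });
}

run().catch(console.error);
```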
Temporal is an orchestration engine that lets you express Saga workflows as code. With the TypeScript SDK, it looks like this:
```typescript
// Temporal TypeScript SDK — Order Saga workflow
import { proxyActivities } from '@temporalio/workflow';
import type * as activities from './activities';
const {
processPayment, deductInventory, prepareShipment,
cancelShipment, restoreInventory, refundPayment,
} = proxyActivities<typeof activities>({
startToCloseTimeout: '10 minutes',
});
export async function orderSagaWorkflow(orderId: string): Promise<void> {
  // Compensations are registered only after their step succeeds
  const compensations: Array<() => Promise<void>> = [];
  try {
    await processPayment({ orderId });
    compensations.push(() => refundPayment({ orderId }));
    await deductInventory({ orderId });
    compensations.push(() => restoreInventory({ orderId }));
    await prepareShipment({ orderId });
    compensations.push(() => cancelShipment({ orderId }));
  } catch (error) {
    // On failure, run compensating transactions in reverse order,
    // but only for the steps that actually completed
    for (const compensate of compensations.reverse()) {
      await compensate();
    }
    throw error;
  }
}
```

Temporal's appeal is that it automatically persists workflow execution history. Even if the server restarts mid-flow, it can resume from where it left off. Think of it as "AWS Step Functions, but in code." One detail worth copying: a naive catch block that always runs every compensation would refund a payment that never went through, so each compensation is registered only after its step succeeds.
When to use: When a business transaction spans multiple services. When to avoid: When there are 2 or fewer steps and it's simple, or when strong consistency is required — compensating transactions work by "undoing what already happened" and don't guarantee strict atomicity.
Example 3: CQRS + Event Sourcing — Separating Reads and Writes
When order list queries happen thousands of times per second but order creation happens far less often, using the same DB model is inefficient. CQRS (Command Query Responsibility Segregation) solves this by saying "design the write model and read model completely differently."
```
Command Side (Write)                      Query Side (Read)
────────────────────                      ────────────────────
OrderService                              Projection Handler
     │                                         │
     ▼                                         ▼
Event Store ────── Publish Events ─────► Read Model DB
(full event history)                      (search-optimized view,
                                           e.g., Elasticsearch, Redis)
```

When Event Sourcing is applied together, the full event history becomes the store instead of the current state. To answer "what is the current order status?" you replay the stored events in order and derive it. In the code below, the 'order' argument in `eventStore.getEvents('order', orderId)` names the DDD aggregate type — a concept that groups related events into a single logical unit — so the call retrieves every event belonging to that order aggregate.
```typescript
// Order interface definition
interface Order {
status: string;
trackingNumber?: string;
cancelReason?: string;
[key: string]: unknown;
}
const initialOrder: Order = { status: 'UNKNOWN' };
// Rebuild order state from the Event Store
// `eventStore.getEvents(aggregateType, id)` is an assumed interface returning events in order
async function rebuildOrderState(orderId: string): Promise<Order> {
const events = await eventStore.getEvents('order', orderId);
return events.reduce<Order>((order, event) => {
switch (event.type) {
case 'OrderCreated':
return { ...order, ...event.payload, status: 'CREATED' };
case 'PaymentConfirmed':
return { ...order, status: 'PAID' };
case 'OrderShipped':
return { ...order, status: 'SHIPPED', trackingNumber: event.payload.trackingNumber };
case 'OrderCancelled':
return { ...order, status: 'CANCELLED', cancelReason: event.payload.reason };
default:
return order;
}
}, initialOrder);
}
```

One important point: once tens of thousands of events accumulate, the cost of replaying from the beginning on every query grows dramatically. The Snapshot pattern addresses this. Every fixed number of events (e.g., every 100), the current state is saved as a snapshot, and subsequent replays start from the most recent snapshot. If you plan to run Event Sourcing in production, it's best to design this pattern in from the start.
```
Events #1 ~ #100   → [Snapshot: save state at event 100]
Events #101 ~ #200 → [Snapshot: save state at event 200]
Query current state = latest snapshot + replay only events #201 onwards
```

When Event Sourcing shines: financial and medical systems where legal audit logs are mandatory, and complex domains that need time-travel debugging like "what was the order status 3 hours ago?"
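Returning to snapshots, here is a sketch of snapshot-aware replay. `snapshotStore` and `eventStore.getEventsAfter` are assumed interfaces rather than a specific library, and `applyEvent` stands for the reducer (the switch statement) from `rebuildOrderState` above, extracted into a named function:

```typescript
// Snapshot-aware rebuild: replay only the tail of the event stream.
// snapshotStore, eventStore.getEventsAfter, and applyEvent are assumptions.
async function rebuildWithSnapshot(orderId: string): Promise<Order> {
  const snapshot = await snapshotStore.getLatest('order', orderId); // null if none yet
  const baseState: Order = snapshot ? snapshot.state : { status: 'UNKNOWN' };
  const fromVersion = snapshot ? snapshot.version : 0;

  // Only events recorded after the snapshot's version need replaying
  const events = await eventStore.getEventsAfter('order', orderId, fromVersion);
  return events.reduce<Order>(applyEvent, baseState);
}
```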
When to use: When the read/write traffic ratio is extremely different, or when preserving the full change history is a business requirement. When to avoid: Simple CRUD domains, small teams, or when a fast MVP is needed. If the answer to "do I need audit logs or time-travel debugging right now?" is "no," a conventional DB design is the better choice.
Example 4: Event-Carried State Transfer — Eliminating Additional API Calls
When a consumer that receives an event turns around and queries the order service again ("what's the shipping address for this order?"), you're giving back EDA's advantages. The Event-Carried State Transfer pattern embeds enough data in the event payload for consumers to complete their processing on their own.
```typescript
// Too thin an event — consumer needs additional queries
const thinEvent = {
type: 'OrderCreated',
orderId: '12345',
};
// An event consumers can be self-sufficient with
const richEvent = {
type: 'OrderCreated',
id: uuid(),
source: 'order-service',
time: new Date().toISOString(),
data: {
orderId: '12345',
userId: 'user-789',
userEmail: 'customer@example.com', // data the notification service needs
items: [{ productId: 'prod-1', quantity: 2, price: 15000 }],
totalAmount: 30000,
shippingAddress: { // data the shipping service needs
zipCode: '06234',
address: '123 Main St, Seoul...',
},
},
};
```

There's a trade-off in larger payloads, but the effect of severing runtime dependencies between services is often greater. Still, there are clear situations where this pattern should not be used: payloads that reach several MB, and PII (personally identifiable information) that would be replicated across multiple services. The former puts heavy load on the event broker; the latter becomes a nightmare when a GDPR "right to be forgotten" request arrives and you have to find and delete the data from every service it was copied to.
When to use: When consumers can complete processing with just the event. When to avoid: PII-containing data, payloads of several MB in size, or when consumers always need to see the latest state.
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Loose Coupling | Producers and consumers are unaware of each other's existence. Independent deployment and scaling possible |
| Elastic Scaling | Consumers can be horizontally scaled to linearly increase throughput |
| Fault Isolation | When a consumer fails, events are preserved in the broker for automatic reprocessing |
| Audit Log | When combined with Event Sourcing, full history is automatically captured |
| Flexible Integration | Adding new consumers requires no modification to existing producers |
Disadvantages and Caveats
| Item | Description | Mitigation |
|---|---|---|
| Duplicate Events | Most brokers guarantee at-least-once delivery | Idempotency implementation on the consumer side is mandatory |
| Ordering Difficult to Guarantee | Kafka only guarantees order within a partition | Design partition keys carefully (e.g., orderId) |
| Debugging Complexity | Unlike synchronous calls, there is no call stack | Correlation ID + OpenTelemetry distributed tracing is essential |
| Eventual Consistency | Immediate consistency between services cannot be guaranteed | Explicitly expose "processing" state in the UX |
| Schema Evolution | Consumers must handle older-versioned events | Enforce compatibility policies with Schema Registry |
| Compensating Transaction Cost | No automatic rollback on Saga failure | Compensation logic must be manually implemented at each step |
Eventual Consistency: A consistency model in distributed systems where all nodes converge to the same state "eventually," but may see temporarily different values immediately after a change. Choosing EDA means accepting this trade-off.
Idempotency: The property where executing the same operation multiple times produces the same result. An attribute consumers in EDA must have.
The Most Common Mistakes in Practice
- **Deferring idempotency implementation** — Thinking "it usually arrives just once, so it'll be fine" leads to major incidents when duplicate events flood in during a failure scenario. Implement event-ID-based duplicate checking from the very beginning (see the sketch after this list).
- **Starting without Correlation IDs** — Once you have more than 3 services, tracing "where did this event originate?" becomes impossible. Including `correlationId` and `causationId` in the event envelope from the start, and connecting the chain with OpenTelemetry (an open-source standard for distributed tracing), makes life much easier later.
- **Agreeing on schemas verbally only** — Sharing "I'm adding a userId field to the OrderCreated event" only on Slack can cause an outage until the consumer team deploys. Using Confluent Schema Registry or Apicurio Registry to enforce schema compatibility at the code level is the safe approach.
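As a concrete sketch of the first point, one way to do event-ID-based duplicate checking with Redis. It assumes ioredis and an event envelope that carries a unique `id`; `applyBusinessLogic` is a hypothetical stand-in for your domain handler:

```typescript
// Idempotent consumer sketch (assumes ioredis; names are illustrative)
import Redis from 'ioredis';

const redis = new Redis(); // localhost:6379 by default

// Hypothetical domain handler
async function applyBusinessLogic(event: { id: string; type: string; payload: unknown }): Promise<void> {
  /* deduct inventory, send the notification, ... */
}

async function handleEvent(event: { id: string; type: string; payload: unknown }): Promise<void> {
  // SET ... NX returns null when the key already exists, i.e., a duplicate delivery
  const firstTime = await redis.set(`processed:${event.id}`, '1', 'EX', 7 * 24 * 3600, 'NX');
  if (firstTime === null) return; // already processed: skip
  await applyBusinessLogic(event);
}
```

Marking the ID before processing suppresses duplicates but can drop an event if the handler crashes right after the SET; many teams instead record the processed ID in the same DB transaction as the business change. The TTL simply keeps the dedup keys from accumulating forever.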
Closing
To use EDA safely, you must ensure atomicity with Outbox, handle distributed transactions with Saga, and prevent duplicates with idempotency. Proficiency comes from implementing these three pillars rather than just understanding the concepts.
Three steps you can start right now:
- **Add an Outbox table** — When developing a new feature, switch from publishing events directly to Kafka to recording them in the `outbox` table first. By replacing the `kafkaProducer.send` portion of the polling code above with log output, you can experience the pattern immediately, without Debezium.
- **Attach Correlation IDs to events** — Add `id`, `correlationId`, `source`, and `time` fields to your currently published event payloads, and have consumers pass `correlationId` through unchanged. Referencing the CloudEvents spec gives you a standard field structure out of the box (see the envelope sketch after this list).
- **Refactor one consumer to be idempotent** — Pick your most important consumer, record processed event IDs in Redis or a DB, and add logic to skip duplicates. Our team applied this to the first consumer and it naturally spread to the rest.
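A sketch of what such an envelope could look like. The core field names (`id`, `source`, `type`, `time`, `data`) follow the CloudEvents spec; `correlationId` and `causationId` are common custom extensions, not part of the core spec:

```typescript
// Event envelope sketch: core fields per CloudEvents, tracing fields as extensions
interface EventEnvelope<T> {
  id: string;            // unique per event; consumers use it for duplicate checks
  type: string;          // e.g., 'OrderCreated'
  source: string;        // publishing service, e.g., 'order-service'
  time: string;          // ISO-8601 timestamp
  correlationId: string; // shared by every event in one business flow
  causationId?: string;  // id of the event that directly triggered this one
  data: T;               // domain payload
}
```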