Saga Pattern in Practice — Designing Compensating Transactions for Microservice Distributed Systems (Choreography vs Orchestration)
Do you know the first wall you hit after migrating to microservices? For me, it was this question: "The order was created, but the payment failed. How do I roll back the inventory?" What a single @Transactional solved in the monolith days becomes a completely different problem the moment you split into three services. The databases are separate, the network can fail, and the compensation logic starts to tangle.
The Saga pattern is one of the most widely adopted approaches in production to address this problem. Instead of locking everything like 2PC (Two-Phase Commit), each service commits only to its own DB, and on failure, a "compensating transaction" logically reverses the change. In this post, we directly compare the two implementation approaches — Choreography and Orchestration — with TypeScript code, and examine how to design compensating events in real-world scenarios.
After reading this, you should have a solid sense of "which approach to use in which situation" and "why the Outbox pattern is practically mandatory." The code in this post is TypeScript-based, and backend developers who have worked with message brokers like Kafka will find it easier to follow along.
Core Concepts
The Problem Saga Solves
When a transaction spans multiple services in a distributed system, there are broadly two options: enforce atomicity with an atomic commit protocol such as 2PC, or reach consistency "eventually" with compensating transactions, as Saga does.
The reasons 2PC is hard to choose in practice are clear. If the coordinator fails mid-protocol, participants that have already voted to commit stay blocked, holding their locks until it recovers, and throughput drops sharply while locks are held. And when participants that cannot take part in a distributed transaction are involved, such as Kafka or external payment APIs, 2PC may not be possible at all.
Saga approaches this problem from a different angle.
[Traditional Distributed Transaction — 2PC]
BEGIN DISTRIBUTED TX
Order Service.insert() ← lock
Inventory Service.update() ← lock
Payment Service.charge() ← lock
COMMIT (or ROLLBACK) ← all at once
[Saga Pattern]
Order Service.insert() → commit → publish event
Inventory Service.update() → commit → publish event
Payment Service.charge() → commit (on failure: run compensating events in reverse)
Eventual Consistency: Instead of immediate consistency, the property where all services' states eventually align over time. Saga intentionally accepts this trade-off.
To be honest, Saga gives up isolation. Between the time an order is created and the payment is processed, another Saga may read the intermediate state of an order whose Saga has not yet completed. Saga literature treats this as a Saga-level dirty read (intermediate state visibility); the lost update, where one Saga overwrites another's change without having read it, is a related but separate anomaly. It differs from a database dirty read in that the data is already committed within each service, while the overall Saga is still in flight. It's worth confirming upfront whether exposing such intermediate states is acceptable for your business requirements.
What Is a Compensating Transaction?
A compensating transaction is not a DB rollback. It is a new, reverse operation that undoes an already-committed state.
| Original Action | Compensating Transaction |
|---|---|
| Create order (status: PENDING) | Cancel order (status: CANCELLED) |
| Reserve inventory (reserved: 10) | Release inventory (reserved: -10) |
| Process payment (charged: $50) | Refund payment (refunded: $50) |
The important point is that compensating transactions can also fail, and a failed or timed-out compensation will be retried. That's why all compensating actions should be designed to be idempotent: if the same refund request comes in twice, it should only be processed once.
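To make the idempotency requirement concrete, here is a minimal in-memory sketch of a refund compensation guarded by an idempotency key. The names `RefundLedger` and `refundOnce` and the key format are hypothetical, not part of any library. A real service would keep the processed keys in a table with a unique constraint and write them in the same transaction as the refund.

```typescript
// Hypothetical in-memory stand-in for a payments table plus a processed-keys table.
interface RefundLedger {
  refundedTotal: number;
  processedKeys: string[]; // idempotency keys that have already been applied
}

type RefundResult = 'APPLIED' | 'DUPLICATE_IGNORED';

// Applies the refund at most once per idempotency key.
function refundOnce(ledger: RefundLedger, idempotencyKey: string, amount: number): RefundResult {
  if (ledger.processedKeys.indexOf(idempotencyKey) !== -1) {
    return 'DUPLICATE_IGNORED'; // a retry of a request we already handled
  }
  ledger.refundedTotal += amount;
  ledger.processedKeys.push(idempotencyKey);
  return 'APPLIED';
}

// Usage: the same compensation request is delivered twice.
const ledger: RefundLedger = { refundedTotal: 0, processedKeys: [] };
const first = refundOnce(ledger, 'refund:order-42', 50);
const second = refundOnce(ledger, 'refund:order-42', 50);
```

The second call is recognized by its key and skipped, so the customer is refunded $50 once even though the request arrived twice.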
Practical Application
Example 1: Choreography — Self-Coordination via Events
Choreography is an approach where each service publishes and subscribes to events and moves autonomously, without a central coordinator. I also found this approach really elegant at first. Each service is independent, and all you need is a message broker. The ability to add a new service without touching existing ones was especially appealing.
// @EventHandler is a custom decorator representing message broker subscriptions.
// It can be implemented with NestJS's @nestjs/event-emitter or a Kafka consumer wrapper.
// 1. Order Service — creates order and publishes event via Outbox pattern
class OrderService {
constructor(
private readonly orderRepo: OrderRepository,
private readonly db: DataSource,
) {}
async createOrder(dto: CreateOrderDto): Promise<void> {
// Handle DB commit and event insertion in the same transaction (Outbox pattern)
await this.db.transaction(async (em) => {
const order = await em.save(Order, {
...dto,
status: OrderStatus.PENDING,
});
await em.save(OutboxEvent, {
aggregateId: order.id,
type: 'ORDER_CREATED',
payload: JSON.stringify({ orderId: order.id, items: dto.items }),
processedAt: null,
});
});
}
// Compensation on receiving failure event: cancel order
@EventHandler('INVENTORY_RESERVATION_FAILED')
async handleReservationFailed(event: ReservationFailedEvent): Promise<void> {
await this.db.transaction(async (em) => {
await em.update(Order, event.orderId, { status: OrderStatus.CANCELLED });
await em.save(OutboxEvent, {
aggregateId: event.orderId,
type: 'ORDER_CANCELLED',
payload: JSON.stringify({ orderId: event.orderId, reason: event.reason }),
processedAt: null,
});
});
}
}
// 2. Inventory Service — subscribes to OrderCreated and attempts stock reservation
class InventoryService {
constructor(private readonly db: DataSource) {}
@EventHandler('ORDER_CREATED')
async handleOrderCreated(event: OrderCreatedEvent): Promise<void> {
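// Idempotency guarantee: skip if this event has already been processed.
// This pre-check is only a fast path. Assumption: InboxEvent.eventId has a unique constraint,
// so a concurrent duplicate fails on insert and its whole transaction rolls back.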
const alreadyProcessed = await this.db.getRepository(InboxEvent)
.existsBy({ eventId: event.eventId });
if (alreadyProcessed) return;
const available = await this.checkStock(event.items);
if (!available) {
// Bundle inbox record and failure event in the same transaction
await this.db.transaction(async (em) => {
await em.save(InboxEvent, { eventId: event.eventId });
await em.save(OutboxEvent, {
aggregateId: event.orderId,
type: 'INVENTORY_RESERVATION_FAILED',
payload: JSON.stringify({ orderId: event.orderId, reason: 'OUT_OF_STOCK' }),
processedAt: null,
});
});
return;
}
// reserveStock, inbox record, and success event must be bundled in a single transaction.
// If the process dies after reserveStock but before the inbox record, a retry could double-deduct inventory.
await this.db.transaction(async (em) => {
await this.reserveStock(em, event.items);
await em.save(InboxEvent, { eventId: event.eventId });
await em.save(OutboxEvent, {
aggregateId: event.orderId,
type: 'INVENTORY_RESERVED',
payload: JSON.stringify({ orderId: event.orderId }),
processedAt: null,
});
});
}
}
Here is the Choreography event flow illustrated as a diagram.
[Order Service]
│ publishes ORDER_CREATED
▼
[Inventory Service] ──── out of stock ───► publishes INVENTORY_RESERVATION_FAILED
│ success │
│ publishes INVENTORY_RESERVED [Order Service] ◄──────────────┘
▼ publishes ORDER_CANCELLED
[Payment Service]
│ publishes PAYMENT_PROCESSED
▼
[Order Service] → confirm order
However, as the number of services grew past 5 or 6, it became increasingly difficult to track "where this event is published and where it's consumed." When debugging, you have to open the Kafka console and trace events one by one, and when an event is lost somewhere in the middle or the processing order gets scrambled, the frustration is considerable. You end up having to open the code of every service just to understand the full flow.
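One thing that eases this tracing pain is stamping every event with a correlation ID. Below is a minimal sketch, assuming a hypothetical `SagaEvent` envelope in which the first event's ID doubles as the correlation ID. The helper names (`startSaga`, `follow`, `traceSaga`) are made up for illustration, and the in-memory array stands in for whatever log or topic you actually search.

```typescript
// Hypothetical event envelope: every event in one Saga shares a correlationId.
interface SagaEvent {
  eventId: string;
  correlationId: string; // identical for all events of one Saga instance
  causationId: string | null; // eventId of the event that triggered this one
  type: string;
}

let nextId = 0;
function newEventId(): string {
  nextId += 1;
  return `evt-${nextId}`;
}

// Starts a new Saga: the first event's id is reused as the correlationId.
function startSaga(type: string): SagaEvent {
  const eventId = newEventId();
  return { eventId, correlationId: eventId, causationId: null, type };
}

// Derives a follow-up event, copying the correlationId from its cause.
function follow(cause: SagaEvent, type: string): SagaEvent {
  return {
    eventId: newEventId(),
    correlationId: cause.correlationId,
    causationId: cause.eventId,
    type,
  };
}

// Reconstructs one Saga's event sequence from a log that mixes several Sagas.
function traceSaga(log: SagaEvent[], correlationId: string): string[] {
  return log.filter((e) => e.correlationId === correlationId).map((e) => e.type);
}

// Usage: two Sagas interleaved in the same log.
const a1 = startSaga('ORDER_CREATED');
const b1 = startSaga('ORDER_CREATED');
const a2 = follow(a1, 'INVENTORY_RESERVATION_FAILED');
const a3 = follow(a2, 'ORDER_CANCELLED');
const trace = traceSaga([a1, b1, a2, a3], a1.correlationId);
```

With this in place, "what happened to this order?" becomes a single filter on the correlation ID instead of a walk through every service's code.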
Example 2: Orchestration — Centralized Control with an Orchestrator
With Orchestration, a central Saga orchestrator explicitly calls each step and manages state. This approach shines when compensation order matters or there are many steps. Since the entire flow is in one place, "which step failed and what compensation needs to run" is clearly visible from the code alone.
// Saga state type definitions
type SagaStep = 'RESERVE_INVENTORY' | 'PROCESS_PAYMENT' | 'CONFIRM_ORDER';
type SagaStatus = 'RUNNING' | 'COMPLETED' | 'COMPENSATING' | 'FAILED';
interface SagaState {
id: string;
orderId: string;
currentStep: SagaStep;
completedSteps: SagaStep[]; // cumulatively stored in DB; referenced in reverse during compensation
status: SagaStatus;
}
class OrderSagaOrchestrator {
constructor(
private readonly sagaRepo: SagaRepository,
private readonly inventoryClient: InventoryClient,
private readonly paymentClient: PaymentClient,
private readonly orderClient: OrderClient,
) {}
async execute(orderId: string): Promise<void> {
const saga = await this.sagaRepo.create({
orderId,
currentStep: 'RESERVE_INVENTORY',
completedSteps: [],
status: 'RUNNING',
});
try {
// Step 1: Reserve inventory
await this.inventoryClient.reserve(orderId);
// recordStep adds the given step to the completedSteps array and persists it to DB.
// Even after an orchestrator restart, we can tell which steps completed for compensation.
await this.sagaRepo.recordStep(saga.id, 'RESERVE_INVENTORY');
// Step 2: Process payment
await this.paymentClient.process(orderId);
await this.sagaRepo.recordStep(saga.id, 'PROCESS_PAYMENT');
// Step 3: Confirm order
await this.orderClient.confirm(orderId);
await this.sagaRepo.markCompleted(saga.id);
} catch (error) {
await this.sagaRepo.updateStatus(saga.id, 'COMPENSATING');
// Re-fetch the latest completedSteps via findById to use for compensation
await this.compensate(await this.sagaRepo.findById(saga.id));
}
}
private async compensate(saga: SagaState): Promise<void> {
// Run compensation in reverse order only for completed steps
const compensations: Partial<Record<SagaStep, () => Promise<void>>> = {
PROCESS_PAYMENT: () => this.paymentClient.refund(saga.orderId),
RESERVE_INVENTORY: () => this.inventoryClient.release(saga.orderId),
// CONFIRM_ORDER has no compensation — if we reached this step, everything succeeded
};
const stepsToCompensate = [...saga.completedSteps].reverse();
for (const step of stepsToCompensate) {
const compensation = compensations[step];
if (compensation) {
try {
await compensation();
} catch (err) {
// On compensation failure: move to Dead Letter Queue or notify for manual intervention
await this.notifyManualIntervention(saga.id, step, err);
}
}
}
await this.sagaRepo.markFailed(saga.id);
}
}
The Orchestration flow is much clearer.
[OrderSagaOrchestrator]
│
├──► inventoryClient.reserve(orderId) ✓ → completedSteps: ['RESERVE_INVENTORY']
│
├──► paymentClient.process(orderId) ✗ failed!
│
│ [Begin compensation — reverse order]
├──► inventoryClient.release(orderId) (compensates RESERVE_INVENTORY)
│
└──► sagaRepo.markFailed(sagaId)
Advanced: Leveraging Platforms Instead of Rolling Your Own
Example 3: Durable Execution with Temporal
When implementing your own Saga engine, you keep running into questions like "what happens if the orchestrator restarts?" and "which step do we resume from after a network failure?" As of 2025, more teams are delegating this complexity to a platform, with Temporal being the leading choice.
import { proxyActivities, ApplicationFailure } from '@temporalio/workflow';
import type * as activities from './activities';
const { reserveInventory, processPayment, cancelReservation, refundPayment } =
proxyActivities<typeof activities>({
startToCloseTimeout: '30s',
retry: {
maximumAttempts: 3,
nonRetryableErrorTypes: ['InsufficientStockError', 'InvalidPaymentError'],
},
});
// Temporal durably manages this workflow's state via event sourcing
export async function orderSagaWorkflow(orderId: string): Promise<void> {
let inventoryReserved = false;
let paymentProcessed = false;
try {
await reserveInventory(orderId);
inventoryReserved = true;
await processPayment(orderId);
paymentProcessed = true;
} catch (err) {
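// Compensation in reverse order, continuing even if individual compensations fail.
// In this two-step flow paymentProcessed is never true here (processPayment is the last step),
// so the refund branch only matters once further steps follow the payment.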
if (paymentProcessed) {
await refundPayment(orderId).catch(() => {
// Handle refund failure separately, e.g., send a notification
});
}
if (inventoryReserved) {
await cancelReservation(orderId);
}
throw ApplicationFailure.create({ message: `Order saga failed: ${orderId}` });
}
}
Durable Execution: An execution model that persists the state of workflow code (which steps have completed) using event sourcing, allowing resumption from the point of interruption even after a server restart or network failure. Temporal is the canonical implementation; AWS Step Functions offers similar guarantees.
If you're confident implementing your own Saga engine, building it yourself is a perfectly valid choice. But for a first adoption, starting with a prototype on Temporal Cloud's free tier or AWS Step Functions can significantly reduce implementation cost.
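Whichever route you take, the boolean flags in the workflow above (inventoryReserved, paymentProcessed) get unwieldy as steps are added. A common alternative is a compensation stack: register each step's undo only after the step succeeds, then unwind in reverse on failure. The sketch below is plain TypeScript with synchronous steps so that it stays self-contained. `SagaStepDef` and `runSaga` are hypothetical names, and in a Temporal workflow the actions would be awaited activities instead.

```typescript
interface SagaStepDef {
  name: string;
  action: () => void; // throws on failure
  compensation?: () => void; // undoes a completed action; optional
}

interface SagaOutcome {
  status: 'COMPLETED' | 'FAILED';
  failedStep: string | null;
  compensationFailures: string[]; // steps whose compensation threw
}

// Runs steps in order. On the first failure, completed steps are
// compensated in reverse (LIFO) order.
function runSaga(steps: SagaStepDef[]): SagaOutcome {
  const undoStack: SagaStepDef[] = [];
  for (const step of steps) {
    try {
      step.action();
      undoStack.push(step); // register only after the action succeeded
    } catch (err) {
      const compensationFailures: string[] = [];
      // Unwind, continuing even if an individual compensation fails.
      while (undoStack.length > 0) {
        const done = undoStack.pop() as SagaStepDef;
        if (!done.compensation) continue;
        try {
          done.compensation();
        } catch (compErr) {
          compensationFailures.push(done.name); // needs manual follow-up
        }
      }
      return { status: 'FAILED', failedStep: step.name, compensationFailures };
    }
  }
  return { status: 'COMPLETED', failedStep: null, compensationFailures: [] };
}

// Usage: the third step fails, so payment and inventory are undone in reverse.
const calls: string[] = [];
const outcome = runSaga([
  { name: 'RESERVE_INVENTORY', action: () => { calls.push('reserve'); }, compensation: () => { calls.push('release'); } },
  { name: 'PROCESS_PAYMENT', action: () => { calls.push('charge'); }, compensation: () => { calls.push('refund'); } },
  { name: 'CONFIRM_ORDER', action: () => { throw new Error('order service unavailable'); } },
]);
```

Because a step is pushed only after it succeeds, the failing step itself is never compensated, and adding a fourth step needs no new flag.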
Trade-off Analysis
Choreography vs Orchestration Comparison
| Criterion | Choreography | Orchestration |
|---|---|---|
| Service coupling | Low — no direct inter-service dependencies | High — orchestrator must know each service |
| Overall flow visibility | Low — logic scattered across multiple files | High — full state visible in one file |
| Debugging difficulty | Grows with number of services | Relatively straightforward |
| Complex compensation ordering | Difficult to manage | Explicitly controllable |
| Throughput & scalability | High — well-suited for async processing | Orchestrator can become a bottleneck |
| Independent deployment | Each team can deploy independently | Requires deployment when orchestrator changes |
Decision Criteria by Situation
Honestly, there's no single answer that always applies. Use the criteria below to choose what fits your situation.
| Situation | Recommended Approach |
|---|---|
| 3 or fewer services, simple success/failure flow | Choreography |
| Compensation order matters or 5+ steps | Orchestration |
| Service teams need to deploy independently | Choreography |
| Business-critical workflow requiring audit logs | Orchestration |
| Frequently adding new services to existing flow | Choreography |
| Complex rollback logic requiring manual intervention on failure | Orchestration |
Drawbacks and Caveats
| Item | Description | Mitigation |
|---|---|---|
| Intermediate State Visibility | Other Sagas may read intermediate state data between steps | Review business requirements and decide whether to allow. Consider the Semantic Lock pattern if needed (marking in-progress records with something like status: PROCESSING so other Sagas can filter them out) |
| Compensation failure | Compensating transactions themselves can fail | Dead Letter Queue + manual intervention notification + idempotent compensations |
| Orchestrator single point of failure | In Orchestration, an orchestrator failure halts the Saga | Persist orchestrator state to DB; design for recovery on restart |
| Event explosion | In Choreography, tracking event relationships becomes harder as service count grows | Document an Event Catalog; track with Correlation IDs |
| Non-compensatable actions | Emails, SMS messages cannot be undone | Place them as the last step in the Saga, or use a "scheduled send" approach to defer delivery |
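The Semantic Lock idea from the first row of the table can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `OrderRecord` whose PROCESSING status acts as the lock flag. `settledOrders` and `tryCancel` are made-up helpers, not part of any framework.

```typescript
type OrderStatus = 'PROCESSING' | 'CONFIRMED' | 'CANCELLED';

interface OrderRecord {
  id: string;
  status: OrderStatus; // PROCESSING means a Saga is still in flight
  total: number;
}

// Readers that must not observe in-flight Sagas see only settled orders.
function settledOrders(orders: OrderRecord[]): OrderRecord[] {
  return orders.filter((o) => o.status !== 'PROCESSING');
}

// Another Saga refuses to modify an order that is semantically locked.
function tryCancel(order: OrderRecord): boolean {
  if (order.status === 'PROCESSING') {
    return false; // locked: the caller can retry later or fail fast
  }
  order.status = 'CANCELLED';
  return true;
}

// Usage: one order is mid-Saga, the other has settled.
const inFlight: OrderRecord = { id: 'o-1', status: 'PROCESSING', total: 50 };
const settled: OrderRecord = { id: 'o-2', status: 'CONFIRMED', total: 20 };
const visibleIds = settledOrders([inFlight, settled]).map((o) => o.id);
const cancelledInFlight = tryCancel(inFlight);
const cancelledSettled = tryCancel(settled);
```

The lock is only a status value, so nothing is blocked at the database level. Each reader and writer has to honor the flag, which is why it is worth deciding this per use case.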
Outbox Pattern: A pattern for making the DB commit and the message publication atomic. Events are saved to an Outbox table in the same transaction as the business data, and a separate relay process reads them and publishes them to the message broker. This prevents the case, possible at any Saga step, where a DB commit succeeds but the event is never published, which makes the pattern practically mandatory. Because the relay may publish the same event more than once after a crash, consumers still need to be idempotent.
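The "separate process" half of the Outbox pattern is easy to overlook, so here is a minimal sketch of one polling pass. It assumes an in-memory array in place of the Outbox table and a hypothetical `Publish` callback in place of a real broker client. `relayOnce` is an illustrative name, and the processedAt field mirrors the one used in the examples earlier in this post.

```typescript
interface OutboxRow {
  id: number;
  type: string;
  payload: string;
  processedAt: number | null; // null means not yet published
}

// Hypothetical broker client: throws if the broker is unavailable.
type Publish = (type: string, payload: string) => void;

// One polling pass: publish pending rows in insertion order, then mark them.
// A crash between publish and mark leads to a re-publish on the next pass,
// so delivery is at-least-once rather than exactly-once.
function relayOnce(outbox: OutboxRow[], publish: Publish, now: number): number {
  let published = 0;
  for (const row of outbox) {
    if (row.processedAt !== null) continue;
    try {
      publish(row.type, row.payload);
    } catch (err) {
      break; // stop to preserve ordering; the next pass retries this row
    }
    row.processedAt = now;
    published += 1;
  }
  return published;
}

// Usage: the broker is down for the first pass and back for the second.
const outbox: OutboxRow[] = [
  { id: 1, type: 'ORDER_CREATED', payload: '{"orderId":"o-1"}', processedAt: null },
  { id: 2, type: 'ORDER_CANCELLED', payload: '{"orderId":"o-1"}', processedAt: null },
];
const sent: string[] = [];
let brokerUp = false;
const publish: Publish = (type) => {
  if (!brokerUp) throw new Error('broker unavailable');
  sent.push(type);
};
const firstPass = relayOnce(outbox, publish, 1000);
brokerUp = true;
const secondPass = relayOnce(outbox, publish, 2000);
const thirdPass = relayOnce(outbox, publish, 3000);
```

Nothing is lost while the broker is down, because the rows simply wait in the table. In production this loop is often replaced by change data capture, but the guarantee is the same.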
The Most Common Mistakes in Production
- Not applying idempotency to compensating transactions: Network timeouts can cause the same compensation request to be delivered twice. Including an idempotency_key in requests and using an Inbox table to prevent duplicate processing is an effective approach.
- Placing email/SMS delivery in the middle of a Saga: If you send an "order confirmation email" immediately after inventory reservation and the payment fails causing the order to be cancelled, there's no way to recall an already-sent email. External notifications should always be placed as the last step of the Saga.
- Not persisting Saga state: If the orchestrator restarts or the network goes down and you don't know which steps have completed, you can't properly execute compensation. It's critical to record state like completedSteps to the DB and design for resumption on restart.
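For the third mistake, here is a rough sketch of what "design for resumption on restart" can look like. It assumes the simplest policy: any Saga found in RUNNING or COMPENSATING state at startup is rolled back. `PersistedSaga`, `COMPENSATION_FOR`, and `planRecovery` are hypothetical names, and the function only plans the compensations rather than calling services.

```typescript
type SagaStep = 'RESERVE_INVENTORY' | 'PROCESS_PAYMENT' | 'CONFIRM_ORDER';
type SagaStatus = 'RUNNING' | 'COMPLETED' | 'COMPENSATING' | 'FAILED';

interface PersistedSaga {
  id: string;
  orderId: string;
  completedSteps: SagaStep[];
  status: SagaStatus;
}

interface RecoveryPlan {
  sagaId: string;
  compensations: string[]; // in the order they should run
}

// Hypothetical compensation names per step; CONFIRM_ORDER has none.
const COMPENSATION_FOR: Partial<Record<SagaStep, string>> = {
  RESERVE_INVENTORY: 'RELEASE_INVENTORY',
  PROCESS_PAYMENT: 'REFUND_PAYMENT',
};

// On startup, finds interrupted Sagas and plans their compensations in reverse.
function planRecovery(sagas: PersistedSaga[]): RecoveryPlan[] {
  const plans: RecoveryPlan[] = [];
  for (const saga of sagas) {
    if (saga.status !== 'RUNNING' && saga.status !== 'COMPENSATING') continue;
    const compensations: string[] = [];
    for (const step of saga.completedSteps.slice().reverse()) {
      const name = COMPENSATION_FOR[step];
      if (name) compensations.push(name);
    }
    plans.push({ sagaId: saga.id, compensations });
  }
  return plans;
}

// Usage: s-1 died mid-flight, s-2 finished, s-3 died while compensating.
const plans = planRecovery([
  { id: 's-1', orderId: 'o-1', completedSteps: ['RESERVE_INVENTORY', 'PROCESS_PAYMENT'], status: 'RUNNING' },
  { id: 's-2', orderId: 'o-2', completedSteps: ['RESERVE_INVENTORY', 'PROCESS_PAYMENT'], status: 'COMPLETED' },
  { id: 's-3', orderId: 'o-3', completedSteps: ['RESERVE_INVENTORY'], status: 'COMPENSATING' },
]);
```

Note that a Saga found in COMPENSATING state gets all of its compensations planned again, because the persisted state does not say which ones already ran. That is one more reason the compensations themselves have to be idempotent.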
Closing Thoughts
The biggest realization I had while adopting the Saga pattern is that it doesn't solve the problem of "perfect atomicity." Rather, it's about designing how gracefully you handle failures in exchange for giving up perfect atomicity. More important than whether you choose Choreography or Orchestration is ensuring that your team clearly understands and accepts the trade-offs that come with that choice — coupling, visibility, and compensation complexity.
Three steps you can start with right now:
- Draw the distributed transaction flow for your current services. Write out each step like Order → Inventory → Payment, and fill in a "compensation column" next to each step describing how you'd undo that step if a later step failed. The gaps in your design will start to become visible.
- Implement a simple 2–3 step flow first with Choreography. Build the ORDER_CREATED → INVENTORY_RESERVED → PAYMENT_PROCESSED flow with Kafka or RabbitMQ, and apply the Outbox pattern at each step as you go.
- Consider switching to Orchestration when the flow reaches 4+ steps or compensation ordering becomes important. Before implementing your own orchestrator, try prototyping on Temporal Cloud's free tier or AWS Step Functions first; it can dramatically reduce implementation cost.