Distributed writes get complicated as soon as one business action has to move through several services. One order can require inventory to be reserved in one service, payment to be approved in another, and shipping to begin in a third. With service boundaries like that, a normal local ACID transaction from one service does not span all of those separate databases, so the write is broken into local transactions and a saga coordinator guides the flow from one step to the next. The coordinator tracks the current step, sends the next command, waits for replies, and starts compensation if a later step fails. That gives us staged consistency, where every service commits its own data while the broader business action keeps moving toward completion or a controlled rollback.
Why Saga Coordination Fits Multi-Service Writes
What makes saga coordination useful is the gap between service ownership and business ownership. Inventory, payment, shipping, and orders may live in separate services with separate databases, but the customer action still has to finish as one coherent business flow. Stock cannot stay reserved after payment is rejected, and shipment should not begin before earlier steps settle. Saga coordination gives that cross-service write a recorded sequence and a defined route forward, so the system can tell which local commits already happened, which reply is still pending, and which earlier business effect needs to be reversed if the flow breaks in the middle.
Why One Transaction Stops at Service Borders
Spring gives us @Transactional for transaction demarcation inside an application, and that transaction lives inside the resources managed by that application’s transaction manager. In a microservice split, order-service, inventory-service, and payment-service own different databases, different connection pools, and their own runtime boundaries. That means a service can commit its own local change, but it does not get a shared transaction that naturally wraps the other services too. Spring’s transaction model is thread-bound for regular imperative flows, and the current transaction does not jump into newly started threads either. The moment a business action leaves one service and travels over the network, we are outside the scope of a normal local transaction.
That limitation changes how we think about a multi-service write. We stop asking how to keep a database transaction open across all services and start asking how to move a business action through several local commits without losing track of it. Sagas solve that by shrinking the transaction boundary and making the business flow more explicit. From there, the first service writes its data and records what should happen next. Later, a different service handles its own local transaction and sends back a reply. Progress comes from a stored conversation between services rather than from an open transaction that covers every row touched by the request.
Here is an example that makes the boundary easier to see:
package com.example.order;

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderWriteService {

    private final OrderRepository orderRepository;
    private final InventoryClient inventoryClient;

    public OrderWriteService(OrderRepository orderRepository, InventoryClient inventoryClient) {
        this.orderRepository = orderRepository;
        this.inventoryClient = inventoryClient;
    }

    @Transactional
    public void placeOrder(PlaceOrderCommand command) {
        // Local write: part of the order service's own transaction.
        orderRepository.save(Order.pending(command.orderId(), command.total()));
        // Remote call: commits or fails inside the inventory service, outside this transaction.
        inventoryClient.reserve(command.orderId(), command.sku(), command.quantity());
    }
}

Reading placeOrder() shows that the call to orderRepository.save() belongs to the local transaction owned by the order service. The call to inventoryClient.reserve() is a separate hop into a different service boundary. If the inventory service commits its own reservation and the order transaction later rolls back, the local transaction cannot reach across the network and undo the inventory change. That is why sagas exist in the first place. They accept that service boundaries are real and give the business action a controlled route through partial success and failure.
Local transaction scope gets easier to follow when we put the database change and the remote call side by side. We can commit the order row inside the order service, but the inventory service still has to decide its own database outcome inside its own transaction. Nothing about the first commit gives us automatic control over the second. Once we accept that boundary, the rest of the saga conversation makes more sense because we stop treating the flow as one giant unit and start treating it as several local decisions tied to the same business action.
How Step Sequencing Moves Forward
Progress in a saga is driven by stored state, not by a request thread staying alive from start to finish. The coordinator keeps a saga id, a business id such as orderId, the current step, the current status, and enough timing data to know what reply is expected next. That lets the business action pause after a local commit, wait for a message, then continue later when the reply arrives. A checkout that begins at noon can still finish minutes later after an external payment reply, because the flow lives in the database and not in a request stack that had to remain open the whole time.
Stored sequencing also gives us a direct way to answer basic business questions. Which step is active right now? Which steps already committed? Which reply never arrived? Which failure code stopped progress? That state record turns a long-running action into something the application can reason about instead of a half-remembered chain of messages. Without that record, replies could still arrive, but the system would have very little context for deciding what they mean or what should happen next.
We can make that easier to read by naming the forward steps and the compensating steps that reverse them:
package com.example.saga;

public enum SagaStep {
    RESERVE_STOCK,
    AUTHORIZE_PAYMENT,
    START_SHIPMENT,
    RELEASE_STOCK,
    REFUND_PAYMENT
}

Status naming belongs right beside it:
package com.example.saga;

public enum SagaStatus {
    STARTED,
    WAITING,
    COMPLETED,
    COMPENSATING,
    FAILED
}

Those enums are not the whole saga, but they let us see the flow as named stages with rules around movement. When we move from RESERVE_STOCK to AUTHORIZE_PAYMENT, inventory already accepted the request. When we move from AUTHORIZE_PAYMENT to RELEASE_STOCK, payment failed after an earlier forward step had already committed. That is sequencing in a very practical form. We turn a distributed write into a state transition problem the application can store and revisit later.
Time belongs in the sequence too. A saga may wait too long for payment approval, shipment booking, or inventory confirmation. At that point, the coordinator needs a stored deadline so it can detect stale progress and decide its next move. That next move could be a retry, a failure state, or a compensation branch, but the decision only makes sense when the waiting state has been written down somewhere durable. Without that, the business action has no memory of what it was waiting for or how long it has been waiting.
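The stored position and deadline described above can be sketched as a small class. This is a minimal in-memory illustration, not the article's SagaState: the name SagaProgress, its nested enums, and its method signatures are assumptions made for the example, and a real version would be a persisted row.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.UUID;

// Hypothetical stand-in for a persisted saga row; all names are illustrative.
final class SagaProgress {

    enum Step { RESERVE_STOCK, AUTHORIZE_PAYMENT, START_SHIPMENT }

    enum Status { WAITING, COMPLETED, FAILED }

    private final UUID sagaId;
    private final UUID orderId;
    private Step step;
    private Status status;
    private Instant waitUntil;

    private SagaProgress(UUID sagaId, UUID orderId, Step step, Instant waitUntil) {
        this.sagaId = sagaId;
        this.orderId = orderId;
        this.step = step;
        this.status = Status.WAITING;
        this.waitUntil = waitUntil;
    }

    // Opens a saga that waits for the reply to the given step until now + timeout.
    static SagaProgress waitingFor(UUID sagaId, UUID orderId, Step step, Instant now, Duration timeout) {
        return new SagaProgress(sagaId, orderId, step, now.plus(timeout));
    }

    // Advances to the next step and resets the deadline for the reply now expected.
    void moveTo(Step next, Instant now, Duration timeout) {
        if (status != Status.WAITING) {
            throw new IllegalStateException("saga " + sagaId + " is already " + status);
        }
        this.step = next;
        this.waitUntil = now.plus(timeout);
    }

    void complete() {
        this.status = Status.COMPLETED;
    }

    // Overdue only while still waiting and strictly past the stored deadline.
    boolean isOverdue(Instant now) {
        return status == Status.WAITING && now.isAfter(waitUntil);
    }

    UUID orderId() { return orderId; }
    Step step() { return step; }
    Status status() { return status; }
}
```

Passing the clock in as a parameter instead of calling Instant.now() inside the class keeps the deadline logic easy to test and makes a scheduled scanner responsible for deciding what "now" means.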
What Compensation Really Changes
Compensation is where people sometimes think in terms of a full rewind, but that is not what happens. Earlier writes were committed in local transactions, and those committed facts do not vanish just because a later step fails. Compensation creates new business actions that counter the earlier ones. If stock was reserved, the reverse action is to release the reservation. If payment was captured, the reverse action is to void it or refund it depending on timing and payment-provider rules. If shipment started, the reverse action may be a cancellation request, not deletion of a row as though it never existed.
That distinction matters because business history still has value after failure. Support staff need to know that payment was tried. Finance needs to know that a refund was issued. Inventory needs to know that reserved units were later released. Compensation keeps the system valid without pretending the forward attempt never happened. The business action changes direction, but the audit trail stays intact. That is why compensation should be treated as a fresh local transaction with reverse business meaning, not as database erasure.
Reverse sequencing follows the history of what already committed. If inventory reservation succeeded and payment authorization failed, the saga does not compensate payment because payment never reached a committed success state. It compensates inventory because inventory did commit. If inventory, payment, and shipment all committed and a later shipping rule fails before final completion, compensation begins with shipment cancellation, then refund or void handling if the business flow calls for it, then stock release if reserved stock is still held. The coordinator walks backward through the committed forward steps, not through steps that never happened.
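The backward walk over committed steps can be sketched as a small planner. The class and enum names here are illustrative, and CANCEL_SHIPMENT is an assumed step that does not appear in the article's SagaStep enum.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner: turns the committed forward steps into reverse steps, newest first.
final class CompensationPlanner {

    enum ForwardStep { RESERVE_STOCK, AUTHORIZE_PAYMENT, START_SHIPMENT }

    enum ReverseStep { RELEASE_STOCK, REFUND_PAYMENT, CANCEL_SHIPMENT }

    private CompensationPlanner() {
    }

    static ReverseStep reverseOf(ForwardStep step) {
        return switch (step) {
            case RESERVE_STOCK -> ReverseStep.RELEASE_STOCK;
            case AUTHORIZE_PAYMENT -> ReverseStep.REFUND_PAYMENT;
            case START_SHIPMENT -> ReverseStep.CANCEL_SHIPMENT;
        };
    }

    // 'committed' lists forward steps in the order they committed.
    // Steps that never committed are simply absent, so they are never compensated.
    static List<ReverseStep> plan(List<ForwardStep> committed) {
        List<ReverseStep> plan = new ArrayList<>();
        for (int i = committed.size() - 1; i >= 0; i--) {
            plan.add(reverseOf(committed.get(i)));
        }
        return List.copyOf(plan);
    }
}
```

With only RESERVE_STOCK committed, the plan holds a single RELEASE_STOCK. With all three forward steps committed, the plan starts with shipment cancellation and ends with stock release.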
We can express that reverse intent with a small domain record:
package com.example.saga;

import java.time.Instant;
import java.util.UUID;

public record CompensationAction(
        UUID sagaId,
        UUID orderId,
        SagaStep stepToReverse,
        String reason,
        Instant createdAt) {
}

When we read CompensationAction, we can see that the system is not trying to wipe away a failed business action. It is recording a new action with a reason and a target step to reverse. That leads to much better recovery behavior. If a compensation command fails on the first send, the record is still there. If a service restarts, the reverse intent is still there. If manual review is needed, the application still knows what remains undone.
Not every step has a neat reverse command, and that belongs in the model too. Email sent to a customer cannot be unsent. Shipment picked up by a carrier may need a support process rather than an automatic undo step. In those cases, the saga can move into a failed or manual-review state and stop the forward flow. Compensation is a business decision written into the flow, not a universal rollback button that always restores everything to an earlier state.
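One way to put "no neat reverse command" into the model is to make the reverse command optional per business effect. The sketch below assumes hypothetical effect names and a made-up billing.void-payment destination; only inventory.release-stock comes from the article.

```java
import java.util.Optional;

// Hypothetical catalog: which committed effects have an automatic reverse command.
final class CompensationCatalog {

    enum Effect { STOCK_RESERVED, PAYMENT_AUTHORIZED, CONFIRMATION_EMAIL_SENT, SHIPMENT_PICKED_UP }

    enum Route { COMPENSATE, MANUAL_REVIEW }

    private CompensationCatalog() {
    }

    // Empty means the effect cannot be undone by sending a command.
    static Optional<String> reverseCommandFor(Effect effect) {
        return switch (effect) {
            case STOCK_RESERVED -> Optional.of("inventory.release-stock");
            case PAYMENT_AUTHORIZED -> Optional.of("billing.void-payment"); // assumed destination
            case CONFIRMATION_EMAIL_SENT, SHIPMENT_PICKED_UP -> Optional.empty();
        };
    }

    // Effects without a reverse command are routed to people instead of to a participant.
    static Route routeFor(Effect effect) {
        return reverseCommandFor(effect).isPresent() ? Route.COMPENSATE : Route.MANUAL_REVIEW;
    }
}
```

Because the switch has no default branch, adding a new effect to the enum forces a compile-time decision about whether it can be reversed automatically.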
How Long-Running State Is Stored
Long-running business actions need durable state because the action can outlive a request, a JVM instance, a broker reconnect, or a temporary outage in one participant. That usually leads to three related storage areas. Business rows owned by each service live in the first area, such as orders, reservations, or payment authorizations. Saga progress lives in the second area, including the saga id, business id, current step, current status, timeout values, failure reason, and retry data. Message durability records live in the third area, such as outbound outbox rows and, in several cases, processed-message or inbox records for duplicate protection.
Without those records, long-running consistency becomes fragile very quickly. Think through a service that updates its own order row and then tries to publish a message directly to a broker. If the process stops after the database commit but before the broker send, the local write survives and the next step disappears. The transactional outbox exists to stop that split. We write business data and an outbound message record in the same local transaction, then a later relay publishes the outbox row. That keeps the local state change and publish intent tied to the same commit.
Consumer-side deduplication belongs in the same discussion too. Message brokers commonly provide at-least-once delivery, which means the same message can be delivered more than once. An outbox relay adds its own duplicates, because a relay that stops after the broker send but before marking the row as published will send that row again. That is why deduplication matters on the receiving side. With a processed-message table or similar marker, a service can see that a message id already succeeded and skip replaying the business effect. Without that, retries meant to keep the flow alive can reserve the same stock twice or apply the same release twice.
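The effect of a processed-message marker is easy to see in a small in-memory sketch. The class below is a hypothetical stand-in: a HashSet plays the role of the processed-message table, and a real participant would write the marker and the stock change in the same database transaction.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical in-memory consumer; the set stands in for a processed-message table.
final class IdempotentStockConsumer {

    private final Set<UUID> processedMessageIds = new HashSet<>();
    private int reservedUnits;

    // Returns true when the reservation was applied, false when the message was a duplicate.
    boolean onReserve(UUID messageId, int quantity) {
        // Set.add returns false if the id was already present, so a redelivery is skipped.
        if (!processedMessageIds.add(messageId)) {
            return false;
        }
        reservedUnits += quantity;
        return true;
    }

    int reservedUnits() {
        return reservedUnits;
    }
}
```

Delivering the same message id twice with a quantity of 3 leaves reservedUnits() at 3 rather than 6, which is the behavior the receiving side needs under at-least-once delivery.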
Stored timeout data matters for a similar reason. A saga may be waiting for a payment reply that never comes because a broker message was delayed, a participant is down, or an external gateway never answered. The coordinator needs a waitUntil value or something similar so the application can detect that the saga has been waiting too long and decide its next move. That move could be a retry, a failure state, or a compensation branch, but the system cannot make that decision if the waiting state lives only in memory.
What makes saga storage useful is not just that rows exist. The value comes from the fact that the current business position is always recorded somewhere durable. After a restart, the application can still tell which step had succeeded, which command was queued, which reply was missing, and which compensation branch is pending. That is how a long-running business action stays consistent without a shared database transaction. The state is written down, revisited later, and moved forward with full context.
Building the Flow in Spring Boot
Turning saga rules into application code means giving the coordinator a durable record of progress while keeping participant services disciplined about their own data. Spring Boot works well with that model because we can assemble the moving pieces from ordinary beans, repositories, transaction boundaries, scheduled jobs, and observability hooks. Smaller saga flows do not need a dedicated saga runtime at the start. If the state graph grows dense enough that transitions, guards, and actions start crowding the application code, Spring Statemachine is available as a separate option for modeling those transitions more formally.
Coordinator Service Layout
Most coordinators begin as regular Spring application code. We usually keep a service bean that opens the saga, a repository for the saga row itself, an order or business repository for the local domain write, listener methods for replies, and an outbox repository for outbound commands. That keeps the coordinator focused on business progress rather than transport details. It does not reserve stock or approve payment directly. Its job is to record the current step, commit local state, and queue the next command only after that local transaction completes successfully.
We can start with the coordinator method that opens the saga and records durable intent:
package com.example.checkout;

import java.time.Instant;
import java.util.UUID;

import com.example.saga.SagaStep;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class CheckoutCoordinator {

    private final PurchaseOrderRepository purchaseOrderRepository;
    private final SagaStateRepository sagaStateRepository;
    private final OutboxRecordRepository outboxRecordRepository;

    public CheckoutCoordinator(
            PurchaseOrderRepository purchaseOrderRepository,
            SagaStateRepository sagaStateRepository,
            OutboxRecordRepository outboxRecordRepository) {
        this.purchaseOrderRepository = purchaseOrderRepository;
        this.sagaStateRepository = sagaStateRepository;
        this.outboxRecordRepository = outboxRecordRepository;
    }

    @Transactional
    public UUID openCheckout(StartCheckout command) {
        UUID orderId = UUID.randomUUID();
        UUID sagaId = UUID.randomUUID();
        purchaseOrderRepository.save(
                PurchaseOrder.pending(orderId, command.customerId(), command.total()));
        sagaStateRepository.save(
                SagaState.waitingFor(
                        sagaId,
                        orderId,
                        SagaStep.RESERVE_STOCK,
                        Instant.now().plusSeconds(30)));
        outboxRecordRepository.save(
                OutboxRecord.command(
                        UUID.randomUUID(),
                        sagaId,
                        orderId,
                        "inventory.reserve-stock",
                        """
                        {"orderId":"%s","sku":"%s","quantity":%d,"amount":"%s"}
                        """.formatted(orderId, command.sku(), command.quantity(), command.total())));
        return orderId;
    }

    @Transactional
    public void onStockReserved(StockReserved reply) {
        SagaState saga = sagaStateRepository.getRequired(reply.sagaId());
        saga.moveTo(SagaStep.AUTHORIZE_PAYMENT, Instant.now().plusSeconds(30));
        outboxRecordRepository.save(
                OutboxRecord.command(
                        UUID.randomUUID(),
                        reply.sagaId(),
                        reply.orderId(),
                        "billing.authorize-payment",
                        """
                        {"orderId":"%s","amount":"%s","sku":"%s","quantity":%d}
                        """.formatted(
                                reply.orderId(),
                                reply.amount(),
                                reply.sku(),
                                reply.quantity())));
    }

    @Transactional
    public void onPaymentDeclined(PaymentDeclined reply) {
        SagaState saga = sagaStateRepository.getRequired(reply.sagaId());
        saga.moveTo(SagaStep.RELEASE_STOCK, Instant.now().plusSeconds(30));
        purchaseOrderRepository.markCancelled(reply.orderId(), "PAYMENT_DECLINED");
        outboxRecordRepository.save(
                OutboxRecord.command(
                        UUID.randomUUID(),
                        reply.sagaId(),
                        reply.orderId(),
                        "inventory.release-stock",
                        """
                        {"orderId":"%s","sku":"%s","quantity":%d}
                        """.formatted(reply.orderId(), reply.sku(), reply.quantity())));
    }
}

Reading openCheckout() lets us follow three related writes inside one local transaction. We write the business row, we write the saga row, and we write the outbound command into an outbox table. That sequence keeps the coordinator out of a half-finished state. If the process stops after commit, the next command still exists in durable storage and can be published later by a relay. If the transaction rolls back, none of those rows survive in a mismatched state.
The reply handlers follow the same rule. They do not jump straight from a message listener to an external send. Instead, we record the new saga position and queue the next command inside the same local transaction. That keeps the coordinator’s state and its outbound intent moving in lockstep, which is exactly what we want when a business action may pause, resume, retry, or fail across service boundaries.
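Replies can be redelivered or arrive late, so a coordinator handler may want to confirm that a reply matches the step the saga is currently on before it advances. The article's handlers do not show such a check; the sketch below is one possible shape, with all names assumed for illustration.

```java
// Hypothetical guard: accept a reply only if the saga is still open and on the replied step.
final class ReplyGuard {

    enum Step { RESERVE_STOCK, AUTHORIZE_PAYMENT, START_SHIPMENT, RELEASE_STOCK, REFUND_PAYMENT }

    enum Status { WAITING, COMPENSATING, COMPLETED, FAILED }

    private ReplyGuard() {
    }

    static boolean accepts(Step currentStep, Status status, Step repliedStep) {
        // Finished sagas ignore every reply; open sagas ignore replies for other steps.
        boolean stillOpen = status == Status.WAITING || status == Status.COMPENSATING;
        return stillOpen && currentStep == repliedStep;
    }
}
```

A handler could call accepts() right after loading the saga row and return early on false, so a duplicate stock-reserved reply that arrives after the saga has moved to AUTHORIZE_PAYMENT does not queue a second payment command.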
Publishing outbox rows and watching sagas that have waited too long usually live beside the coordinator rather than inside request handling. Spring scheduling fits both jobs well. @Scheduled methods take no arguments. Boot can auto-configure a scheduler when scheduled execution is in play. With virtual threads enabled on Java 21 or newer, the auto-configured scheduler is a SimpleAsyncTaskScheduler. Without virtual threads, Boot uses a ThreadPoolTaskScheduler with one thread by default unless spring.task.scheduling.pool.size is raised. That default is worth remembering if the same scheduler is handling relay publication and timeout scanning, because a slow relay run delays the timeout scan queued behind it.
We can keep the relay small and still make it observable and retryable:
package com.example.checkout;

import java.time.Instant;
import java.util.List;

import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import org.springframework.context.annotation.Configuration;
import org.springframework.resilience.annotation.EnableResilientMethods;
import org.springframework.resilience.annotation.Retryable;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Configuration
@EnableScheduling
@EnableResilientMethods
class CheckoutFlowConfig {
}

@Component
class OutboxRelay {

    private final OutboxRecordRepository outboxRecordRepository;
    private final OutboxPublisher outboxPublisher;

    OutboxRelay(
            OutboxRecordRepository outboxRecordRepository,
            OutboxPublisher outboxPublisher) {
        this.outboxRecordRepository = outboxRecordRepository;
        this.outboxPublisher = outboxPublisher;
    }

    @Scheduled(fixedDelay = 500)
    public void publishPendingCommands() {
        List<OutboxRecord> batch =
                outboxRecordRepository.findTop50ByPublishedAtIsNullOrderByCreatedAtAsc();
        for (OutboxRecord record : batch) {
            outboxPublisher.publishSingle(record);
        }
    }
}

@Component
class OutboxPublisher {

    private final OutboxRecordRepository outboxRecordRepository;
    private final BrokerPublisher brokerPublisher;
    private final ObservationRegistry observationRegistry;

    OutboxPublisher(
            OutboxRecordRepository outboxRecordRepository,
            BrokerPublisher brokerPublisher,
            ObservationRegistry observationRegistry) {
        this.outboxRecordRepository = outboxRecordRepository;
        this.brokerPublisher = brokerPublisher;
        this.observationRegistry = observationRegistry;
    }

    @Retryable(maxRetries = 4, delay = 100, multiplier = 2, maxDelay = 1000)
    @Transactional
    public void publishSingle(OutboxRecord record) {
        Observation.createNotStarted("saga.outbox.publish", observationRegistry)
                .lowCardinalityKeyValue("destination", record.destination())
                .observe(() -> {
                    brokerPublisher.send(record.destination(), record.payload(), record.headers());
                    outboxRecordRepository.markPublished(record.id(), Instant.now());
                });
    }
}

Reading publishSingle() keeps publication narrow in scope. We are not reopening saga logic there. We take a pending outbox row, try to send it, and mark it as published only after the broker call succeeds. The scheduled relay calls a separate Spring bean, so @Retryable and @Transactional are reached through the proxy instead of being skipped by a same-class method call. In Spring Boot 4 with Spring Framework 7, @Retryable belongs to Spring Framework’s resilience support, and @EnableResilientMethods turns on annotation processing for @Retryable and @ConcurrencyLimit. On the observability side, Spring Boot builds on Micrometer Observation, and ObservationRegistry gives us the entry point for custom observations that become metrics and traces. That gives the coordinator better visibility when a broker is slow, a relay is retrying, or a destination is failing more than expected.
Timeout handling belongs beside the relay, not buried inside request code. If a saga has been waiting too long, a scheduled scanner can find expired rows and move them toward retry, failure, or compensation. Smaller flows can keep that logic in repository queries and service methods. Later on, if the transition rules become dense and repetitive, Spring Statemachine starts to earn its place because it gives us guards, actions, and a fuller state-machine vocabulary.
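The decision a timeout scanner makes for each expired row can be kept in a small policy object. This is a sketch under assumptions: the retry budget, the three outcomes, and the class name are illustrative, and a real scanner would load expired rows with a repository query before applying it.

```java
import java.time.Instant;

// Hypothetical policy: what a scheduled scanner does with one waiting saga row.
final class TimeoutPolicy {

    enum Decision { KEEP_WAITING, RETRY, COMPENSATE }

    private final int maxRetries;

    TimeoutPolicy(int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must not be negative");
        }
        this.maxRetries = maxRetries;
    }

    Decision decide(Instant waitUntil, int retriesSoFar, Instant now) {
        // The deadline has not passed yet, so the reply may still arrive.
        if (!now.isAfter(waitUntil)) {
            return Decision.KEEP_WAITING;
        }
        // Resend the pending command while the retry budget lasts, then change direction.
        return retriesSoFar < maxRetries ? Decision.RETRY : Decision.COMPENSATE;
    }
}
```

Keeping the decision separate from the scheduled method means the scanner only has to find expired rows, ask the policy, and write the resulting state change and outbox row in one local transaction.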
Participant Message Handling
Handlers inside participant services stay healthier when their responsibility remains narrow. At message arrival, we check if that message was already handled, we apply one local transaction to that service’s own data, and we record a reply for later publication. That sequence keeps ownership where it belongs. Inventory decides inventory, billing decides payment, and shipping decides shipment state. The participant does not try to reconstruct the full saga history because the coordinator already owns the broader view.
Let’s see how we can read that flow in a stock reservation handler:
package com.example.inventory;

import java.time.Instant;
import java.util.UUID;

import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
public class StockCommandHandler {

    private final ProcessedMessageRepository processedMessageRepository;
    private final StockLedgerRepository stockLedgerRepository;
    private final OutboxRecordRepository outboxRecordRepository;

    public StockCommandHandler(
            ProcessedMessageRepository processedMessageRepository,
            StockLedgerRepository stockLedgerRepository,
            OutboxRecordRepository outboxRecordRepository) {
        this.processedMessageRepository = processedMessageRepository;
        this.stockLedgerRepository = stockLedgerRepository;
        this.outboxRecordRepository = outboxRecordRepository;
    }

    @Transactional
    public void handle(ReserveStock command) {
        // Duplicate guard (assumes the message id is the ProcessedMessage entity id).
        if (processedMessageRepository.existsById(command.messageId())) {
            return;
        }
        // The unique key still covers a concurrent duplicate: its insert fails,
        // this transaction rolls back, and the redelivery is skipped by the check above.
        processedMessageRepository.saveAndFlush(
                new ProcessedMessage(command.messageId(), Instant.now()));
        boolean reserved = stockLedgerRepository.reserveIfAvailable(
                command.sku(),
                command.quantity(),
                command.orderId());
        if (reserved) {
            outboxRecordRepository.save(
                    OutboxRecord.event(
                            UUID.randomUUID(),
                            command.sagaId(),
                            command.orderId(),
                            "checkout.stock-reserved",
                            """
                            {"orderId":"%s","amount":"%s","sku":"%s","quantity":%d}
                            """.formatted(
                                    command.orderId(),
                                    command.amount(),
                                    command.sku(),
                                    command.quantity())));
        } else {
            outboxRecordRepository.save(
                    OutboxRecord.event(
                            UUID.randomUUID(),
                            command.sagaId(),
                            command.orderId(),
                            "checkout.stock-rejected",
                            """
                            {"orderId":"%s","sku":"%s","quantity":%d}
                            """.formatted(command.orderId(), command.sku(), command.quantity())));
        }
    }
}

When we read handle(), we can see the duplicate guard happen before inventory state is touched. The handler first checks whether the message id was already recorded and returns without touching stock if it was. Otherwise it inserts the ProcessedMessage row in the same transaction as the stock change. If two deliveries of the same message race, the unique key rejects the second insert and that transaction rolls back as a whole. Letting the exception propagate is deliberate. With JPA, catching a constraint violation inside the transactional method and returning normally typically leaves the transaction marked rollback-only, so the commit would still fail. Without a processed-message marker, the same reservation could be applied twice.
After the duplicate guard, we call a repository method named reserveIfAvailable() rather than loading stock into Java memory and writing it back later. That naming points to a deeper rule. Participant writes usually read best as conditional database updates so the database decides who won the race. Read-then-write logic in application memory opens the door to conflicting reservations under concurrent load.
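A conditional update of that kind can be sketched with plain JDBC. The table and column names (stock_level, available, reserved) are assumptions for illustration, and this version leaves out the per-order bookkeeping that the article's reserveIfAvailable() receives through its orderId argument.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical JDBC sketch of a conditional reservation; schema names are illustrative.
final class StockReservationSql {

    // The WHERE clause makes the database check availability and decrement in one statement.
    static final String RESERVE_IF_AVAILABLE =
            "UPDATE stock_level "
                    + "SET available = available - ?, reserved = reserved + ? "
                    + "WHERE sku = ? AND available >= ?";

    private StockReservationSql() {
    }

    // Returns true when exactly one row was updated, meaning enough stock was available.
    static boolean reserveIfAvailable(Connection connection, String sku, int quantity)
            throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(RESERVE_IF_AVAILABLE)) {
            statement.setInt(1, quantity);
            statement.setInt(2, quantity);
            statement.setString(3, sku);
            statement.setInt(4, quantity);
            return statement.executeUpdate() == 1;
        }
    }
}
```

Two concurrent reservations for the last unit both run the same statement, but only one of them finds a row that still satisfies available >= ?, so the other sees an update count of zero and reports a rejection.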
Reply publication follows the same durability rule as the coordinator. We do not update stock and then try to publish the reply as an unrelated side effect. We write the reply into an outbox row inside the same local transaction. If the transaction commits, the data change and the reply intent survive as a pair. If the process stops after that commit, the relay can still publish the waiting reply later. That is what lets the coordinator resume the saga without depending on a fragile best-effort send inside the participant transaction.
Compensation commands follow the same structure as forward commands, which keeps participant handlers manageable as the saga grows. Stock release commands still check deduplication, still apply one local transaction, and still write a reply into the outbox. Payment voids or refunds follow the same local rule inside billing. Shipment cancellation follows the same local rule inside shipping. What changes is the business meaning of the command, not the discipline around how that command is handled. Keeping forward and reverse handlers mechanically similar makes failures less surprising and gives support staff a more useful trail when they have to trace what happened to a saga id.
Tracing becomes very practical at this stage. Saga ids should travel in message headers, logs, and trace attributes so we can follow the same business action across coordinator, relay, and participants without relying on timestamp guesswork. Spring Boot Actuator provides auto-configuration for Micrometer Tracing, and Boot’s observability support uses Micrometer Observation for metrics and traces. With that in place, a slow participant, a repeating redelivery, or a failing compensation command becomes far quicker to trace from one service boundary to the next.
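Carrying the saga id consistently is easier when header names live in one place. The helper below is a sketch: the header names and the log prefix format are assumptions, not a convention defined by Spring or by any broker.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper that keeps saga correlation headers consistent across services.
final class SagaHeaders {

    static final String SAGA_ID = "saga-id";
    static final String ORDER_ID = "order-id";
    static final String MESSAGE_ID = "message-id";

    private SagaHeaders() {
    }

    // Builds the headers every command and reply carries, in a stable order.
    static Map<String, String> forMessage(UUID sagaId, UUID orderId, UUID messageId) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put(SAGA_ID, sagaId.toString());
        headers.put(ORDER_ID, orderId.toString());
        headers.put(MESSAGE_ID, messageId.toString());
        return headers;
    }

    // Short prefix for log lines so one saga can be followed across services.
    static String logPrefix(Map<String, String> headers) {
        return "[saga=" + headers.getOrDefault(SAGA_ID, "unknown") + "]";
    }
}
```

The same saga-id value can also be attached as a key-value pair on an observation, so logs and traces point at the same business action.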
Conclusion
Saga coordination keeps a multi-service write coherent by breaking the full business action into local transactions, recording progress after every committed step, and pushing the next command only after that state is stored. From there, reply handling, timeout checks, and compensation all follow the same recorded flow, which lets the system recover from partial failure without relying on a shared database transaction. In Spring Boot, that usually comes down to a coordinator that tracks state, an outbox that carries durable message intent, and participant services that change only their own data while reporting the outcome back into the saga.


