When Business Processes Outlive Your Request


A request has a lifecycle measured in milliseconds. A business process has one measured in minutes, hours, or days - and the gap between those two timelines is where most distributed systems go wrong.

The Millisecond and the Multi-Day Process

Think about what happens when a customer places an order on an e-commerce platform. From the user's perspective, they click "Buy" and wait a few seconds for a confirmation. But behind the scenes, that single click might kick off a workflow that takes days to fully complete: payment authorisation, inventory reservation, warehouse picking, carrier handoff, customs clearance if it's international, and eventually delivery confirmation.

That workflow doesn't live inside a single HTTP request. It can't - not because of any technical limitation on our end, but because the world it's coordinating with operates on a different clock. The payment provider has its own settlement windows. The warehouse runs on shifts. The carrier updates its tracking on its own schedule. And if any of those steps fail mid-flight, the system needs to know what to do, even though the original HTTP connection closed three days ago.

This isn't unique to e-commerce. Loan approval workflows sit in review queues for hours before an underwriter picks them up. User onboarding flows wait for email verification and background checks before granting access. Subscription billing retries over days when a card declines. The moment a business process outlives its triggering request, you have a different class of problem - and request-response thinking is the wrong tool for it.

Why Request-Response Thinking Breaks Down

The request-response model is beautiful in its simplicity. A client sends a request, the server does work, the server responds. The connection is open the whole time; if something fails, you get an exception; if everything succeeds, you get data back. It's easy to reason about because one thing happens at a time, in one place, under one transaction.

sequenceDiagram
    participant Client
    participant Server

    Client->>Server: Request
    Server-->>Client: Response

Distribute that workflow across multiple services, though, and the guarantees evaporate fast. Consider an order fulfillment flow that calls three services: one charges the card, one reserves inventory, and one sends a confirmation email. If the inventory service fails after the card charge goes through, what state are you in? The database transaction that might have protected you in a monolith doesn't span service boundaries. Each service has already committed its own local transaction. You can't "roll back" a charge that's already been authorised. You can't "un-send" an email.

sequenceDiagram
    participant Client
    participant OrderService as Order Service
    participant PaymentService as Payment Service
    participant InventoryService as Inventory Service
    participant EmailService as Email Service

    Client->>OrderService: POST /orders
    activate OrderService
    OrderService->>PaymentService: Charge card
    PaymentService-->>OrderService: ✓ Authorised
    OrderService->>InventoryService: Reserve stock
    InventoryService-->>OrderService: ✗ Out of stock
    Note over PaymentService: Card already charged
    Note over EmailService: Never reached
    OrderService-->>Client: 500 Error
    deactivate OrderService

What you get instead is partial completion - a system where some steps succeeded, some failed, and the overall state is inconsistent in ways that are not immediately obvious. And this isn't a corner case you can architect away. Services fail mid-flight. Networks drop connections. Deployments happen. The question isn't whether partial completion will occur; it's whether your system knows how to recover when it does.

There's also the state drift problem. In a long-running workflow, every service involved maintains its own view of the process state. If those views diverge - because a message was lost, a retry was missed, or a timeout fired at the wrong moment - you end up with services that disagree about where the workflow is. The payment service thinks the order is complete. The inventory service thinks it was cancelled. Nobody is wrong according to their own data. The system is just inconsistent.

Sagas: A Fast Recap

The Saga pattern was formalised by Hector Garcia-Molina and Kenneth Salem back in 1987, originally in the context of long-lived database transactions. The core idea is simple: instead of one large distributed transaction that would require coordination across all participants (which doesn't scale and is rarely practical), you break the workflow into a sequence of smaller local transactions, each of which commits independently. And for each of those steps, you define a compensating transaction - a business action that semantically reverses what that step did, should something go wrong later.

If step 4 fails in a five-step workflow, you run the compensating transactions for steps 3, 2, and 1 in reverse order. You end up in a state that's consistent again - not identical to where you started (the compensation itself is a new business event), but coherent.
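To make the reverse-order rule concrete, here is a minimal sketch in Python. The names `SagaStep`, `run_saga`, and `build_demo` are my own illustrations, not part of any framework, and the steps only write to an in-memory log instead of committing real local transactions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # local transaction; raises on failure
    compensation: Callable[[], None]  # business action that semantically reverses it


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step never committed, so only earlier steps are compensated.
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


def build_demo(log: List[str], fail_at: int) -> List[SagaStep]:
    """Five hypothetical steps that record what they do; step `fail_at` raises."""
    def make(i: int) -> SagaStep:
        def action() -> None:
            if i == fail_at:
                raise RuntimeError(f"step {i} failed")
            log.append(f"do {i}")

        def compensation() -> None:
            log.append(f"undo {i}")

        return SagaStep(f"step-{i}", action, compensation)

    return [make(i) for i in range(1, 6)]
```

With `fail_at=4`, the log ends up as do 1, do 2, do 3, undo 3, undo 2, undo 1. This sketch assumes compensations always succeed, which a real system cannot assume.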

That's the pattern in one paragraph. Two things to hold onto: the sequence of local commits, and the compensations that undo them. Everything else - orchestration vs choreography, timeout handling, observability - is about how you implement that idea well in production.

Designing the Saga: Orchestration vs Choreography

There are two ways to coordinate a saga, and choosing the wrong one for your context creates pain that's hard to unwind.

Orchestration means there's a central coordinator - a saga state machine or process manager - that drives each step. It sends commands to services, waits for responses, tracks the overall state, and decides what happens next. The workflow logic lives in one place. If you need to know where a given order is in the fulfillment process, you ask the coordinator. This is intuitive. It maps well onto how humans think about workflows, and it makes complex compensations straightforward to implement because the coordinator has the full context.

Orchestration - workflow logic lives in the coordinator:

flowchart TD
    C([Coordinator])
    C -->|1. ChargeCard| P[Payment]
    C -->|2. ReserveStock| I[Inventory]
    C -->|3. CreateShipment| S[Shipping]

Choreography means there's no central coordinator. Each service reacts to events published by previous steps. Service A completes its work and publishes OrderValidated. Service B subscribes to that event, does its part, and publishes StockReserved. Service C picks that up and publishes PaymentProcessed. The workflow emerges from the interactions - no single service orchestrates it. This scales better, removes a central point of coupling, and lets each team own their domain completely.

Choreography - workflow emerges from events, no coordinator:

sequenceDiagram
    participant P as Payment
    participant I as Inventory
    participant S as Shipping

    Note over P,S: No coordinator — services react to events
    P-)I: OrderPaid
    I-)S: StockReserved
    S-)P: ShipmentCreated
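Here is a rough sketch of choreography in Python, assuming an in-memory `EventBus` that stands in for a real message broker. The bus, the handler names, and the event payload shape are all hypothetical. The point is that no handler knows about any service other than itself.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """Hypothetical in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.published: List[str] = []  # event names, in publish order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, event: Event) -> None:
        self.published.append(event_type)
        for handler in list(self._handlers[event_type]):
            handler(event)


def wire_services(bus: EventBus) -> None:
    # Inventory knows only about OrderPaid; it has no idea Shipping exists.
    def on_order_paid(event: Event) -> None:
        bus.publish("StockReserved", {"order_id": event["order_id"]})

    # Shipping knows only about StockReserved.
    def on_stock_reserved(event: Event) -> None:
        bus.publish("ShipmentCreated", {"order_id": event["order_id"]})

    bus.subscribe("OrderPaid", on_order_paid)
    bus.subscribe("StockReserved", on_stock_reserved)
```

Publishing a single `OrderPaid` event drives the whole chain. Notice that the only place the overall sequence is visible is `bus.published`, which is exactly why choreographed workflows lean so heavily on observability.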

So how do you choose? I look at a few things. If the workflow has complex branching logic - "if the payment fails, route to retry; if the retry budget is exhausted, escalate to manual review; if manual review times out, cancel and notify" - that logic has to live somewhere. In choreography, it ends up scattered across services that need to know about each other's states, which tends to create hidden coupling. Orchestration keeps it in one place. On the other hand, if the steps are genuinely independent domain reactions - each service truly just needs to know "this thing happened, now I do my part" - choreography gives you better isolation and resilience.

Where I see teams go wrong is mixing both styles within a single workflow. An orchestrator that also listens to events from services it's supposed to be coordinating creates a tangle. Pick one model per workflow, and be deliberate about it. Choreography is not "orchestration without writing a coordinator" - it's a different design choice with different trade-offs.

Compensating Transactions Done Right

Here's the thing that trips up almost every team the first time they design a saga: compensation is not rollback. They look similar on the surface - something went wrong, we go back to where we were - but they're fundamentally different operations.

Rollback is a database primitive. When a transaction fails, the database engine atomically reverses every change as if the transaction never happened. It's invisible to the business. It cannot fail. It's symmetric - the before and after states are identical.

Compensation is a business action. Issuing a refund is not "un-charging" a card - it's a new transaction that creates a new fact in the payment ledger. Cancelling a reservation is not "un-reserving" inventory - it's releasing stock back to the available pool, which may then be picked up by another order. Sending a cancellation email is not "un-sending" the confirmation - it's a new message with a new meaning. Compensation can fail. It can be partial. It's visible to the business and to the customer.

This matters because if you design compensations as if they were rollbacks, you'll miss important cases. The most important question to ask when designing a compensating transaction is: "What if this compensation fails?" Most teams never ask it. A failed compensation is its own error case that needs its own handling - possibly a manual intervention workflow, an alert to an operations team, or a retry with a longer delay.
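One possible answer to that question, sketched in Python: retry the compensation a bounded number of times, then park it for a human. The function name `compensate_with_escalation` and the plain list used as a manual-intervention queue are assumptions for illustration only.

```python
from typing import Callable, List


def compensate_with_escalation(
    name: str,
    compensation: Callable[[], None],
    max_attempts: int,
    manual_queue: List[str],
) -> bool:
    """Try a compensation a bounded number of times; escalate if it keeps failing."""
    for _ in range(max_attempts):
        try:
            compensation()
            return True
        except Exception:
            continue  # a real system would wait between attempts
    # The compensation itself failed: park it for a human instead of dropping it.
    manual_queue.append(name)
    return False
```

The important design choice is that a failed compensation never disappears silently. It ends up somewhere an operations team will see it.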

Design compensations upfront, not as afterthoughts. For every step in your saga, before you write the forward path, write the compensating action. Name it explicitly: RefundPayment, ReleaseReservation, SendCancellationNotification. Some steps won't have a neat compensation - sending an SMS is very hard to undo. Model those as non-compensatable steps and think carefully about where in the saga sequence they appear.

Every compensating transaction must also be idempotent. When your saga state machine executes a compensation, it might execute it more than once - due to a retry after a timeout, or a network blip that prevented the acknowledgment from reaching the coordinator. Executing the same compensation twice must produce the same result as executing it once. The refund must not be issued twice. The reservation must not be released twice and then fail on a third attempt. Design for this from the start, not after you find the bug in production.
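As an illustration, here is a small Python sketch of an idempotent ReleaseReservation. The `Inventory` class and its in-memory dictionary are hypothetical. A real service would persist reservations and would likely also remember released IDs.

```python
from typing import Dict


class Inventory:
    """Hypothetical stock pool with an idempotent release operation."""

    def __init__(self, available: int) -> None:
        self.available = available
        self._reservations: Dict[str, int] = {}  # reservation id -> quantity held

    def reserve(self, reservation_id: str, quantity: int) -> None:
        if reservation_id in self._reservations:
            return  # duplicate command: already reserved
        if quantity > self.available:
            raise ValueError("out of stock")
        self.available -= quantity
        self._reservations[reservation_id] = quantity

    def release(self, reservation_id: str) -> None:
        # pop() with a default makes a repeated or unknown release a no-op,
        # so stock is returned to the pool at most once.
        quantity = self._reservations.pop(reservation_id, 0)
        self.available += quantity
```

Calling `release` three times for the same reservation returns the stock once and never raises, which is the behaviour the coordinator needs when it retries.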

Timeouts, Retries, and the Messy Middle

Long-running workflows spend a lot of time in a state I call the messy middle: you've sent a command to a service, and you don't know what happened to it.

The message might have been delivered and processed successfully but the acknowledgment was lost in transit. The message might never have arrived at all. The service might have received it, started processing, crashed halfway through, and is now recovering. The service might have processed it and responded, but your network blipped and you missed the response. From the saga coordinator's perspective, all of these look identical: silence.

sequenceDiagram
    participant C as Coordinator
    participant S as Service

    Note over C,S: Scenario A — message never arrived
    C--xS: Command
    C->>C: ⏱ Timeout

    Note over C,S: Scenario B — service crashed mid-processing
    C->>S: Command
    Note over S: Crash
    C->>C: ⏱ Timeout

    Note over C,S: Scenario C — acknowledgment lost
    C->>S: Command
    Note over S: ✓ Processed
    S--xC: Ack
    C->>C: ⏱ Timeout

    Note over C: All three look identical to the coordinator

This is the fundamental challenge that makes distributed systems hard. The timeout fires, and you need to make a decision: did that step complete or not?

The answer is almost always: assume it might have completed, and design your retry as if it did. Send the same command again with the same idempotency key. A well-designed service will recognise the duplicate and return the result of the first execution without re-processing. A poorly designed service will execute the operation twice. This is why idempotency isn't optional in saga steps - it's the mechanism that makes the messy middle safe to navigate.
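Here is a minimal Python sketch of the lost-acknowledgment case. `PaymentService` is a hypothetical stand-in that does the work and then simulates a dropped ack by raising `TimeoutError`. `send_with_retry` is an assumed coordinator helper that resends with the same key.

```python
from typing import Dict


class PaymentService:
    """Hypothetical service that deduplicates commands by idempotency key."""

    def __init__(self, drop_first_ack: bool) -> None:
        self._results: Dict[str, str] = {}
        self._drop_ack = drop_first_ack
        self.charges_executed = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        if idempotency_key not in self._results:
            self.charges_executed += 1  # the side effect happens exactly once
            self._results[idempotency_key] = f"charged:{amount_cents}"
        if self._drop_ack:
            self._drop_ack = False
            raise TimeoutError("ack lost")  # work was done, caller cannot tell
        return self._results[idempotency_key]


def send_with_retry(service: PaymentService, key: str, amount_cents: int,
                    max_attempts: int = 3) -> str:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return service.charge(key, amount_cents)  # same key on every attempt
        except TimeoutError:
            if attempt == max_attempts:
                raise
    raise AssertionError("unreachable")
```

The coordinator sees a timeout on the first attempt and a normal result on the second, while the card is charged only once.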

For retry policies within sagas, I think in terms of three categories. Transient failures - network blips, brief overloads - should be retried quickly, with exponential backoff and jitter to avoid thundering-herd effects. Uncertain states - where you timed out and don't know the outcome - should be retried with idempotency keys. Permanent failures - a 400 Bad Request, a business rule violation, an external service that's permanently shut down - should not be retried at all. Retrying a permanent failure is just executing the same bug repeatedly.
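A rough Python sketch of those three categories follows. The mapping from HTTP status codes to categories is my own simplification and would need tuning for any real service. The delay function uses the "full jitter" variant of exponential backoff.

```python
import random
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"
    UNCERTAIN = "uncertain"
    PERMANENT = "permanent"


def classify(status_code: Optional[int]) -> FailureKind:
    """Assumed mapping: None means the call timed out with no response."""
    if status_code is None:          # timed out: outcome unknown
        return FailureKind.UNCERTAIN
    if status_code in (429, 503):    # overloaded or briefly unavailable
        return FailureKind.TRANSIENT
    return FailureKind.PERMANENT     # e.g. 400 Bad Request: retrying repeats the bug


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 30.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Only `TRANSIENT` and `UNCERTAIN` results should ever reach `backoff_delay`. A `PERMANENT` result goes straight to compensation or manual review.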

Timeout policies should be defined per step, based on that step's expected latency SLA. Don't use a single global timeout for everything - a warehouse reservation might legitimately take 30 seconds while a payment authorisation should complete in under 3 seconds. When a timeout fires in a saga, you have three options: retry the step, escalate to manual review, or start the compensation sequence. Which one you choose depends on the business rules - and those rules should be explicit in your saga design, not emergent from whatever happens to work in tests.
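One way to make those rules explicit is a per-step policy table. The Python sketch below is illustrative only: the step names, the retry budgets, and the choice of fallback for each step are assumptions, and only the 3 and 30 second figures come from the example above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class OnTimeout(Enum):
    RETRY = "retry"
    MANUAL_REVIEW = "manual_review"
    COMPENSATE = "compensate"


@dataclass(frozen=True)
class StepPolicy:
    timeout_seconds: float
    max_retries: int
    when_exhausted: OnTimeout  # what to do once the retry budget is spent


# Hypothetical values; real numbers come from each step's latency SLA.
POLICIES: Dict[str, StepPolicy] = {
    "AuthorisePayment": StepPolicy(3.0, 2, OnTimeout.COMPENSATE),
    "ReserveStock": StepPolicy(30.0, 1, OnTimeout.MANUAL_REVIEW),
}


def on_timeout(step: str, retries_so_far: int) -> OnTimeout:
    """Decide what the saga does when `step` times out."""
    policy = POLICIES[step]
    if retries_so_far < policy.max_retries:
        return OnTimeout.RETRY
    return policy.when_exhausted
```

Keeping the table as data means the business can review it, and a change to a timeout rule is a one-line diff instead of a hunt through handler code.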

Observability: Seeing Inside the Black Box

A long-running workflow that spans multiple services and days of calendar time is, without the right observability, a black box. You know when it started, and you know when it ended. Everything in between is inference.

The most important thing you can do is establish a correlation ID from the very first event, and propagate it through every message, every log line, every trace span for the entire lifetime of the workflow. The correlation ID ties together every bit of activity - the payment charge, the inventory update, the shipping label generation - into a single coherent narrative. Without it, debugging a workflow failure means manually correlating timestamps across five services' logs, which is miserable and error-prone.
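A minimal sketch of that propagation in Python, using the standard `contextvars` and `logging` modules. The header name `X-Correlation-ID` and the helper names are assumptions. Use whatever your messaging stack already provides if it has a convention for this.

```python
import contextvars
import logging
import uuid

# Holds the ID of the workflow currently being handled; "-" means none set.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current workflow's correlation ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def start_workflow() -> str:
    """Mint the ID at the first event; later hops would read it from headers."""
    cid = str(uuid.uuid4())
    correlation_id.set(cid)
    return cid


def outgoing_headers() -> dict:
    # Hypothetical header name; every outgoing message carries the same ID.
    return {"X-Correlation-ID": correlation_id.get()}
```

Attach `CorrelationFilter` to a handler and add `%(correlation_id)s` to its format string, and every log line from that point on can be grouped by workflow.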

Beyond correlation, you need saga state snapshots - a persistent record of what state the saga is in at each point in time. Every time the saga transitions - a step completes, a timeout fires, a compensation begins - record that transition with a timestamp. This gives you a queryable audit log: "where is order 12345 in the workflow right now?" and "what was the sequence of events that led to this saga being stuck in the compensation phase?" A good saga state machine will handle this automatically; if you're rolling your own, it's the first thing to build.
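If you are rolling your own, the transition log can start as simply as the Python sketch below. `SagaLog` and `Transition` are hypothetical names, and the in-memory list is a placeholder for a durable, append-only table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Transition:
    saga_id: str
    from_state: Optional[str]
    to_state: str
    reason: str
    at: datetime


@dataclass
class SagaLog:
    """Hypothetical append-only store; a real one would live in a database."""
    transitions: List[Transition] = field(default_factory=list)

    def record(self, saga_id: str, to_state: str, reason: str) -> None:
        previous = self.current_state(saga_id)
        self.transitions.append(
            Transition(saga_id, previous, to_state, reason, datetime.now(timezone.utc))
        )

    def current_state(self, saga_id: str) -> Optional[str]:
        # "Where is this order right now?" is the most recent transition.
        for t in reversed(self.transitions):
            if t.saga_id == saga_id:
                return t.to_state
        return None

    def history(self, saga_id: str) -> List[str]:
        # "How did it get here?" is every transition, oldest first.
        return [t.to_state for t in self.transitions if t.saga_id == saga_id]
```

Because entries are only ever appended, the same structure answers both the "where is it now" and the "how did it get stuck" questions.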

Think about what you need to monitor operationally:

  • How many saga instances are currently in-flight, by workflow type and by state?
  • What's the p99 completion time for each workflow type, and how has it changed over time?
  • How many sagas are currently stuck - in a state they haven't left for longer than expected?
  • What's the rate of compensations being triggered, and which saga steps are causing them most often?

That last metric is particularly valuable. If you start seeing a spike in compensations triggered from a specific step, that step is degrading. You can catch this before it turns into a customer-visible incident.
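Two of those questions can be sketched in a few lines of Python, assuming you can load a snapshot per saga instance. `SagaSnapshot`, `stuck_sagas`, and `compensations_by_step` are illustrative names, and the per-state dwell limits are something you would have to define for your own workflows.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class SagaSnapshot:
    saga_id: str
    state: str
    entered_state_at: datetime
    compensated_from_step: Optional[str] = None  # set once compensation starts


def stuck_sagas(snapshots: Iterable[SagaSnapshot], now: datetime,
                max_dwell: Dict[str, timedelta]) -> List[str]:
    """IDs of sagas that have sat in a state longer than that state's limit."""
    stuck = []
    for s in snapshots:
        limit = max_dwell.get(s.state)  # states without a limit are never stuck
        if limit is not None and now - s.entered_state_at > limit:
            stuck.append(s.saga_id)
    return stuck


def compensations_by_step(snapshots: Iterable[SagaSnapshot]) -> Counter:
    """How many compensations each step has triggered."""
    return Counter(
        s.compensated_from_step for s in snapshots
        if s.compensated_from_step is not None
    )
```

In production these would be queries against the saga store feeding a dashboard or an alert, but the logic stays the same.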

Five Things Worth Remembering

  1. Model the unhappy paths first. The happy path is easy - services cooperate, everything succeeds. The value of saga design is in making failure explicit: defining what "undo step 3" means as a named business operation before you write step 3.

  2. Compensation is not rollback. It's a new forward action with its own failure modes. Ask "what if this compensation fails?" for every step, and have an answer.

  3. Choose orchestration or choreography deliberately. Both are valid. Orchestration wins for complex branching and human-in-the-loop steps. Choreography wins for genuinely independent domain reactions. Mixing both within a single workflow is a sign you haven't committed to a model.

  4. Every saga step must be idempotent. The messy middle is unavoidable in distributed systems. The mechanism that makes retries safe is idempotency, and it needs to be designed in from the start, not retrofitted.

  5. Observability is not optional. A workflow you can't see is a workflow you can't operate. Correlation IDs, state snapshots, and operational metrics are the minimum.

Whatever your stack - MassTransit, NServiceBus, Temporal, AWS Step Functions, or something you've built yourself - these patterns apply. The frameworks help with the plumbing, but the design thinking is on you.


If any of this resonates with problems you're currently dealing with, or if you disagree with any of my choices here, reach out. I'd genuinely like to hear how your team has approached it.

PS: Let me know if I forgot anything.
