MassTransit with Azure Service Bus - Error management

When working with distributed messaging systems, failures are inevitable. Whether it's transient network issues, processing errors, or service unavailability, handling failures gracefully is crucial.

In this post we explore how we can deal with messages that can't be processed by MassTransit endpoints, and how to recover messages that end up in error queues or dead-letter queues (DLQs).

Handling failures in MassTransit 🔗

MassTransit allows us to configure different policies for handling transient failures (a configuration sketch follows this list):

  • Retries: Automatically retry a failing message, with several retry policies to choose from, such as immediate, interval-based, or exponential backoff.
  • Circuit Breakers: Prevent overwhelming a failing service with continuous retries.
  • Redelivery: Schedule the message to be delivered again much later, once the immediate retry attempts are exhausted.

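Here is a minimal sketch of how these policies can be wired up on an Azure Service Bus receive endpoint. The endpoint name, consumer, connection string, and the specific thresholds and intervals are hypothetical placeholders, not recommendations:

```csharp
using MassTransit;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

builder.Services.AddMassTransit(x =>
{
    x.AddConsumer<SubmitOrderConsumer>();

    x.UsingAzureServiceBus((context, cfg) =>
    {
        cfg.Host("<your-service-bus-connection-string>");

        cfg.ReceiveEndpoint("submit-order", e =>
        {
            // Redelivery: once the in-memory retries below are exhausted,
            // schedule the message to be delivered again much later
            e.UseDelayedRedelivery(r => r.Intervals(
                TimeSpan.FromMinutes(5),
                TimeSpan.FromMinutes(15),
                TimeSpan.FromMinutes(30)));

            // Circuit breaker: stop hammering a consumer that keeps failing
            e.UseCircuitBreaker(cb =>
            {
                cb.TrackingPeriod = TimeSpan.FromMinutes(1);
                cb.TripThreshold = 15;   // percentage of failed messages
                cb.ActiveThreshold = 10; // minimum messages before it can trip
                cb.ResetInterval = TimeSpan.FromMinutes(5);
            });

            // Retries: five attempts with exponential backoff
            e.UseMessageRetry(r => r.Exponential(
                5,
                TimeSpan.FromSeconds(1),
                TimeSpan.FromSeconds(30),
                TimeSpan.FromSeconds(5)));

            e.ConfigureConsumer<SubmitOrderConsumer>(context);
        });
    });
});

await builder.Build().RunAsync();

// Placeholder message and consumer; the business logic is whatever may throw
public record SubmitOrder(Guid OrderId);

public class SubmitOrderConsumer : IConsumer<SubmitOrder>
{
    public Task Consume(ConsumeContext<SubmitOrder> context) => Task.CompletedTask;
}
```

Delayed redelivery is configured before the message retry filter, so the short in-memory retries are exhausted first and only then is a much later redelivery scheduled.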
These mechanisms are the first step in ensuring resilience against transient failures and preventing message loss. However, they can't guarantee that a message will be successfully processed before the retry attempts are exhausted.

If that happens, the messages are moved to either an error queue or a dead-letter queue. That's their last stop: they go dormant until we wake them up and reprocess them. By convention, MassTransit will create <queuename>_error for any queue whose messages couldn't be processed successfully, unless explicitly told otherwise.
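If you do want to tell it otherwise, one option is to opt out of the error queue entirely and discard faulted messages on the receive endpoint. Here is a minimal sketch, reusing the hypothetical submit-order endpoint from above; only consider this for messages you can truly afford to lose:

```csharp
cfg.ReceiveEndpoint("submit-order", e =>
{
    // By convention, faulted messages would be moved to "submit-order_error".
    // Discarding them instead means they are gone once the retries fail.
    e.DiscardFaultedMessages();

    e.ConfigureConsumer<SubmitOrderConsumer>(context);
});
```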

Recovering failed messages 🔗

All messages that end up in error queues should be considered valuable to the business. They can end up there for various reasons: the database is down, there is a network glitch, the consumer server is slow, etc. Let's say that the consumer endpoint was down and is now up and running again, and we are ready to replay all the failed messages. At this point, nothing will automagically restart the processing flow from where it failed.

We would have to dig through the failed messages and restart the flow. If you have one or a few messages, the process can be painless enough, but what if you have hundreds? Or thousands? What then?

Luckily, the recovery process can be done with the help of Service Bus Explorer or by setting up a dedicated consumer, as sketched below.
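To illustrate the dedicated-consumer approach, here is a rough sketch that drains a hypothetical submit-order_error queue and re-sends each message to its original input queue using the Azure.Messaging.ServiceBus SDK directly; the connection string and queue names are placeholders:

```csharp
using Azure.Messaging.ServiceBus;

// Placeholder connection string and queue names, for illustration only
await using var client = new ServiceBusClient("<your-service-bus-connection-string>");

ServiceBusReceiver receiver = client.CreateReceiver("submit-order_error");
ServiceBusSender sender = client.CreateSender("submit-order");

while (true)
{
    // Pull a batch of failed messages from the error queue
    var failedMessages = await receiver.ReceiveMessagesAsync(
        maxMessages: 50, maxWaitTime: TimeSpan.FromSeconds(5));

    if (failedMessages.Count == 0)
        break;

    foreach (var failed in failedMessages)
    {
        // Copy the message (body and application properties) back to the input queue
        await sender.SendMessageAsync(new ServiceBusMessage(failed));

        // Remove it from the error queue only after the re-send succeeded
        await receiver.CompleteMessageAsync(failed);
    }
}
```

It does the job, but it is code you have to write, run, and babysit every time something goes wrong.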

Although this works, the capabilities are basic. Service Bus Explorer forces you to click endlessly through a not-so-simple interface, and either way you have to think in engineering terms.

More than that, it is almost impossible to give better visibility over the system to someone not involved in the development process, or to let anyone else help correct the errors. Let's face it: the Azure Portal is not the friendliest place to be. I am sometimes scared of clicking the wrong button.

That aside, reprocessing a large number of messages from the dead-letter queue (DLQ) or an error queue can be challenging for anyone, especially when there is no way to group them efficiently. Each message must be retrieved and inspected manually, without any categorization or sorting capabilities.

This lack of features increases operational overhead and slows recovery efforts, making it difficult to restore normal processing quickly. We do want visibility and to avoid tedious work, right?

What if I told you there is a platform that can be easily integrated with MassTransit endpoints that use Azure Service Bus, making message reprocessing effortless?

Recovering failed messages the easy way 🔗

MassTransit Error Management is a platform that can be easily set up and removed, runs in Docker, and knows how to talk to error queues and DLQs.

The Particular team has put together an interesting, very well documented demo that works with Azure Service Bus and RabbitMQ. Feel free to have a look.

Now, let's explore the platform, named ServicePulse. It integrates with MassTransit endpoints and allows you to specify which queues you want to monitor. As soon as it detects failed messages in those queues, it pulls them into a centralized dashboard where you can easily group, sort, retry, edit, and even delete the failed messages.

So let's see what it brings to the table:

Visualizing all messages 🔗

The platform exposes different views for failed messages: Groups, All, Deleted, and All Deleted, all in a very friendly interface.

It gives you a bird's-eye view of:

  • Type of the message that failed
  • Time of the failure
  • Endpoint where the message processing failed
  • Machine that processed the message
  • Exception that caused the failure

From here you can select the messages you want and retry or delete them.

Grouping and sorting messages 🔗

Grouping and sorting are important features that can really save you time when you have many types of failed messages and want to skim through them fast. If you know that a Shipping consumer failed to process its messages, you can retry the entire group. In the image below you can see 3 distinct groups.

Visualizing and editing individual messages 🔗

Each message can also be viewed individually. We can see the Stacktrace, Headers, and Body of the message, with all the details.

The message can be retried as it is, or we can use Edit & Retry to apply changes to the body or headers first.

Retrying groups of messages 🔗

Retrying groups of messages when you don't want to edit anything is very straightforward. Below you can see two groups of messages:

  • the first one was already retried and had 2 messages
  • the second one is queued for retry and has 3 messages

Deleting the failed messages 🔗

Although all messages are valuable, since they contain business data, there may be cases where you want to discard some of them.

Normally, if anyone deletes a message by mistake, the message is gone forever, without any backup. You would have no way of recovering it.

The Particular platform, however, keeps deleted messages for 15 days, just in case you change your mind and want to restore them, or in case some were deleted by mistake.

Whether we like it or not, in distributed systems errors are inevitable, and we shouldn't be scared of them. What we should do instead is adopt the strategies and tools that allow us to move fast and recover from them.

I've seen teams struggling to recover messages because they didn't have the right permissions or weren't used to the Azure Portal.

Sometimes not using the right tools can lead to delays, inconsistencies, and additional complexity in handling error messages effectively.

Summary 🔗

Recoverability is a quality attribute that is not easy to achieve and is often overlooked, but the MassTransit error management platform helps with this.

Instead of manually handling each message one by one, this solution provides intelligent grouping and batch processing, significantly reducing operational overhead. With built-in visibility, filtering, and retry controls, you can quickly identify failed messages, apply corrective actions, and ensure seamless recovery, all without disrupting your existing system. Plus, it's user-friendly!