We all know the joke about distributed systems, right?
(btw you should go follow Mathias if you don’t already)
> There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery
>
> — Mathias Verraes (@mathiasverraes) August 14, 2015
Exactly-once delivery defined
First of all:
The exactly-once delivery guarantee is the guarantee that a message is delivered to a recipient once, and only once. While having a message delivered to a recipient only once is the norm in practice, it is impossible to guarantee it.
Proof by contraposition and the two generals problem
The similarity between the two generals problem and exactly-once delivery is quite striking. We again have two parties trying to communicate some intent, with the potential for messages to be lost. Where the two generals had to agree on a time to attack, here the two processes have to agree that the second has successfully received the message.
Let’s assume that a protocol exists which guarantees that a recipient receives a message from the sender once and only once. Such a protocol could then solve the two generals problem! Representing the time of the attack as the message, the first general (the sender) would only need to follow the protocol for the second general (the recipient) to receive the attack time exactly once. However, since we know the two generals problem is unsolvable, we also know that exactly-once delivery is not possible.
In case it’s easier to conceptualize, I’ll try to prove the impossibility of exactly-once using a direct proof. For this, let’s assume:
- The sender and recipient are operating in the real world, meaning non-zero transport and processing times (and no strict consistency 2)
- The sender and recipient do not have access to each other’s internal state. This means that the recipient is not aware of the intent to send a message unless told by the sender
- Transport of messages is unreliable
- The recipient has to do some processing on the message (this means the message cannot simply be dropped or lost)
- The operation is time-bound: processing has to complete within a non-infinite period of time
Points 1 & 2 from above discount the existence of a trivial solution which would rely on inspection of the internal state of the recipient.
The sender cannot know if a message was delivered since transport is unreliable; thus, one or more acknowledgement messages are required. Moreover, the sender cannot distinguish between message delivery errors, acknowledgement delivery errors, or delays (either in processing or because of network unreliability).
The recipient is forced to send the acknowledgement only after the message is either processed (or persisted for processing), because acknowledging before processing would not work: if the recipient exhibits a fault after acknowledging but before processing, the message is lost.
In the case that the acknowledgement is lost, the sender can’t know (due to lack of insight into the recipient’s state) whether the recipient failed before scheduling the message for processing (in essence losing the message), whether the recipient is just running a bit slow, or whether the acknowledgement message itself was lost. Now, if the sender decides to re-deliver, the recipient may end up receiving the message twice (for example, if only the acknowledgement was dropped). On the other hand, if the sender decides not to re-deliver, the recipient may end up not processing the message at all (if the message was never scheduled for processing).
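The dilemma above can be seen in a toy model. This is a deliberately simplified sketch (the recipient, the "network", and the message names are all illustrative, not a real messaging API) in which the first acknowledgement is "lost" in transit, so a retrying sender delivers the same message twice:

```python
# Toy model of the retry dilemma: the network "eats" the first
# acknowledgement, so the sender retries and the recipient ends up
# receiving the message twice. All names here are illustrative.

def make_recipient():
    inbox = []
    state = {"acks_to_drop": 1}  # pretend the first ack is lost in transit

    def receive(msg):
        inbox.append(msg)  # recipient processes (here: stores) the message
        if state["acks_to_drop"] > 0:
            state["acks_to_drop"] -= 1
            return None    # ack never reaches the sender
        return "ack"

    return inbox, receive

inbox, receive = make_recipient()

# Sender: retry until an ack arrives (bounded, to avoid looping forever).
for attempt in range(5):
    if receive("msg-1") == "ack":
        break

print(inbox)  # ['msg-1', 'msg-1'] -- the message was delivered twice
```

Note that from the sender’s point of view the two runs (ack lost vs. message lost) are indistinguishable; the only choice is whether to retry, and each choice fails in one of the two ways described above.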
If exactly-once is impossible, what can we do?
The above is quite a significant result, and it’s something we need to internalize when we’re designing distributed systems. A system will typically deliver messages just once (in the absence of logic bugs), but when errors happen the system will behave in one of two ways:
- A message is not delivered at all, or delivered but lost before any state changes occur in the recipient, leaving the recipient in the same state it was before receiving the event.
- A message gets delivered more than once.
The first case is what we call the “at-most-once” delivery guarantee while the second, being the mirror-opposite, is called the “at-least-once” delivery guarantee.
As I already explained above, the at-most-once guarantee means that we either deliver a message successfully or we don’t deliver it at all. This is a straightforward guarantee to implement and support because it involves the smallest “effort” on our part:
- The sender pushes the message to the recipient. Both of them ignore any errors and timeouts, and the sender does not require an acknowledgement
- The recipient does nothing special either, or (if required by the transport) acknowledges the message before any side-effects happen
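The two steps above amount to fire-and-forget. A minimal sketch, using a UDP datagram as the unreliable transport (the host, port, and payload are placeholders):

```python
import socket

# At-most-once sketch: fire-and-forget over UDP. The sender pushes the
# datagram and moves on: no acknowledgement, no retry. If the packet is
# lost, the message is simply gone. Host and port are placeholders.

def send_at_most_once(payload: bytes, host: str = "127.0.0.1", port: int = 9999) -> bool:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))  # handed to the network; fate unknown
        return True
    except OSError:
        return False  # caller deliberately ignores this: at-most-once
    finally:
        sock.close()

send_at_most_once(b"metric:page_view=1")
```

The return value only says the datagram was handed to the OS, not that anyone received it; a caller honouring at-most-once semantics would at most log a failure, never retry.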
The above essentially means that this is by far the easiest of the delivery guarantees to support as developers and system designers. Because of this, if the at-most-once guarantee is adequate for your requirements: choose it! If your subject matter experts confirm that some losses are acceptable, supporting this guarantee will be the easy choice and the one with the least possibility of bugs.
It is essential, however, to note that in most cases while we’re OK to drop a message, we still want to record the loss. This will alert someone to a potential issue in our infrastructure, and potentially allow someone to manually correct things.
While the at-most-once guarantee is much simpler, in practice most often you’ll find that you need to guarantee at-least-once delivery for several possible reasons:
- The experience for users is best because they don’t need to retry actions manually; at worst they’ll experience some delay
- We do not lose information
- The changes that need to be done actually get done
The difficulty mentioned above of guaranteeing at-least-once delivery isn’t due to any particular algorithmic complexity, or complexity in code. Indeed what we need to do to support at-least-once guarantees is:
- The sender pushes messages and waits for acknowledgements with a timeout. If no acknowledgement is received before the timeout expires, the message delivery is retried
- The recipient first processes the message, acting on any side-effects, and only if those are successful does the recipient acknowledge the message
It is quite common for the above to be handled in the infrastructure layer so that they are transparent to the application layer (and obviously your domain layer as well). The infrastructure layer can receive the message, handling any integration concerns, before passing it to the application layer, and acknowledge the message on successful processing in the application layer (detected either through a return value or the absence of thrown exceptions, depending on the language and paradigm used).
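As a sketch of that consumer loop, here is an in-memory stand-in for a broker (real brokers such as RabbitMQ or SQS expose an equivalent receive/ack/nack trio, with redelivery driven by a visibility timeout rather than this toy re-read):

```python
from dataclasses import dataclass, field

# At-least-once consumer loop sketch: acknowledge only AFTER the handler
# succeeds. The in-memory queue is a stand-in for a real broker.

@dataclass
class Message:
    body: str

@dataclass
class InMemoryQueue:
    pending: list = field(default_factory=list)
    acked: list = field(default_factory=list)

    def receive(self):
        while self.pending:
            yield self.pending[0]  # unacked messages get re-read (redelivery)

    def ack(self, msg):
        self.pending.remove(msg)
        self.acked.append(msg)

    def nack(self, msg):
        pass  # toy model: the message simply stays at the head of the queue

def consume(queue, handle):
    """Infrastructure layer: ack only once the application layer succeeded."""
    for msg in queue.receive():
        try:
            handle(msg.body)      # application layer does the side-effects
        except Exception:
            queue.nack(msg)       # leave it for redelivery: at-least-once
        else:
            queue.ack(msg)        # safe to drop now: effects are durable

processed = []
calls = {"n": 0}

def handler(body):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")  # first delivery fails
    processed.append(body)

q = InMemoryQueue(pending=[Message("charge-order-42")])
consume(q, handler)
print(processed, calls["n"])  # processed once, but delivered twice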
So where is the difficulty in supporting at-least-once guarantees? It lies in that we need to make sure that potential errors do not stop us from adhering to the above. Things we need to be aware of:
- Swallowing exceptions
- Dead letters (eventually causing you to drop the message)
- Bugs that cause system state to not change in response to the message
- Errors with side-effects
- Bugs in dependencies or infrastructure
I think it is important to emphasize this, even though it was mentioned above: side effects may be triggered multiple times. This means that if the side effect is taking payments, some payments may be requested twice, or if you’re sending an email, you may end up seeing two emails in your inbox.
The non-option: no guarantees
So, in fact, there is yet a fourth alternative, the easiest one of all: not having any guarantees.
However, I would argue that this is not a real option for us, for the simple reason that it can leave your system inconsistent, which violates one of my favourite rules for system design: the rule of least surprise 3.
Deus ex machina: Exactly-once processing
As I mentioned earlier, the only real options we have are either losing a message or receiving it more than once. And with at-least-once, we have the risk of doing things twice, sometimes with catastrophic results (I have some horror stories on the subject). So? Are we doomed?
While exactly-once delivery is not possible, we have a way out: exactly-once processing. Exactly-once processing is the guarantee that even though we may receive a message multiple times, in the end, we observe the effects of a single processing operation. This can be achieved in two ways:
- Deduplication: dropping messages if they are received more than once
- Idempotent processing: applying a message more than once has precisely the same effect as applying it exactly once
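To give a taste of both approaches before the next article, here is a minimal sketch. The message IDs, accounts, and in-memory storage are all stand-ins; a real system would persist the deduplication record atomically with the side-effect.

```python
# Two sketches of exactly-once *processing* on top of at-least-once
# delivery. All names and storage here are illustrative stand-ins.

# 1. Deduplication: remember the IDs of messages we've already handled
#    and drop any repeats.
seen_ids = set()
balance = {"alice": 0}

def deposit_with_dedup(msg_id, account, amount):
    if msg_id in seen_ids:
        return  # duplicate delivery: drop it
    seen_ids.add(msg_id)
    balance[account] += amount  # the side-effect runs at most once per ID

# 2. Idempotent processing: design the operation so applying it twice
#    has the same effect as applying it once (set, don't add).
def set_email(profiles, user, email):
    profiles[user] = email  # naturally idempotent

# Simulate at-least-once redelivery: each message arrives three times.
for _ in range(3):
    deposit_with_dedup("msg-1", "alice", 100)

profiles = {}
for _ in range(3):
    set_email(profiles, "alice", "alice@example.com")

print(balance["alice"])    # 100 -- not 300
print(profiles["alice"])   # alice@example.com
```

Note the difference in where the work happens: deduplication adds bookkeeping around an arbitrary operation, while idempotency changes the operation itself so no bookkeeping is needed.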
Well, obviously the next article is about exactly-once processing. We’ll be seeing how it’s possible, how deduplication and idempotency can be implemented, and some gotchas.
Anyway, it took me a long time to write this article, but very honestly, I enjoyed writing it. I hope you enjoyed reading it 😁 See you again next time!
- 1 : I hope this covers the meme quota for this post.
- 2 Strict consistency: Strict consistency is an impossible consistency model in which all processes are instantly and immediately consistent. This type of consistency is (obviously) not feasible in the real world, as time is required to transfer data (if for no other reason). It is only useful as a theoretical construct.
- 3 Principle of least surprise: The principle of least surprise, sometimes stated as the principle of least astonishment, is a very useful principle from user interface design, which I find applies equally well in system design. A system should not behave in surprising ways to its users, but neither should it behave in surprising ways to the people supporting it in production or maintaining it by adding features. Surprises can take the form of unexpected behaviour, but also of not adhering to standards and practices in code, or applying them inconsistently. In the context of this article, the principle of least surprise refers to the need for a system's behaviour to be consistent.
Further reading material
- You Cannot Have Exactly-Once Delivery by Tyler Treat
- The FLP paper, proving impossibility of consensus between distributed systems with even one failure Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (April 1985), 374–382.
- Danny Dolev, Cynthia Dwork, and Larry Stockmeyer. 1987. On the minimal synchronism needed for distributed consensus. J. ACM 34, 1 (Jan. 1987), 77–97.
- Jennifer Lundelius Welch. 1987. Simulating synchronous processors. Inf. Comput. 74, 2 (Aug. 1987), 159–171.