This is the second blog post in a short series about Consistency
I love the two generals problem. I am mesmerized by its simplicity and impossibility. And I believe nothing has shaped my professional life more than this problem.
Back in 2008 I had not yet worked with any major distributed systems, and most of my work was with systems which involved heavy aspects of custom electronics; one could say I was into IoT before IoT was a thing. But most of our systems were distributed by nature even then. Smaller scale, but it was quite common for them to exhibit faults similar to what you’d expect from distributed systems. This is obvious if you consider that a typical parking lot (my company’s biggest selling point at the time) involved around 20 network-connected sensors devices with 3-4 controllers (servers), a minimum of 3 payment POS, and central servers. However, even with this small amount of devices involved we did experience distributed system issues - partially connected sensors, payment systems which needed to choose availability over consistency (so that people could get out of the parking in case of a failure), etc. It was after yet another failed attempt to configure the network to be more reliable which I happened across the two generals problem. While it didn’t help me in that situation, it set me down a path which caused me to learn more about distributed systems.
This thought experiment takes us back to Byzantine times, with two generals leading an army to attack a fortified city. They quickly realize that their only chance of winning involves attacking from opposite positions, and soon after one of the two generals notices a break in the city’s defenders. They both reach an agreement about this, but since time depends on a weakening of the defenders, they sit down to discuss and agree on a protocol detailing how they would exchange messages so that they can both attack at the same time.
And this is where we come in.
Our task is to devise a protocol that:
- In the end, the two generals can be certain, without any doubt, of the time they need to attack.
- The protocol takes a finite amount of time to complete
- Can tolerate any number of message in that discussion being lost because the city under siege has spies who can intercept the messengers
A (wrong) example
One possible protocol would be for the first general A to send to the second general, B, an order to attack at a particular time. However, since the messenger he sends may be intercepted, we introduce an acknowledgement: general A only attacks if he receives an acknowledgement from general B.
However, what happens in this case:
General A can figure out that something went wrong, but can’t know what:
- Was his messenger intercepted before reaching general B?
- Was his messenger successful in reaching general B, but the messenger carrying the acknowledgement from general B was intercepted?
Also, what about general B? From his point of view, he did everything right, and doesn’t expect anything back. so he attacks at 3:00am, alone, because general A didn’t get an acknowledgement.
How can we fix this? Someone could suggest that General A should retry if he doesn’t receive an acknowledgement. What happens in this case then? One possibility is this:
🤷♂️ Same outcome.
Give it a go!
No, honestly, please give it a try if you’re not familiar with it. It’s kind of fun!
I thought about sending multiple messengers in parallel, sending multiple messengers to acknowledge. Continuously send messengers (even without an attack message) and do something if someone goes missing (you detect failures like this), and others which I no longer remember. I very honestly suggest you take at least 5-10 minutes to try it out - it really helps to grok the problem.
Feel free to tweak the examples above to play with this. Try plant uml in this awesome online editor. I would love to hear about the protocols you thought about in the comments below! Feel free to post a link from plant-text to make it easier to play with.
Unfortunately, as I said in the beginning of this article, there is no solution to this problem. It’s proven to be impossible 1. It also isn’t difficult to see why. Whatever protocol you may come up with, it can always be subverted by stopping all messengers from the last acknowledgement general B needs to send onwards. Feel free to check out the two proofs on wikipedia for more (although the previous is a simplification of the deterministic proof).
Why is it important?
It is important because it is this thought experiment, and it’s lessons, that leads us to a number of really interesting deductions:
- While guaranteed, time-bounded consistency is not possible, consistency in general is quite possible. Given that the generals simply needed to attack together (and not within a period of time) they could simply retry the process until they succeed in communicating. There is no guarantee that this would ever happen, but the probability of success becomes increasingly likely with every retry.
- Exactly once delivery of a message is impossible
- Two (or more) systems can be either consistent or available in the face of possible interruption of communication between the moving parts. This is the CAP theorem.
- The lessons we can deduce can help us design and deliver more reliable systems
In a next article I plan to expand a bit on exactly-once delivery guarantees, how can the two generals problem be used to prove that, and what can we do about it.
That is unless in the interim I hear someone again who received events in a BC, and changes them to a command just because commands can be rejected. Actually, if you prefer a rant about why I believe this is not our only option, nor the golden standard some people believe it to be (but to be clear, not that it’s wrong), just write in the comment section below.
- 1 : Even though the two generals problem is provably impossible to solve, please don’t let that deter you from trying it out. There are some valuable lessons to be had by trying it out.