Reliable Systems

One day you wake up as a pharaoh of the Fourth Dynasty. Your position obliges you: you must get up and build a monument the size of your vanity.

Calculations on papyrus show that the construction will require about 100 million man-hours. You want to finish the project in a reasonable time, say 20-30 years. It’s clear that on such a scale, many things can go wrong. Therefore, you need a fault-tolerant distributed system.
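
To get a feel for the scale, here is a rough back-of-the-papyrus sketch in Python. The 100 million man-hours and the 20 to 30 year window come from the text above; the figure of 3,000 working hours per laborer per year is purely an assumption made for illustration, as is the helper name `workers_needed`.

```python
TOTAL_MAN_HOURS = 100_000_000  # estimate quoted in the text

# Assumption for illustration only: one laborer contributes
# about 3,000 hours of work per year. Not a historical figure.
HOURS_PER_WORKER_PER_YEAR = 3_000


def workers_needed(total_hours: float, years: float,
                   hours_per_worker_per_year: float) -> float:
    """Average number of laborers that must be on site at all times."""
    return total_hours / (years * hours_per_worker_per_year)


if __name__ == "__main__":
    for years in (20, 30):
        crew = workers_needed(TOTAL_MAN_HOURS, years, HOURS_PER_WORKER_PER_YEAR)
        print(f"{years} years -> about {crew:.0f} laborers on site")
```

Under that assumption the crew is somewhere between one and two thousand people working continuously for decades, which is already enough moving parts for something to break every day.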

The counterintuitive idea behind such systems is that you can assemble a reliable whole from unreliable parts. In theory, this requires the failures of the individual parts to be independent of one another. In practice, it translates into two requirements: there must be no single point of failure, and no failure may trigger a chain reaction.
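
A minimal sketch of why independence matters, with made-up numbers. It assumes each of `n` redundant parts fails with the same probability `p`, and that the system only needs one of them to survive. The function names and the probabilities are illustrative assumptions, not anything taken from a real system.

```python
def all_replicas_fail(p: float, n: int) -> float:
    """Probability that n independent parts, each failing with
    probability p, all fail at the same time."""
    return p ** n


def failure_with_shared_dependency(p: float, n: int, q: float) -> float:
    """Same replicas, but they all depend on one shared component
    (the common well) that fails with probability q, independently
    of the replicas. The system fails if the shared component fails
    OR every replica fails."""
    survives = (1 - q) * (1 - all_replicas_fail(p, n))
    return 1 - survives


if __name__ == "__main__":
    p = 0.1   # hypothetical chance that one crew is out of action
    q = 0.05  # hypothetical chance that the shared well goes bad
    for n in (1, 2, 3, 5):
        print(n,
              f"independent: {all_replicas_fail(p, n):.6f}",
              f"shared well: {failure_with_shared_dependency(p, n, q):.6f}")
```

With independent failures, every extra replica multiplies the failure probability by `p`. With a shared dependency, adding replicas never pushes the failure probability below `q`, which is exactly what a single point of failure means.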

A single point of failure is when the masons all drink from the same well. A chain reaction is when one crew takes the entire supply of white limestone for urgent repairs → facade work stops → the pharaoh gets angry → urgent dismantling of already laid blocks is required → construction delays.

It’s a good thing that system decomposition, replication, redundancy, and graceful degradation were invented long ago. It’s also good that all of this is now so commonplace that it falls into the category of “obviously, how could it be otherwise.”