Failures In Distributed Databases

by simbo1905

The “Morning Paper” blog has some fascinating insights into critical production failures in distributed data-intensive databases such as Cassandra and HBase. This reveals that simple testing can prevent most critical failures. In this blog post we take a quick look to see what an embeddable Paxos algorithm project such as Trex can learn from this study. 

Of 198 randomly sampled user-reported critical failures from five data-intensive distributed systems (Cassandra, HBase, HDFS, MapReduce, Redis):

  • 92% of catastrophic failures were due to bugs in error handling of non-fatal errors
  • 33% were trivial mistakes in error handling (such as a “//TODO”, System.exit(), or only a log line) that can be easily caught by a static analyser
  • 23% were “so wrong that any statement coverage testing or more careful code reviews by the developers would have caught the bugs”. Also “not-so-obvious error handling bugs were deterministic so would have shown up in basic unit test coverage which was missing”
  • 34% that were not-so-obvious and tricky to unit test are in “complex error handling logic”

Experienced developers know that complex error handling logic is a bad idea. Taking a “let-it-crash” approach avoids having any complex error handling. More about that and how it applies to Paxos below.

Other interesting facts about non-critical failures:

  • 77% can be reproduced by a unit test
  • 98% of failures are guaranteed to manifest on no more than 3 nodes
  • 84% will manifest on no more than 2 nodes
  • Most failures require no more than three input events to get them to manifest
  • A majority of failures log the triggering events are logged but the median number of log messages printed by each failure was 824

Let’s have a quick recap about whether Paxos supports simplistic error handling logic with 100% test cover. We are talking about bugs in the logic to handle what are commonly known as exceptions. We have likely all seen processes run out of memory. It is how we deal with such IO exceptions which is important. The study indicates that we should use very simplistic strategies with good test coverage. The best approach to such disk or memory is to crash the process and have it reinitialize from its journal.

Where we might find complexity is in lost, repeated, late, or reordered messages. This is where Paxos excels. Paxos is mathematically proven to be correct under messages lost, resent or reordered between nodes in a cluster. The “safe to resend messages” aspect of the algorithm is the very helpful property of idempotency. It means that the network doesn’t have to be reliable. The simplistic strategy to deal with lost messages, when those messages are idempotent, is to keep on resending messages until we eventually get back an acknowledgement. Simple. Trex does this and has test coverage for out-of-order and duplicated messages. Totally lost messages (such as network partitions) simply lead to a lack of forward progress until the network is fixed and the retransmitted messages finally get through.

For Paxos aficionados this is an interesting set of facts since the Google Chubby paper is often interpreted as saying something completely different. It is often read as saying that the subtlety of the mechanically simply Paxos algorithm led them having to do a lot of distributed randomised testing to find all their bugs.

Did Google just get unlucky? Or is it because they choose not to implement Paxos as an embeddable library? They were implementing consistency-as-a-service with clients observing strongly consistent state as a simplified file system metaphor. Anyone else seen outages due to faulty Zookeeper client callback logic adapting the simple file system metaphor back into an distributed application’s error handling logic? I digress.

What about disk errors? The only requirement for Paxos correctness is that a node doesn’t forget what promises it has made and the last values it accepted. That means that we have to force the disk before we send any messages. Trex has tests which confirm that the disk is flushed before any messages are sent. If we crash between forcing the disk and sending the message that is only a lost message which is safe. What if we cannot write to disk because the disk is full? The node should stop sending any messages and the cluster should timeout on it. How do we arrange that? Ensure that the thread which flushes the disk is also the thread that dispatches the messages. Simple. If it cannot flush the disk it never sends the messages. It is indistinguishable from a dead node. No zombie nodes here please.

Any node that has been cleanly shutdown for any length of time needs to catch up with rest of the state of the cluster. Only what is locally on disk is required to recover the last known state then connect to and catch up with the rest of the cluster. What happens after a node has locked-up or crashed? Fix the disk issues and restart it. The normal startup logic will recover the last good state and it will catch up with the cluster. That is the “let-it-crash” approach; in response to errors in IO which lead to an unknown system state the correct approach is to re-initialise. We run the re-initialisation logic during normal startup to it is heavily used code which can have good test coverage.

In short Paxos is perfectly positioned to avoid the sorts of bugs that are encountered on real-world distributed systems.

Edit: The CERN “Silent Corruptions” presentation about disk corruption strongly suggests a CRC check is a good idea when loading saved values from disk. Trex writes a CRC check to each value written to its Journal and checks it when reading it back.