Paxos Reconfiguration Stalls
In a previous post I covered using the Paxos engine itself to do cluster reconfiguration as per the 2001 Paxos Made Simple paper. In this post I will cover a problem with that technique know as pipeline stalls. This post is to set the scene for a new technical report published 2016 which fixes the problem with a state-of-the-art Paxos implementation called UPaxos.
First a recap on the subject of cluster reconfigurations. As discussed in a previous post on dynamic cluster membership the 2001 paper Paxos Made Simple uses the Paxos engine itself to pick a value that defines the next cluster configuration. The leader sends out a command which states that at the current slot “plus α” a new cluster configuration takes effect.
Why at slot plus α? To attempt to minimise the risk of stalls during concurrent fixing of client values. Concurrent fixing that sounds risky. Paxos for state machine replication enforces an ordering at all nodes. Won’t concurrent processing of commands with multiple threads violate that? Yes but that doesn’t necessarily lead to inconsistent nodes in typical real world systems. A typical key-value store doesn’t need strict global ordering. Only operations on any particular key must be ordered on every node. Operations on different keys may run in parallel on different threads without leading to inconstancies. One approach is to route all operations on the same key to the same thread on each node in leader order. We could hash any given key then modulo the thread pool size to compute index of the thread.
Okay so what is a good value for α in a typical application. That depends. With the original Paxos the slot at “plus α” becomes a fence across which operations cannot be reordered. The new cluster reconfiguration which comes into effect at α defines which nodes are in the quorum that fix subsequent slots. If the leader doesn’t see a quorum response for the new cluster reconfiguration by the time we get to slot α we cannot fix any further values as we don’t know which quorum is appropriate. We get a stop-the-world event. Bummer.
What if the network keeps on dropping the cluster reconfiguration message even when the leader repeated times out on responses and retransmits? By setting α to be very large we make it improbable that resent messages don’t get through in time. Improbable doesn’t mean impossible. Setting a very large α means that cluster reconfiguration operations can take a very long time to come into effect with no guarantee. This is annoying to say the least. In the next post we will look into how this is fixed by UPaxos.