UPaxos: Unbounded Paxos Reconfigurations

by simbo1905

The year 2016 turned out to be a bumper year for pragmatic Paxos discoveries. Hot on the heels of the FPaxos discovery of more flexible quorums comes Unbounded Pipelining in Dynamically Reconfigurable Paxos Clusters or “UPaxos”. This uses overlapping quorums between consecutive cluster configurations, and a leader “casting vote”, to enable cluster reconfigurations in a non-stop manner even when reconfiguration messages are lost.

The UPaxos paper describes a state-of-the-art general-purpose Paxos implementation packed with solid techniques. In this post we will look at the key finding of eliminating the pipeline stall problem.

To understand the safety of UPaxos it is helpful to recap overlapping quorums as generalised by FPaxos. We know from FPaxos that we can use separate quorums for Prepare and Accept messages as long as they overlap. Why? If a leader transmits an Accept, and then fails, the next leader will issue a Prepare, which must see at least one response containing the highest value accepted from the old leader. Why? Because Paxos is a collaborative algorithm: any new leader always “chooses” the highest-numbered value left by the old leader. Intuitively we can say that the new leader “completes” any consensus work started by the last leader. UPaxos takes this to the next level by requiring that quorums overlap between consecutive reconfigurations. The stroke of genius in the UPaxos paper is exactly how we can use this additional safety property to avoid pipeline stalls.

With Paxos a cluster configuration is a pair of quorum sets \langle Q^I,Q^{II} \rangle where Q^I is the set of nodes used for “Phase I” Prepare messages and Q^{II} is the set of nodes used for “Phase II” Accept messages. The general safety requirement is that the two sets overlap which is written Q^I \frown Q^{II}. The original Paxos papers used a simple majority of nodes for both phases which clearly overlap. Using distinct but overlapping sets for the two phases is a new idea in 2016 which the FPaxos paper made famous.
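To make the overlap requirement concrete, here is a small sketch that checks Q^I \frown Q^{II} for explicit quorum sets. The helper names are mine, not the paper’s, and simple majorities are used purely for illustration:

```python
from itertools import combinations

def majorities(nodes):
    """All simple-majority (or larger) subsets of a node set."""
    need = len(nodes) // 2 + 1
    return [set(c) for k in range(need, len(nodes) + 1)
            for c in combinations(sorted(nodes), k)]

def frown(q1_quorums, q2_quorums):
    """Q^I frown Q^II: every phase I quorum intersects every phase II quorum."""
    return all(a & b for a in q1_quorums for b in q2_quorums)

nodes = {"A", "B", "C"}
assert frown(majorities(nodes), majorities(nodes))  # majorities always overlap
assert not frown([{"A"}], [{"B", "C"}])             # a pair that violates frown
```

The point of the check is that overlap must hold for every pairing of a phase I quorum with a phase II quorum, not just some of them.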

Changes to the cluster configuration are modelled by a sequence of configurations \langle Q_0^I,Q_0^{II}\rangle,\langle Q_1^I,Q_1^{II}\rangle \ldots. The integer subscript is called the era and it is the index of a particular configuration in the sequence of configurations. As with the original Paxos paper we change the cluster configuration by treating it as a command value that we run through the algorithm. When the cluster reconfiguration command value is known to be fixed it is appended to the local copy of the sequence of configurations at each node. Another way to say this is that we use the Paxos engine itself as the atomic broadcast mechanism for new cluster configurations. As stalls are not a problem for UPaxos, the new configuration comes into effect immediately. As we saw in the previous post, doing that with traditional Paxos would harm concurrency when messages are lost.
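As a sketch of how the log itself acts as the atomic broadcast for configurations, here is an illustrative fragment. The command shape and names are my own assumptions, not the paper’s API:

```python
# Era 0 configuration: (phase I node set, phase II node set).
configs = [({"A", "B", "C"}, {"A", "B", "C"})]

def apply_fixed_value(value):
    """Called at each node when a slot's value is known to be fixed."""
    if isinstance(value, tuple) and value[0] == "reconfigure":
        _tag, q1_nodes, q2_nodes = value
        # Appending gives the new configuration the next era number.
        configs.append((q1_nodes, q2_nodes))
    # else: apply a regular client command to the state machine

apply_fixed_value(("reconfigure", {"A", "B", "C", "D"}, {"A", "B", "C", "D"}))
assert len(configs) == 2  # the cluster has entered era 1
```

Because every node applies fixed values in slot order, every node appends the same configurations in the same order and so agrees on the era numbering.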

The UPaxos paper has a logical proof of the additional safety requirement used to eliminate the possibility of stalls. The additional requirement is an overlap between the phase I quorums of one configuration and the phase II quorums of the next. This is written as Q_e^I \frown Q_{e+1}^{II} for all e. It’s worth noting that if we use straightforward majorities, and we either add or remove a single node, then majorities in consecutive configurations will overlap.
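We can check that last claim mechanically for a concrete example. This sketch (helper names mine) verifies that every majority of a three node cluster intersects every majority of the four node cluster obtained by adding one node, and vice versa for removal:

```python
from itertools import combinations

def majorities(nodes):
    """All simple-majority (or larger) subsets of a node set."""
    need = len(nodes) // 2 + 1
    return [set(c) for k in range(need, len(nodes) + 1)
            for c in combinations(sorted(nodes), k)]

old = {"A", "B", "C"}   # era e
new = old | {"D"}       # era e+1: a single node added

# Q_e^I frown Q_{e+1}^II: every old majority intersects every new majority,
# and the same holds in the other direction (modelling removal of a node).
assert all(a & b for a in majorities(old) for b in majorities(new))
assert all(a & b for a in majorities(new) for b in majorities(old))
```

Intuitively: a majority of four nodes must contain at least two of the original three, and any two subsets of size two drawn from three nodes must share a member.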

We can talk about the era of a slot as e \langle s \rangle. This names the phase II quorum used to fix that slot. Ballot numbers also have an era e \langle b \rangle. How? We encode the era number into the most significant bits of the ballot number. We can refer to the prepare quorum of a ballot number as Q_{e \langle b \rangle}^{I}. We say that the cluster is “in era e” while instances are being chosen using \langle Q_{e}^{I},Q_{e}^{II}\rangle. The paper states that whilst a change from era e to e + 1 is in progress some instances may use the interim configuration \langle Q_{e}^{I},Q_{e+1}^{II}\rangle, and once the change from era e to e+1 is complete all instances will use \langle Q_{e+1}^{I},Q_{e+1}^{II}\rangle. The paper distills the safety requirement down to e \langle s \rangle \leq e \langle b \rangle + 1. Which basically says that if a leader has received sufficient promises for a ballot number in a given era it can use it to fix a slot in either that era or the next one.
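A minimal sketch of one way to encode the era into the high bits of a ballot number. The bit width and names are my own choices for illustration; the paper only says the era occupies the most significant bits:

```python
ERA_BITS = 16  # assumed width of the per-era counter field

def make_ballot(era: int, counter: int) -> int:
    """Pack the era into the high bits so any era e+1 ballot outranks any era e ballot."""
    return (era << ERA_BITS) | counter

def era_of(ballot: int) -> int:
    """Recover e<b> from a ballot number."""
    return ballot >> ERA_BITS

b = make_ballot(era=3, counter=42)
assert era_of(b) == 3
# The lowest ballot in era 4 still beats the highest possible ballot in era 3.
assert make_ballot(4, 0) > make_ballot(3, (1 << ERA_BITS) - 1)
```

A nice side effect of this layout is that ordinary integer comparison of ballots already respects era ordering, so no extra tie-breaking logic is needed.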

Okay, enough math already. How do we use this to do a reconfiguration without a stall? The main technique is the “leader casting vote”. Let’s walk through a simple reconfiguration. By way of a warm-up consider the following diagram where there is no reconfiguration to establish a baseline. In this figure a three node cluster comes up and gets to normal running using simple majority quorums:


That is the simplest case in vanilla Paxos. A node times out and becomes the leader. It issues a prepare using a suitably high ballot number b. When it hears back a quorum of responses it streams client commands v1 and v2 as two consecutive accept messages. The messages target slots s and s+1. The diagram only shows two client commands but the leader can continue indefinitely.

Okay so what happens during a UPaxos reconfiguration? In the following diagram, we move between two eras without changing any physical nodes. Why? Well, it does help to make the diagram less messy. There is actually a very good reason to keep the same physical nodes but to change their voting weights which is covered in the UPaxos paper. I will cover that in a later post. So for the moment assume that we are just doing this to demonstrate the principle of how stalls are eliminated.


At the first message the leader is still in steady state mode (the bottom of the previous diagram). In this mode it happens to stream a client command labeled v1 under ballot number b in the original era e. Then at the upper dotted line the administrator commands a reconfiguration to move the cluster into era e+1. The leader dutifully streams that reconfiguration command, rather than a regular client command, into the next slot. When the leader hears a quorum response to the reconfiguration command at the second dotted line the new configuration is in effect. Then things get interesting.

Once the leader knows the new era is fixed at slot s+1 it needs to upgrade to a fresh ballot number in the new era. It does this by having a majority of nodes promise to a new ballot number in the new era. Only once it has done this will it have completed the reconfiguration. Until it has upgraded to a new ballot number in the new era it will not be able to process any further reconfiguration commands.

To upgrade to a fresh ballot number belonging to the new era the leader introduces an asymmetry by entering a “leader overlap mode”. It splits the cluster into two quorums: a prepare quorum Q^{I} with Follower 1 and an accept quorum Q^{II} with Follower 2. It sends the prepare message containing the new ballot number b’ only to Follower 1. Meanwhile, it continues to stream client values to Follower 2 using the old ballot number. Why?

The leader does this to create the circumstances where it has the “casting vote” as to whether the Q^{I} quorum has made sufficient promises to the new ballot number. It does this by acting last within the prepare quorum: it defers making a promise to its own new ballot number until it knows that sufficient other nodes have made promises to give it the “casting vote”. The leader doesn’t risk disrupting a majority of nodes in the Q^{II} quorum by sending any b’ prepare messages until it has cast its own vote and has a majority. Rather it continues to stream client commands to the Q^{II} quorum. Why?

This is because Paxos is designed to work even when messages are delayed or reordered. If it were to stream a prepare for b’ to nodes in the Q^{II} quorum it might be reordered with respect to client commands it previously streamed using b. That would cause those client commands to be rejected, forcing the leader to resend them. By holding the “casting vote” in the Q^{I} quorum the leader knows the exact point at which it instantaneously promises to b’ itself. That is the exact point where it switches to using b’ for subsequent client commands, and as those are local actions they cannot be reordered.
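A toy sketch of the casting vote logic. The class and its names are hypothetical, not from the paper: the leader tallies promises for b’ from the other Q^{I} nodes and only switches ballots at the instant its own deferred promise completes the quorum:

```python
class OverlapLeader:
    """Illustrative leader state during 'leader overlap mode'."""

    def __init__(self, cluster_size, old_ballot, new_ballot):
        self.quorum = cluster_size // 2 + 1
        self.current_ballot = old_ballot  # keep streaming accepts under b
        self.new_ballot = new_ballot      # b' being prepared on the Q^I side
        self.promises = 0                 # promises received from other nodes

    def on_promise(self):
        self.promises += 1
        # Casting vote: once the other promises leave the leader exactly one
        # short of a quorum, its own local promise completes it. Switching
        # ballots here is a local action, so it cannot be reordered against
        # the accepts already streamed under the old ballot.
        if self.promises + 1 >= self.quorum:
            self.current_ballot = self.new_ballot

    def ballot_for_next_accept(self):
        return self.current_ballot

leader = OverlapLeader(cluster_size=3, old_ballot=5, new_ballot=6)
assert leader.ballot_for_next_accept() == 5  # still using the era e ballot
leader.on_promise()                          # promise arrives from Follower 1
assert leader.ballot_for_next_accept() == 6  # leader casts its vote: now b'
```

In a three node cluster a single promise from Follower 1 plus the leader’s own vote forms the majority, which is why the switch happens on the first promise here.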

What happens if the leader crashes at any point during the reconfiguration? No special logic is needed to recover from a leader crash. The normal logic of the leader takeover phase works as usual. Regardless of whether the next leader is in the old or new era it will “see” any values which it needs to complete.

I should note that some less reputable consensus algorithms dismiss reordering as a problem for strong consistency. They assume that TCP connections will keep messages in order. That gets tricky when you have dropped connections and reconnects, so you likely still need to track what’s going on above the network layer. UDP, meanwhile, has lower overhead because it makes no attempt to order messages. The new super-fast “kernel bypass” RDMA networking technologies that take networking down to microseconds have a UDP mode exactly to avoid the overhead of ordering. When we consider that the network might send datagram packets between two data centre resilience zones via redundant routes, we see how reordering can occur. The whole point of Paxos is to generate consensus across an unreliable and asynchronous network that doesn’t guarantee message ordering: it works over redundant routes with arbitrary delays where ordering is never guaranteed.