Cluster Replication With Paxos

by simbo1905

In my last post I made the case for the defence of the Paxos Algorithm. In this post we will delve into a tough problem which Paxos solves; state replication across a cluster of replicas using atomic broadcast. It turns out that maintaining primary ordering on secondary servers is straightforward to achieve using Paxo without affecting correctness or efficiency. This may be a surprising result to any reader who has looked at ZAB and Raft. You don’t need an external leader election service as used in the Spinnaker or PaxStore papers. You send the highest committed, and highest proposed, log stream sequence counter during leader failover, and introduce optional, ignorable, retransmission requests. The techniques described in this article are used by the Trex Paxos library.

Fork me on GitHub

To be clear I am not claiming any new innovation or invention in this post; it is simply my attempt to demystify the subject. It is my assertion that it is possible to implement the protocol as described in the 2001 paper Paxos Made Simple to make a safe and efficient algorithm. For definitions of the messages and terminology used in this post please refer to that paper.

A common confusion is not comprehending that failover safety is designed into Paxos as proposal numbers must be unique to each leader; this is trivially achieved by encoding a node unique value in the least significant bits of the number used to compare messages. Remarkably even papers published in late 2013 miss this core feature of the algorithm and choose an unsafe global log index number as the Paxos number; which is actually a violation of the algorithm which then requires them to use a slow custom leader election mechanism to gain safety.

Problem Statement

Paxos is an algorithm for agreeing a value across a cluster. This is known as the consensus problem:

Assume a collection of processes which may propose a value. A consensus algorithm ensures that a single one amongst the proposed values is chosen [Paxos Made Simple]

The algorithm lets a majority of nodes agree one value at a time. Chaining values together into a meaningful protocol to provide a reliable service is left as an exercise to the reader.

It helps to have a simple example in mind so for the purposes of discussion assume we want to replicate a file-backed map as a trivial key-value datastore. In practice, any client-to-server network traffic can be replicated using the approach detailed in this post. Replication of our simple datastore will be by copying the stream of put(_,_) or remove(_) operations onto a set of backup hosts. One thing which may cause confusion is that your application may label “value” as meaning one thing (something held in a replicated map), but Paxos calls “value” the next command you are trying to get consistent across the cluster us as a put(_,_) or remove(_).

In our map example above the operations do not commute; they need to be applied in the same order at every follower else their maps won’t match the leaders. If we label each command with a sequential index we can enforce the ordering. A sequential index also helps use detect gaps in the message stream due to lost messages so we can retransmit them.

Multi-Paxos Numbers For Replication

The description of multi-Paxos on Wikipedia (as at late October 2014) includes a counter:

To achieve [multi-Paxos], the instance number I is included along with each value. [wikipedia]

The paper Paxos Made Simple clarifies that it is the absolute counter of the number of instances of the algorithm. The leader sends accept(I,N,V) where I is the instance counter, N is the proposal number unique to a leader and which stays stable between leadership fail-overs, and V is the value being proposed by the leader for that instance. I am going to clarify the definition of the counter to be more specific:

Let S be the logical log stream position counter, that is included along with each value, which must be made contiguous and unique across all committed values at each node.

Each S is logically a slot in commit history into which leaders propose values, for which a value must be chosen, and which must be committed in order. This counter definition is also used in the Spinnaker paper describing transaction log replication with multi-Paxos. Clients are not shown the effect of the command at each slot until after it is chosen and executed (committed) in log index order after every preceding command is executed (committed). An explicit log index allows for consensus to be performed in parallel on different slots using the same N. This allows a leader to stream accept(S,N,V) messages committing them in log order as they become fixed.

Due to leadership changes, a particular value may be proposed into a given slot S by one leader then committed by another. Leader fail-overs can also cause different values to be proposed into the same slot by different leaders. Paxos requires that N is unique to a cluster node so there cannot be a clash between two leaders. If during a network partition we have multiple nodes attempting to lead, who all got a different value accepted into the same slot by a minority, with no majority, the round is failed. Each leader should use a new higher N at the next round to attempt to fix a value into that slot with a fresh accept(S,N',V) message. This means that for any given {N,S} pair there is only one unique proposed V.


The existence of a contiguous log stream counter S makes it easy for any follower to detect gaps in the transaction log stream. It can recover the lost messages by responding with a retransmission request to the leader (or any follower) stating its highest committed S. All nodes should then keep a ring buffer of recently committed sequential {S,V} with which they can respond. The data-structures used to hold accept messages in log order prior to being committed can be re-used for this purpose. In addition to sending committed values during retransmission we can also speculatively send a second collection of the latest uncommitted accept(S,N,V) messages above the highest committed slot. This will bring the requesting node fully up-to-date.

Choosing A Value: Fix It Then Commit It

We need to consider when a value is fixed, how that is learnt, and when it is committed at each follower. With Paxos a value cannot change when it has been accepted by a majority of the cluster. Becoming aware that a value has been chosen is achieved by the followers learning from the leader. This implies a commit message. Interestingly a leader may die before it sees a majority of accept responses; so a value may be chosen, and cannot change, but no-one is aware. This scenario will be covered in the next section.

In the general case client commands don’t commute.  The leader chooses the ordering which is referred to as “primary ordering”[1]. To achieve primary ordering a leader only issues a commit of a value at a log index S when it learns that all the values lower than that log index have been fixed. It then issues a commit(S,N) message. Recall that a leader should only propose a unique value for any {S,N} pair. This means that a commit(S,N) message identifies the unique accept(S,N,V) message being committed at log index S. Each Follower can hold a sequence of accept(S,N,V) messages indexed by S and when it receives the commit(S,N) it can commit the value of the matching accept(S,N,V) if it has been able to commit all values for slots  S'< S. When there is no message loss a follower will see interleaved accept and commit messages. The leader can stream accept messages such that there can be multiple outstanding values with the highest committed slot lagging behind. The challenge is only when the follower sees gaps in the S numbers of either accept messages, or commit messages, due to lost or reordered messages.

Retransmission will  be required if a node missed some accept messages as the node does not know the V being committed at any given log index S. If some commit messages were missed and the leader has not changed a follower may be able to deduce the values committed at the lower slots. This is because a leader cannot change its proposed value at slot S for a given N, and it could not have lost its leadership and regained it without incrementing N, and it must have committed them in log order. The follower can then commit the preceding slots when it receives a commit(S,N)  where it holds an accept(S,N',V) where N'= N if they are contiguous with the last committed slot. Otherwise it can request retransmission of committed accept messages from the leader. This deduction of which values can be committed allows us to batch up many accept messages into a network frame. A single commit message at the front of the next batch of messages which names the last accept of the previous batch will implicitly commit all the values in the previous batch.

The Leader Take-Over Protocol

What happens for slots above the maximum committed log index slot where a dead leader issued accept messages that are not known to be committed? The new leader runs Paxos full-rounds to fix values into these slots. It issues a prepare(N,S) for all slots not learnt to have been committed. Each follower responds with the highest uncommitted {N,V} pair at each slot. The new leader then selects the value with the highest N at each slot S which it fixes by sendings out a fresh accept message. The fresh accept messages use the higher N number of its prepare messages. When it gets a positive majority accept response it commits each slot. The Spinnaker paper refers to this as the leader takeover phase. 

If the new leader is not aware of any uncommitted slots it can send a prepare for the slot just higher than the last it committed. It may be the case that the new leader missed some accept messages from the previous leader and so is not aware of the full range of slots it needs to fix. One node in the majority knows the highest slot index proposed by the last leader that must be fixed. The responses to the prepare message can state the highest accepted log index at each node. This allows the new leader to learn the full range of slots it needs to fix. It can then send out additional prepare messages as required.

Probing and filling in the previous leaders uncommitted slots is a form of crash recovery. The Paxos Made Simple paper says that if the new leader finds a slot with no accept message due to lost messages it should fix a no-op value into that slot. This does not affect the correctness as the new leader is free to choose any value; it just speeds up recovery. There is the question of what N number to use in the prepare messages sent during the leader takeover. Normally the last promise will be greater than or equal to the last commit. Message retransmission may cause a fast-forward commit such that the last promise number is lower than the last commit. So the new leader should choose a number higher than the maximum number it last promised or last committed.

At the point where a new leader takes over some nodes may not have the full commit history. One node in any majority must have the full history. The new leader can learn who is up-to-date in the responses to the prepare messages which can state the highest committed log index at each node. If the new leader is not up to date it cannot lead. It can request retransmission from one of the fully up-to-date nodes to get caught-up. If the prepare messages are rejected due to a higher promises by other nodes this can also be learnt from the negative acknowledgements. At the next timeout the node performing takeover can re-issue fresh prepare using a higher number than previously encountered.

Leader Failover

Leader failover requires the detection of a leader’s failure. This can be based on a sloppy timeout mechanism. Recall that proposal numbers N must be unique to each node. This can be achieved by encoding a node unique value into the least significant bits. Paxos then makes it safe for any number of nodes to timeout simultaneously. The node with the highest unique bits to timeout which can contact a majority of nodes will lead; although there is no guarantee that a leader will make good progress if other nodes repeatedly timeout and interrupt. Randomising the timeout duration at each occurrence will reduce the probability of wasted messages due to multiple nodes attempting to lead simultaneously. An exponential backoff weighted by each node’s unique number will also reduce the probability of an extended leader battle.

A timeout strategy requires a heartbeat message. Heartbeating the last commit message made by the leader removes the need to have a commit response message. A follower who has missed messages will request retransmission of committed values when it sees the next heartbeat commit message.

Indirect evidence of a stable leader is also desirable. If a single link goes down between a follower and a stable leader then it would be disruptive if the follower issues a high prepare message as it may interrupt a stable leader. Worse yet under a complex network partition the timed out node may not itself be able to lead. To avoid disruption a timed-out follower can first issue a low prepare which will be rejected by all nodes so it will not interrupt a stable leader. The rejection message can indicate whether other nodes are seeing leader heartbeats. If the timed-out node receives evidence that a majority are seeing fresh leader heartbeats it should refrain from attempting to take-over the leadership. If there is no evidence of a stable leader in a majority response to a low prepare it can then safely execute the leader takeover phase. Incomplete evidence will be covered below.

Consider a three node cluster where the link goes down between the leader and one follower. The follower will timeout and issue a low prepare and see the second follower is receiving heartbeats from the stable leader. The timed-out node has a majority with itself and the other follower which it can deduce overlaps with a working majority of the stable leader behind the partial network partition. The timed-out node can then refrain from inducing an unnecessary failover. It can stay up to date by sending retransmit requests to the reachable follower in the working majority until the network is healed.

With clusters larger than three nodes it is possible to get a complex network failure where a timed-out node receives insufficient responses to correctly determine the existence of a working majority behind a partial network partition. In a five node cluster a timed-out node can receive two response giving it a majority of three. If one of those responses shows the stable leader heartbeating then it can see that the leader is a fourth node behind a network partition. It is missing information about the “fifth node” needed to determine if the surviving leader has a working majority. If the fifth node is dead the cluster could halt if the timed out node does not take over. If the fifth node is alive we risk duelling leaders and the partial network partition must be healed to restore the cluster to normal working order.

The “fifth node” corner case just described can be overcome by heartbeating no-operation writes (or strongly consistent reads as described in the next article). Then any timed-out node which can form it’s own overlapping majority can observe the increasing committed slot index through an overlapping node. This provides concrete evidence of a working majority behind a partial network partition to any timed-out node. If it cannot see such evidence of a working majority in a majority response to its low prepare it should then go ahead and execute the leader take-over protocol.

The next post will cover whether read operations have either strong or weak consistency. There is now sourcecode which implements multi-Paxos as described above over on GitHub.

Edit:  I recently came across this overview of consensus algorithms.  The above description and the code on GitHub does “active consensus for Multi-Consensus-Prefix-Ordering”. The ordering is trivially achieved by not calling up to the host application out-of-order.

[1] In applications where operations are known to commute we can run multiple primary ordered versions of the algorithm in parallel by pipelining.