Cluster Replication With Paxos
by simbo1905
In my last post I made the case for the defence of the Paxos algorithm. In this post we will delve into a tough problem which Paxos solves: state replication across a cluster of replicas using atomic broadcast. It turns out that maintaining primary ordering on secondary servers is straightforward to achieve using Paxos without affecting correctness or efficiency. This may be a surprising result to any reader who has looked at ZAB and Raft. You don’t need an external leader election service as used in the Spinnaker or PaxStore papers. You send the highest committed, and highest proposed, log stream sequence counter during leader failover, and introduce optional, ignorable, retransmission requests. The techniques described in this article are used by the Trex Paxos library.
To be clear I am not claiming any new innovation or invention in this post; it is simply my attempt to demystify the subject. It is my assertion that it is possible to implement the protocol as described in the 2001 paper Paxos Made Simple to make a safe and efficient algorithm. For definitions of the messages and terminology used in this post please refer to that paper.
A common confusion is failing to see that failover safety is designed into Paxos: proposal numbers must be unique to each leader, which is trivially achieved by encoding a node-unique value in the least significant bits of the number used to compare messages. Remarkably, even papers published in late 2013 miss this core feature of the algorithm and choose an unsafe global log index number as the Paxos number. That is a violation of the algorithm, which then forces them to use a slow custom leader election mechanism to regain safety.
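To make the bit-packing concrete, here is a minimal Python sketch of encoding a node-unique value into the least significant bits of the proposal number. The names (`NODE_BITS`, `ballot`, `next_ballot`) are illustrative, not taken from any particular implementation:

```python
NODE_BITS = 8  # assumed limit of 2**8 nodes in the cluster

def ballot(counter: int, node_id: int) -> int:
    """Pack a per-node counter and a node id into one comparable number."""
    assert 0 <= node_id < (1 << NODE_BITS)
    return (counter << NODE_BITS) | node_id

def next_ballot(seen: int, node_id: int) -> int:
    """A number strictly higher than any previously seen ballot."""
    return ballot((seen >> NODE_BITS) + 1, node_id)

# Two nodes bumping past the same observed ballot can never clash:
b1 = next_ballot(0, 1)
b2 = next_ballot(0, 2)
assert b1 != b2 and b2 > b1
```

Because the node id is in the low bits, two leaders incrementing past the same observed number always produce distinct, totally ordered ballots.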
Problem Statement
Paxos is an algorithm for agreeing a value across a cluster. This is known as the consensus problem:
Assume a collection of processes which may propose a value. A consensus algorithm ensures that a single one amongst the proposed values is chosen [Paxos Made Simple]
The algorithm lets a majority of nodes agree one value at a time. Chaining values together into a meaningful protocol to provide a reliable service is left as an exercise to the reader.
It helps to have a simple example in mind, so for the purposes of discussion assume we want to replicate a file-backed map as a trivial key-value datastore. In practice, any client-to-server network traffic can be replicated using the approach detailed in this post. Replication of our simple datastore will be by copying the stream of `put(_,_)` and `remove(_)` operations onto a set of backup hosts. One thing which may cause confusion is that your application may use “value” to mean one thing (something held in the replicated map), but Paxos uses “value” for the next command you are trying to make consistent across the cluster, such as a `put(_,_)` or `remove(_)`.

In our map example the operations do not commute; they need to be applied in the same order at every follower, else their maps won’t match the leader’s. If we label each command with a sequential index we can enforce the ordering. A sequential index also helps us detect gaps in the message stream due to lost messages so we can retransmit them.
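As a sketch of how a sequential index exposes lost messages, here is a hypothetical `find_gaps` helper (an illustration only, not part of any library):

```python
def find_gaps(highest_committed: int, received: set) -> list:
    """Slots between the last committed slot and the highest slot seen
    that have not arrived and so must be retransmitted."""
    if not received:
        return []
    top = max(received)
    return [s for s in range(highest_committed + 1, top)
            if s not in received]

# Committed up to slot 3, then saw slots 4, 6 and 7: slot 5 is missing.
assert find_gaps(3, {4, 6, 7}) == [5]
```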
Multi-Paxos Numbers For Replication
The description of multi-Paxos on Wikipedia (as at late October 2014) includes a counter:

To achieve [multi-Paxos], the instance number `I` is included along with each value. [wikipedia]

The paper Paxos Made Simple clarifies that it is the absolute counter of the number of instances of the algorithm. The leader sends `accept(I,N,V)` where `I` is the instance counter, `N` is the proposal number unique to a leader and which stays stable between leadership fail-overs, and `V` is the value being proposed by the leader for that instance. I am going to clarify the definition of the counter to be more specific:
Let `S` be the logical log stream position counter, that is included along with each value, which must be made contiguous and unique across all committed values at each node.
Each `S` is logically a slot in the commit history into which leaders propose values, for which a value must be chosen, and which must be committed in order. This counter definition is also used in the Spinnaker paper describing transaction log replication with multi-Paxos. Clients are not shown the effect of the command at each slot until after it is chosen and executed (committed) in log index order, after every preceding command has been executed. An explicit log index allows consensus to be performed in parallel on different slots using the same `N`. This allows a leader to stream `accept(S,N,V)` messages, committing them in log order as they become fixed. It is a common misconception that the original Paxos papers don’t use a stable leader. In Paxos Made Simple, on page 6 in the section entitled “The Implementation”, Lamport wrote:

The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner.

This is simply achieved using the Phase 1 messaging of prepare and promises.
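For reference in the rest of this post, the three message shapes can be sketched as plain records. The field names here are illustrative only, not the Trex wire format:

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Prepare:
    n: int  # ballot number, node-unique in its low bits
    s: int  # log slot being probed

@dataclass(frozen=True)
class Accept:
    s: int     # log slot
    n: int     # ballot number under which v is proposed
    v: object  # the client command proposed for this slot

@dataclass(frozen=True)
class Commit:
    s: int  # slot being committed
    n: int  # identifies the unique accept(s, n, v) being fixed
```

Note that `Commit` carries no value: because a leader proposes only one value per `{S,N}` pair, the pair alone identifies the accept being committed.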
Due to leadership changes, a particular value may be proposed into a given slot `S` by one leader then committed by another. Leader fail-overs can also cause different values to be proposed into the same slot by different leaders. Paxos requires that `N` is unique to a cluster node, so there cannot be a clash between two leaders. If during a network partition we have multiple nodes attempting to lead, each of which got a different value accepted into the same slot by a minority, with no majority, the round has failed. Each leader should use a new, higher `N` at the next round to attempt to fix a value into that slot with a fresh `accept(S,N',V)` message. This means that for any given `{N,S}` pair there is only one unique proposed `V`.
Retransmission
The existence of a contiguous log stream counter `S` makes it easy for any follower to detect gaps in the transaction log stream. It can recover the lost messages by responding with a retransmission request to the leader (or any follower) stating its highest committed `S`. All nodes should then keep a history of recently committed sequential `{S,V}` pairs with which they can respond. In addition to sending committed values during retransmission we can also speculatively send a second collection of the latest uncommitted `accept(S,N,V)` messages above the highest committed slot. The receiving node should treat these speculative messages as if they came directly from the leader; it should ignore them if it has made a higher promise. This speculative retransmission will bring the requesting node fully up-to-date.
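A sketch of the retransmission response just described, using hypothetical names (`retransmit`, `history`, `accepts`) rather than any real API:

```python
def retransmit(history, accepts, our_committed, requester_committed):
    """history: {slot: value} of committed values we retain.
    accepts: {slot: (ballot, value)} of uncommitted accepts we hold.
    Returns committed values above the requester's high-water mark,
    plus speculative uncommitted accepts above our own."""
    committed = [(s, history[s]) for s in sorted(history)
                 if requester_committed < s <= our_committed]
    speculative = [(s, n, v) for s, (n, v) in sorted(accepts.items())
                   if s > our_committed]
    return committed, speculative
```

The requester applies the first list directly and treats the second as ordinary accept messages, ignoring any below a promise it has already made.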
Choosing A Value: Fix It Then Commit It
We need to consider when a value is fixed, how that is learnt, and when it is committed at each follower. With Paxos a value cannot change once it has been accepted by a majority of the cluster. Becoming aware that a value has been chosen is achieved by the followers learning from the leader; this implies a commit message. Interestingly, a leader may die before it sees a majority of accept responses; so a value may be chosen, and cannot change, but no one is aware. This scenario will be covered in the next section. (A commit message is not covered in the original papers but is a common optimisation known as a Paxos phase 3 message.)
In the general case client commands don’t commute. The leader chooses the ordering, which is referred to as “primary ordering”[1]. To achieve primary ordering a leader only issues a commit of a value at a log index `S` when it learns that all the values at lower log indexes have been fixed. It then issues a `commit(S,N)` message. Recall that a leader should only propose a unique value for any `{S,N}` pair. This means that a `commit(S,N)` message identifies the unique `accept(S,N,V)` message being committed at log index `S`. Each follower can hold a sequence of `accept(S,N,V)` messages indexed by `S`, and when it receives the `commit(S,N)` it can commit the value of the matching `accept(S,N,V)` if it has been able to commit all values for slots `S' < S`. When there is no message loss a follower will see interleaved accept and commit messages. The leader can stream accept messages such that there can be multiple outstanding values, with the highest committed slot lagging behind. The challenge arises only when the follower sees gaps in the `S` numbers of either accept messages or commit messages, due to lost or reordered messages.
Retransmission will be required if a node missed some accept messages, as the node does not know the `V` being committed at any given log index `S`. If some commit messages were missed and the leader has not changed, a follower may be able to deduce the values committed at the lower slots. This is because a leader cannot change its proposed value at slot `S` for a given `N`; it could not have lost its leadership and regained it without incrementing `N`; and it must have committed the slots in log order. The follower can therefore commit the preceding slots when it receives a `commit(S,N)` where it holds an `accept(S,N',V)` with `N' = N`, if they are contiguous with the last committed slot. Otherwise it can request retransmission of committed accept messages from the leader. This deduction of which values can be committed allows us to batch up many accept messages into a network frame. A single commit message at the front of the next batch of messages which names the last accept of the previous batch will implicitly commit all the values in the previous batch.
The Leader Take-Over Protocol
What happens for slots above the maximum committed log index slot where a dead leader issued accept messages that are not known to be committed? The new leader runs full Paxos rounds to fix values into these slots. It issues a `prepare(N,S)` for all slots not learnt to have been committed. Each follower responds with the highest uncommitted `{N,V}` pair at each slot. The new leader then selects the value with the highest `N` at each slot `S`, which it fixes by sending out a fresh accept message. The fresh accept messages use the higher `N` number of its prepare messages. When it gets a positive majority accept response it commits each slot. The Spinnaker paper refers to this as the leader takeover phase.
If the new leader is not aware of any uncommitted slots it can send a prepare for the slot just higher than the last one it committed. It may be the case that the new leader missed some accept messages from the previous leader and so is not aware of the full range of slots it needs to fix. At least one node in the majority knows the highest slot index proposed by the last leader that must be fixed. The responses to the prepare message can state the highest accepted log index at each node. This allows the new leader to learn the full range of slots it needs to fix. It can then send out additional prepare messages as required.
Probing and filling in the previous leader’s uncommitted slots is a form of crash recovery. The Paxos Made Simple paper says that if the new leader finds a slot with no accept message due to lost messages it should fix a no-op value into that slot. This does not affect correctness, as the new leader is free to choose any value; it just speeds up recovery. There is the question of what `N` number to use in the prepare messages sent during the leader takeover. Normally the last promise will be greater than or equal to the last commit. Message retransmission may cause a fast-forward commit such that the last promise number is lower than the last commit. So the new leader should choose a number higher than the maximum of the numbers it last promised or last committed.
At the point where a new leader takes over, some nodes may not have the full commit history. One node in any majority must have the full history. The new leader can learn who is up-to-date from the responses to the prepare messages, which can state the highest committed log index at each node. If the new leader is not up to date it can still lead, yet probing historic slots that are known to be committed would slow down recovery; it can instead request retransmission from one of the fully up-to-date nodes to get caught up. If the prepare messages are rejected due to higher promises made to other nodes, this too can be learnt from the negative acknowledgements. At the next timeout the node performing takeover can re-issue fresh prepare messages using a higher number than any previously encountered. (Negative acknowledgements are not covered in the original Paxos paper but have been formally studied as a standard optimisation.)
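The takeover value selection can be sketched as below; `choose_values` and `NOOP` are illustrative names under the assumption that each promise response reports, per slot, the highest-ballot uncommitted accept that node holds:

```python
NOOP = object()  # sentinel for the no-op value that fills empty slots

def choose_values(promises):
    """promises: one {slot: (ballot, value)} map per responding node in
    the majority. Returns {slot: value} for the new leader to re-propose
    under its own higher ballot number."""
    slots = {s for p in promises for s in p}
    if not slots:
        return {}
    chosen = {}
    for s in range(min(slots), max(slots) + 1):
        best = max((p[s] for p in promises if s in p),
                   default=None, key=lambda t: t[0])
        # a slot nobody reported gets a no-op, which speeds up recovery
        chosen[s] = best[1] if best else NOOP
    return chosen
```

The new leader then streams fresh accepts for these values using a ballot higher than its last promise and last commit, and commits each slot once a majority accepts.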
Leader Failover
Leader failover requires the detection of a leader’s failure. This can be based on a timeout mechanism. Recall that proposal numbers `N` must be unique to each node; this can be achieved by encoding a node-unique value into the least significant bits. Paxos then makes it safe for any number of nodes to timeout simultaneously. The node with the highest unique bits to timeout which can contact a majority of nodes will lead, although there is no guarantee that a leader will make good progress if other nodes repeatedly timeout and interrupt. Randomising the timeout duration at each occurrence will reduce the probability of wasted messages due to multiple nodes attempting to lead simultaneously. An exponential backoff weighted by each node’s unique number will also reduce the probability of an extended leader battle.
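One possible shape for the randomised, weighted, exponentially backed-off timeout; the weighting heuristic here is an assumption of this sketch, not something prescribed by the Paxos papers:

```python
import random

def next_timeout(base_ms, attempt, node_id, cluster_size,
                 rng=random.random):
    """Milliseconds to wait before the next prepare attempt.
    Higher-numbered nodes get shorter timeouts, so in a duel the node
    with the highest unique bits tends to time out first and win."""
    backoff = base_ms * (2 ** attempt)          # exponential backoff
    weight = 1.0 + (cluster_size - node_id) / cluster_size
    return backoff * weight * (1.0 + rng())     # jitter in [1, 2)
```

Randomising the jitter at each occurrence makes it unlikely that two nodes repeatedly interrupt each other with duelling prepares.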
A timeout strategy requires a heartbeat message. Heartbeating the last commit message made by the leader removes the need to have a commit response message. A follower who has missed messages will request retransmission of committed values when it sees the next heartbeat commit message.
Indirect evidence of a stable leader is also desirable. If a single link goes down between a follower and a stable leader then it would be disruptive if the follower issues a high prepare message as it may interrupt a stable leader. Worse yet under a complex network partition the timed out node may not itself be able to lead.
Consider a three node cluster where the link goes down between the leader and one follower. The follower will timeout and issue a low prepare; in the responses it has the opportunity to see that the node it can connect with is in touch with the stable leader. The timed-out node can then refrain from inducing an unnecessary failover. It can stay up to date by sending retransmit requests to the reachable follower in the working majority until the network is healed. Edit: Seeing only heartbeat commits flowing around a partial partition is only evidence that traffic can flow in one direction; it does not confirm that traffic can flow back to the leader.
In a five node cluster a timed-out node can receive two responses, giving it a majority of three. If one of those responses shows the old leader heartbeating then it can see that the leader is a fourth node behind a network partition. It is missing information about the “fifth node” needed to determine whether the surviving leader has a working majority. If the fifth node is dead the cluster could halt if the timed-out node does not take over. If the fifth node is alive we risk duelling leaders, and the partial network partition must be healed to restore the cluster to normal working order.
The “fifth node” corner case just described can be overcome by a stable leader heartbeating no-operation writes (or strongly consistent reads as described in the next article). Then any timed-out node which can form its own overlapping majority can observe the increasing committed slot index through an overlapping node. This provides concrete evidence of a working majority behind a partial network partition to any timed-out node. If it cannot see such evidence of a working majority in a majority of responses to its low prepare, it should go ahead and execute the leader take-over protocol. Edit: This technique is similar to the one described in the later post Pre-voting in distributed consensus.
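A sketch of that decision rule, under the simplifying assumption that a committed slot index ahead of what the timed-out node last saw counts as evidence of a working leader elsewhere:

```python
def should_take_over(prior_commit_index, responses, cluster_size):
    """responses: the highest committed slot reported by each reachable
    node, including ourselves. Only run the take-over protocol when we
    can hear a majority and none of them show the commit index still
    advancing behind a partial partition."""
    majority = cluster_size // 2 + 1
    if len(responses) < majority:
        return False  # minority partition: taking over cannot succeed
    return all(r <= prior_commit_index for r in responses)
```

This is why heartbeating no-op writes matters: without them a working but idle leader produces no advancing commit index for this check to observe.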
The next post will cover whether read operations have either strong or weak consistency. There is now source code which implements multi-Paxos as described above over on GitHub.
Edit: I recently came across this overview of consensus algorithms. The above description and the code on GitHub does “active consensus for Multi-Consensus-Prefix-Ordering”. The ordering is trivially achieved by not calling up to the host application out-of-order.
[1] In applications where operations are known to commute we can run multiple primary ordered versions of the algorithm in parallel by pipelining.
“Chaining values together into a meaningful protocol to provide a reliable service is left as an exercise to the reader.” – Is the exercise going to be multi-paxos for this that is described above?
Apologies that my words here were meant to be dark humour. The phrase “left as an exercise to the reader” as explained on this answer http://english.stackexchange.com/a/60960 is often used in textbooks to force the student to work it out themselves. My intention was to make a joke by suggesting that Leslie Lamport had deliberately left out pragmatic details of how to apply Paxos to real world problems to force people to understand the problem of consensus and his solution better by being forced to work out the details for themselves. Removing the attempt to make a joke this sentence could be rewritten as “Actually applying this algorithm to perform replication is not explained in the original papers.”
See also my latest post at https://simbo1905.wordpress.com/2016/01/02/paxos-uses-leaders-multi-paxos-is-paxos/ which says that multi-paxos is paxos which is the algorithm I am describing above.
Nice article, thanks for sharing it.
Instead of a separate heartbeat mechanism, I have also found it simpler just to commit no-ops periodically. The leader is then taken to be the owner of the value in the last-committed instance (as long as it was committed sufficiently recently). This clearly demonstrates that the leader can contact a quorum of acceptors, which is an important property for the heartbeat mechanism to have – in fact this idea grew out of adding that functionality separately and then realising that it was basically the same as the commit logic, as you rightly point out when talking about weird failure modes in the 5-node case. It also means that under load you don’t need to worry about heartbeats as the fact that values are being committed is enough to indicate the leader is alive. One advantage that you don’t mention here is that it also gives you the ability to perform an abdication (a deliberate transfer of leadership to another specific node) which is much quicker than a timeout-based failover when you are deliberately taking a node down for maintenance.
Your point about ensuring leader stability is a good one too. We do this by only starting to become leader once we are in touch with a quorum of other timed-out nodes, which is subtly different from your technique of trying to become leader if you have no evidence of a working quorum elsewhere: if you’re on the minority side of a partition then there’s no point in taking any further action.
Cheers,
You are welcome!
I agree with your point that committing a no-op as a heartbeat is more efficient, as seeing a no-op commit proves a leader can commit because it is in contact with a quorum. That future feature is listed on the Trex README.md on GitHub as “noop heartbeats (less duels and partitioned leader detection)”. That is probably not the most descriptive name for the technique. I am not sure I understand your points about abdication. Could you explain a little bit more?
I am not sure I understand the wisdom of trying to “ensure you are in touch with a quorum of nodes” when you can just send out a Paxos prepare to learn whether you are in touch with other nodes. Can you elaborate a bit more about how you ensure a node can contact a quorum before it needs to, when it times out on no-op commits?
I just re-released the post and realised that I was talking about the technique of sending noop values that you highlight when I wrote about “heartbeating no-operation writes”! So I will try to reword that paragraph.

On abdication: if you want to perform maintenance on the leader, you could just take it down and let the remaining nodes sort out a new leader amongst themselves, but this involves a window of unavailability waiting for one of them to time out and complete the leader election. Unavailability due to failures is unavoidable in general but it is undesirable for maintenance activities. You can avoid the timeout delay by having the leader explicitly abdicate to another node which becomes leader without having to timeout first.
To achieve this, we define the leader to be simply the owner of the most-recently-chosen proposal (as long as it was chosen recently enough). Abdication is performed by having the outgoing leader send out a PREPARE (aka phase 1a) message for a proposal that is owned by the new leader. The responses (PROMISE aka phase 1b messages) go to the new leader and once it has received a majority it immediately starts to propose values, which makes it the leader.
It is perhaps not totally obvious that sending out a phase 1a message on behalf of another node is safe, but if you read the proof carefully you’ll see that phase 1a messages actually have very little to do with safety: they are really only there for liveness.
With your description I am afraid I have more questions about the idea of “owning” messages and their routing. Perhaps a detailed blog post on the subject could help me to visualise how to do high performance take-overs?
I have no blog – I keep meaning to start one but can’t promise it’ll be any time soon 😦
I suspect it’s just a terminology problem – I call them proposals but you’re calling them ballots (in TRex at least). Replace that word and see if it helps. The owner of a proposal (i.e. ballot) is just the `nodeIdentifier` in its `BallotNumber` [1] to use the TRex terminology.
I think ref [2] points to the line that routes the PROMISE message (aka `PrepareAck`) back to the original sender of the PREPARE message. The change to routing is just to send the PROMISE back to the _owner_ of the proposal instead. Mostly they will be equal, but when abdicating they’re not. This is most of what you need to do in order to send out a PREPARE on behalf of another node.
[1] https://github.com/trex-paxos/trex/blob/master/library/src/main/scala/com/github/trex_paxos/library/PaxosProtocol.scala#L57
[2] https://github.com/trex-paxos/trex/blob/master/library/src/main/scala/com/github/trex_paxos/library/PaxosProtocol.scala#L165
Quote: I am not sure I understand the wisdom of trying to “ensure you are in touch with a quorum of nodes” when you can just send out a Paxos prepare to learn whether the you are in touch with other nodes.
Good question, it’s quite subtle and there may be other solutions, but it was a struggle for us to prove liveness without doing this.
The issue we had with going straight to sending out a PREPARE message is to do with choosing the proposal number to include in the message. If your cluster is leaderless then it’s no problem, you just pick one larger than any you’ve previously seen, but if the cluster has a leader (that you cannot contact) then this causes problems: you will send this message to yourself and send yourself a PROMISE back, after which point you can no longer take part in votes for the proposal number that the leader is currently using, even if you are reconnected, because you’ve promised a larger proposal number. To bring you back into full service the leader has to re-run phase 1 for a still-larger proposal once you are reconnected.
This is not too bad if you’re completely isolated but it causes some issues if you are in a 2-node partition of a 5-node cluster as you will each enter a loop of sending PREPARE messages with higher and higher numbers without ever achieving a quorum. Hopefully you don’t run out of numbers!
We also had issues related to partial partitions and other general flakiness where you could end up with the leader repeatedly running phase 1 for increasingly large proposals in an apparently stable fashion even after the partition was healed. We fixed the cases we found, but could not prove we had caught all the cases without limiting the PREPARE messages only to be sent once we had evidence that the whole cluster had lost its leader.
As I said, there may be other solutions, but this was the simplest one we could think of that supported a decently straightforward liveness proof.
Cheers,
David
And in terms of ‘how’: when nodes time out they broadcast an I-AM-STUCK message and they also start collecting I-AM-STUCK messages from their peers. They only send a PREPARE message once they have received enough I-AM-STUCK messages. We also expire these messages after a small multiple of the timeout period as we are trying to determine whether a node can _currently_ contact a quorum, and without expiry a shifting partition could confound this.
The other possible response to an I-AM-STUCK is I-CAN-HELP which comes from a node that is still making progress and can provide a more up to date copy of the cluster state. If one of those arrives then a node knows not to try and become leader until it has caught up. These messages also expire, in case the only helpful node dies before it can be of assistance.
Forgot to mention above that avoiding re-running phase 1 is important for performance, because it causes a pipeline stall. Ideally a network wobble with a minority of nodes shouldn’t affect performance in this way.
May I ask what you would do in the following client-involved scenario?

Let’s say we have 5 nodes, n1, n2, n3, n4, n5, and n1 holds the leadership, and a client wants to write an incr command, say “incr key=a”:

1. n1 sends accept requests to all of the nodes, say accept(pn, index=15, value=”incr key=a”), where the index represents the 15th slot.
2. All of the nodes receive the accept request and send the accept ack message back to n1.
3. But due to network latency, n1 does not receive any of the ack messages, so n1 catches a timeout exception and sends a fail state to the client.
4. The client knows that this write operation failed, and may retry again.
5. Now n1 crashes, then n2 becomes the new leader. n2 knows that there is an accepted value in slot 15, so n2 chooses that value and executes the incr command.
6. The client issues a write operation to write the incr command again, then n2 chooses it, so the incr command is executed twice.

In step 4 the client knows that this write operation failed, but in step 5 n2 will choose the command, so we are unsynchronized. Even if we do not respond to the client until we have the value chosen (a majority of nodes mark the value as chosen, not only accepted), we can still reach a state where what the client was told and the eventual state of our cluster are unsynchronized.

In Paxos that’s OK, because the command will be chosen eventually and can’t be changed any more. We can ask the client to do something to address such an unsynchronized problem: the client assigns a request_id to every choose request. If the client fails in writing some commands, it issues them again with the same request_id, and n2 looks up the request_id before it sends accept requests. If n2 finds that a write operation with that request_id has finished, it returns success to the client, otherwise it returns fail.

What is your opinion?