Paxos For Master Leases
A topic not yet covered on this blog series on Paxos are leader/master leases. A quick search through The Part-Time Parliament paper for ‘leases’ won’t find anything; you need to search for “cheese inspector”.
As Paxos prospered, legislators became very busy. Parliament could no longer handle all details of government, so a bureaucracy was established. Instead of passing a decree to declare whether each lot of cheese was fit for sale, Parliament passed a decree appointing a cheese inspector to make those decisions. [The Part-Time Parliament]
It then goes on to say that you pass a decree which grants a given authority over a time duration:
A decree making ∆ ̆ικστρα the cheese inspector might read
2716: 8:30 15 Jan 72—∆ ̆ικστρα is cheese inspector for 3 months
This declares his term to begin either at 8:30 on 15 January or when the previous inspector’s term ended—whichever was later. [The Part-Time Parliament]
Here a decree is an entry in the multi-paxos ledger. Which is to say you use the consensus algorithm to assign a lease over a certain role to a given process in a fault tolerant and consistent manner.
Given that that paper is a notorious allegory we would hope to go to the clarification paper Paxos Made Simple to get some more information about how to use leases. That clarification paper makes no references to leases which confirms that they are not a core part of the Paxos algorithm. Leases are discussed in the 1999 paper How to Build a Highly Available System Using Consensus which brought Paxos to wide attention. This paper was published a couple of years before the Paxos Made Simple paper when the original paper was an unpublished technical report at Microsoft. It very clearly mentions and discusses leases as:
Fault-tolerant consensus is expensive. Exclusive access by a single process (also known as locking) is cheap, but it is not fault-tolerant—if a process fails while it is holding a lock, no one else can access the resource. Adding a timeout to a lock makes a fault-tolerant lock or ‘lease’. Thus a process holds a lease on a state component or ‘resource’ until an expiration time; we say that the process is the ‘master’ for the resource while it holds the lease. No other process will touch the resource until the lease expires. [How to Build a Highly Available System Using Consensus]
It goes on to clarify (emphasis mine):
A lease is most often used to give a process the right to cache some part of the state, for instance the contents of a cache line or of a file, knowing that it can’t change. Since a lease is a kind of lock, it can have a ‘mode’ which determines what operations its holder can do. If the lease is exclusive, then its process can change the leased state freely. This is like ‘owner’ access to a cache line or ownership of a multi-ported disk. [How to Build a Highly Available System Using Consensus]
We will see below that being able to return cached values when holding a lease is helpful. Simply put if you hold the lease no-one else can be updating the same state so you can always return the latest value.
So how exactly can we use leases to speed up Paxos? Can we use leases to speed up safe writes under Paxos? No not directly. If we are performing a write to the cluster, and we don’t exchange messages with a majority of nodes in the cluster, and those nodes don’t flush the write value to disk, then we don’t have a fault tolerant write guaranteed to be consistent across the cluster with the consistency guarantees proven for Paxos. We cannot use leases to relax the constraints needed to perform consensus on a safe write.
We can of course make an application which doesn’t need full fault tolerance for all writes. Many NoSQL databases let you pick how many nodes a write is written to before the write is acknowledged to the client. Such databases typically scale to dozens of server nodes. It would inhibit scalability to send every client write to a majority of nodes of a large cluster. It can be perfectly acceptable to acknowledge a write from a client when it has been replicated to, say, three nodes. Paxos can be used to manage cluster membership and to issue leases for different nodes to master differnet key ranges. Each key range master can then acknowledge a client write to a key it owns once it has replicated it to two additional nodes in the cluster. We then have all client data written to three nodes (a different three nodes for different key ranges) and we can scale the cluster up to be dozens of nodes.
So whilst yes, typical scalable NoSQL stores don’t write to a majority of nodes, they could use Paxos to manage cluster membership and key range leases. Such techniques aren’t intrinsically Paxos. They are applications that manage low volume internal cluster state using a strong consensus algorithm such as Paxos; but manage high volume client state using weaker consistency models. So when leases were first discussed they were explaining how to built cheap scale-out on top of a foundation of strong consistency. That was, and remains, a very valuable technique.
Can we use leases to speed up safe reads under Paxos? Yes. A previous post discussed strong and weak reads. Normally we cannot ensure that reads are ordered with respect to writes under failure conditions without exchanging messages with a majority of nodes within the cluster. If we issue a lease for the role of distinguished leader of the Paxos cluster it means that weak reads work like a read from a local cache that cannot be stale as no other node will update the value until the lease expires. We get the same strong consistency as with strong reads in that they are ordered with respect to pending writes but without having to exchange any messages with other nodes in the cluster to anchor the read with respect to pending writes made by a new leader under any failover scenarios. This approach is described in the the google Chubby paper as summarised on this Morning Paper blog post.
The downside to using leases is that they increase failure recovery time. If a node holding a lease goes dark we need to wait for the lease to expire (plus a safety margin for clock skew) before issuing a new lease to a replacement node. The lease holder which has gone dark may actually be up; there is no way to know that it is truly down. It may be responding to client reads with cached values whilst being unaware that a network partition has segregated it from a majority of nodes. If you don’t wait for the lease to expire and start updating state then you may be causing the out-of-touch current lease holder to be returning stale values.
Optimistic locking or compare-and-swap are examples of a “concurrency safe” application protocols that work fine with cached reads. Such stale-read-safe application protocols are required for typical web applications to work correctly when concurrent users may be performing interleaved read, modify, then update operations to the same data. Any applications already using such standard techniques to handle concurrency safely will work correctly with weak reads; allowing for zero overheads on the read path with fast failover. If reads load dominates writes and there is low contention on the writes then weak reads roundrobined to replicas are more scalable than leased reads.
There are many options here and what is optimal depends on the safety properties of your application but also your read-write contention. This is why leases are not a core part of the Paxos parliament algorithm; leases are just one optimisation. For an excellent description of how to implement master leases safely using only local clocks I can highly recommend the section on that topic in Understanding Paxos.