pocdb is a simple Paxos-based key-value store designed to allow people to easily develop a working knowledge of the Paxos algorithm. In this post, we will walk through the basic overarching design of pocdb and the principles behind this design.
Straw Man Design
The Paxos algorithm is often intimately tied to the notion of recording a log of operations. Under this paradigm, it is easy to build a key-value store using off-the-shelf Paxos libraries by shunting all reads and writes to the Paxos algorithm and then performing them in the order in which they appear in the log.
This design suffers from poor performance. By passing all operations through the replicated log maintained by Paxos, the system is inherently limited to the throughput of a single Paxos group, which is often (but not always) limited to the throughput of a single server. Operations are durably logged to disk by the Paxos implementation, and then again written to disk when transferred from the log to the key-value store.
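As a toy illustration of this straw-man design, the sketch below funnels every operation through a single in-process list standing in for the Paxos-replicated log; the class and method names are hypothetical, not taken from any real implementation.

```python
# Straw-man sketch: one totally ordered log for ALL keys. The list stands
# in for a Paxos-replicated log; in a real system each entry would be
# durably written once by Paxos and again when applied to the store.

class LogBackedStore:
    def __init__(self):
        self.log = []    # single ordering point: the throughput bottleneck
        self.store = {}  # the key-value state machine the log drives

    def put(self, key, value):
        # every write, for every key, contends on the same log
        self.log.append((key, value))
        self._apply(self.log[-1])

    def _apply(self, entry):
        # second write: transfer the log entry into the key-value store
        key, value = entry
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)
```

Even in this toy form, the bottleneck is visible: every `put` serializes on the one `log` list, regardless of which key it touches.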
For key-value stores whose API solely deals with single key operations, there is a much better alternative that can remove many of the bottlenecks of this straw man design, and be simpler to implement as well. Instead of using one instance of Paxos for all keys, we can use one instance of the Paxos protocol per key. As we shall see, this greatly reduces implementation complexity by avoiding some of the more difficult aspects of the Paxos protocol.
In pocdb, each key is its own Paxos-backed state machine. Each slot, or log entry, agreed upon by Paxos corresponds to a different version of the key written to the data store. To write a new version of a key, the data store picks the next lowest unused slot and performs a round of Paxos to agree upon the value for that slot. The system performs multiple rounds of Paxos in parallel so long as the rounds are for different keys. This has the same effect as pipelining or batching in the straw man design.
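A minimal sketch of this per-key layout, under assumed names (`PerKeyPaxos`, `next_slot`): each key owns independent Paxos state, and each agreed-upon slot is one version of that key.

```python
from dataclasses import dataclass, field

@dataclass
class PerKeyPaxos:
    # state for one key's private Paxos instance; the slot number
    # doubles as the key's version number
    accepted: dict = field(default_factory=dict)  # slot -> value

    def next_slot(self):
        # the next lowest unused slot is the next version to write
        return max(self.accepted, default=0) + 1

store = {}  # key -> PerKeyPaxos; instances share no state, so rounds
            # for different keys can proceed in parallel

def write(key, value):
    inst = store.setdefault(key, PerKeyPaxos())
    slot = inst.next_slot()
    # a real round of Paxos (Phases 1 and 2) would agree on this slot;
    # here we just record the outcome
    inst.accepted[slot] = value
    return slot
```

Because `store['x']` and `store['y']` are disjoint objects, writes to different keys never contend, which is what replaces pipelining or batching in the straw-man design.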
To achieve this behavior, pocdb slightly extends the traditional Paxos Synod protocol. For each write that comes into the system, a pocdb node performs both phases of Paxos to complete the write. During Phase 1, when a ballot leader gathers promises to follow the ballot, the leader also determines which version it can write. An attempt to write an old version will fail Phase 1, just like an attempt to gather promises with too low a Paxos ballot. At the end of Phase 1, the ballot leader knows that a quorum of servers have promised to follow its ballot, and that the version number it is writing is at least as high as all previous version numbers written for that key.
During Phase 2 of Paxos, the ballot leader issues its write to the quorum using the ballot and version chosen in Phase 1. If the ballot has not been pre-empted by a higher ballot, the Phase 2 message will succeed; otherwise, the acceptor sends a retry message back to the leader, which retries the process from the start.
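The two phases described above can be modeled in a single process as follows; the class and function names are illustrative and the message plumbing is elided, so this is a sketch of the protocol, not pocdb's code.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0   # highest ballot promised so far
        self.version = 0    # highest version accepted for this key
        self.value = None

    def phase1a(self, ballot):
        # promise to follow the ballot and report the highest version
        # seen, or reject if the ballot is too low
        if ballot <= self.promised:
            return None
        self.promised = ballot
        return self.version

    def phase2a(self, ballot, version, value):
        # accept unless a higher ballot has pre-empted this one
        if ballot < self.promised:
            return False
        self.version, self.value = version, value
        return True

def leader_write(acceptors, ballot, value):
    quorum = len(acceptors) // 2 + 1
    # Phase 1: gather promises and learn the highest written version
    replies = [a.phase1a(ballot) for a in acceptors]
    promises = [v for v in replies if v is not None]
    if len(promises) < quorum:
        return None  # pre-empted: the caller retries with a higher ballot
    version = max(promises) + 1
    # Phase 2: write the new version under the same ballot
    acks = sum(a.phase2a(ballot, version, value) for a in acceptors)
    return version if acks >= quorum else None
```

Note how the version check rides along with the ordinary ballot check: a stale writer is rejected in Phase 1 exactly as a stale ballot would be.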
Don't Special-case Failure
The strength of the Paxos algorithm is that the path taken by the leader is the same during normal execution as in the presence of failures. The logic for writing a single key is less than 100 lines of code: a loop that is periodically executed to send either a Phase 1a or Phase 2a message to the other hosts until those hosts reply with Phase 1b or 2b messages. This loop need not special-case failure; it will operate correctly whether the other hosts are slow, failed, or completely operational.
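A sketch of that failure-oblivious loop, with acceptors modeled as hypothetical callables that may return None (a slow or dead host simply never replies); the names here are illustrative.

```python
def gather_promises(acceptors, ballot, quorum, max_rounds=10):
    # The leader's Phase 1 loop: periodically resend the 1a message to
    # every host that has not yet replied, until a quorum of 1b replies
    # arrives. There is no separate branch for failed hosts.
    promised = set()
    for _ in range(max_rounds):
        for i, acceptor in enumerate(acceptors):
            if i in promised:
                continue
            reply = acceptor(ballot)   # None models a slow or failed host
            if reply is not None:
                promised.add(i)
        if len(promised) >= quorum:    # same success path in all cases
            return promised
    return None
```

A dead host is just one whose callable always returns None; the loop reaches quorum (or gives up) without ever asking which hosts are up.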
What's the Catch?
pocdb is not a perfect design. In particular, it is missing several features that would find their way into a more complete system. These features do not require changing the fundamental design of pocdb, but they become tedious pieces of code to implement and maintain; in the interest of keeping pocdb simple and readable, they were omitted (or left as exercises for the reader).
Consistent get operations: The current implementation does not read consistently. This was done to keep the implementation time under three hours. The simplest way to add consistent reads would be to convert each read into a write of the current value; this passes the read through Paxos and guarantees consistency. There are multiple protocols for adding consistent reads to Paxos that do not require a full round of Paxos to learn the value, and these could be added as well.
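The read-as-write trick can be sketched as below, where `paxos_write` is a single-process stand-in for a full round of Paxos; all names are hypothetical.

```python
versions = {}  # key -> list of committed values, one per slot

def paxos_write(key, value):
    # stand-in: a real implementation runs Phases 1 and 2 here and only
    # returns once a quorum has accepted the value
    versions.setdefault(key, []).append(value)
    return value

def local_read(key):
    # a plain local read: possibly stale on a lagging replica
    return versions.get(key, [None])[-1]

def consistent_get(key):
    # convert the read into a write of the current value; because the
    # write must pass through Paxos, the returned value is consistent
    return paxos_write(key, local_read(key))
```

The cost is clear from the sketch: every consistent read burns a full write's worth of work, which is why cheaper read protocols are attractive.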
Delete operations: The pocdb API has no method by which a key can be removed from the system once added. This was intentional, as the state of the key stored in the data store is also used as the acceptor state; deleting the key safely would require additional mechanism to keep the data consistent. One straightforward way to implement delete operations would be to maintain tombstones for deleted keys, along with a special key holding metadata about the minimum version at which new keys can be written. The system could update the minimum-version key periodically and enforce this minimum when processing Phase 1a messages. Any tombstones below this version can then be safely removed.
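A sketch of that tombstone scheme, under assumed names (`KeyStore`, `min_version`): deletes write tombstones, the metadata minimum version is enforced when processing Phase 1a messages, and raising it garbage-collects older tombstones.

```python
TOMBSTONE = object()  # sentinel marking a deleted key

class KeyStore:
    def __init__(self):
        self.data = {}        # key -> (version, value)
        self.min_version = 0  # the special metadata "minimum version"

    def delete(self, key, version):
        # a delete is a write of a tombstone, not an actual removal
        self.data[key] = (version, TOMBSTONE)

    def phase1a_allows(self, version):
        # enforce the minimum version when processing Phase 1a messages
        return version >= self.min_version

    def raise_min_version(self, version):
        self.min_version = version
        # tombstones below the minimum can never be resurrected,
        # so they are safe to remove
        self.data = {k: (v, val) for k, (v, val) in self.data.items()
                     if not (val is TOMBSTONE and v < version)}
```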
More compact data representation: The pocdb data representation stores both acceptor and value state in the key-value store, with redundant information. It is possible to reduce this overhead and consolidate the data to reduce the number of writes performed.
Learn messages: While pocdb can reply to the client in two round trips, it requires three round trips to fully commit the data to disk. With a more compact representation, and an altered, more expensive read path, the entire write could happen in two round trips.
Multi-threading: The current system is single-threaded and performs all disk reads and writes within that thread. It could be extended to be multi-threaded by providing in-process locking for each key while performing any acceptor operations. This would keep Phase 1a and 2a message processing atomic, and would only limit concurrency for operations on the same key. In the same vein, there is a (noted) race condition when learning values: the value is blindly written to disk, when there should be a check that it does not overwrite a newer value. This is a read-modify-write and was omitted for time; a simple get before the put would fix the race condition.
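The learn-path race and the get-before-put fix can be sketched as follows; the dict stands in for the on-disk store, and the function names are illustrative.

```python
disk = {}  # key -> (version, value); stands in for the on-disk store

def learn_racy(key, version, value):
    # the noted race: a blind put may clobber a newer version
    disk[key] = (version, value)

def learn_safe(key, version, value):
    # read-modify-write: the get before the put prevents the overwrite
    current = disk.get(key)
    if current is not None and current[0] >= version:
        return  # a newer (or equal) version is already stored
    disk[key] = (version, value)
```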
Paxos group changes: There is no mechanism for handling a change of group membership. A simple protocol to correct this would be a two-phase commit across the nodes to change the group. Between the two phases, acceptors would halt writes in order to avoid inconsistency. More complicated mechanisms, such as those used in Consus, allow writes to continue during a group membership change without hindering availability.
pocdb provides a basic skeleton for a scalable key-value store. The code is intended to provide a basic understanding of Paxos and illustrate one way in which Paxos can be used that is different from the replicated log employed by most Paxos-based systems.
pocdb code: The pocdb code is on GitHub. The repository will not be updated, as the point was to implement a proof of concept in a limited amount of time.
Paxos made moderately complex: Those looking for a more thorough understanding of Paxos should check out paxos.systems for a detailed description of Paxos, its invariants, and its implementation. The website and the accompanying paper break the Paxos protocol down into easy-to-digest pieces that are much more approachable than the original Paxos paper.
Egalitarian Paxos: Egalitarian Paxos improves upon the performance of Paxos by improving wide-area commit latency and avoiding the single-server bottlenecks alluded to above.
Consus: pocdb is not intended as a production-ready system. It has multiple (intentional) shortcomings designed to improve readability at the expense of performance or functionality. Consus is a more complete key-value store that my advisor and I are developing. It uses multiple distinct implementations of Paxos to provide strong, transactional guarantees in the wide area. Because Consus is intended to be more than a proof of concept, its implementation is significantly more complex and not intended as a starting point for learning about Paxos.