Introducing Consus: Fast Geo-Replicated Transactions
By Robert Escriva

Consus is a consistent, geo-replicated key-value store. It can run in multiple data centers and allows clients to initiate and run transactions in any data center. This symmetric design enables fast recovery from data center outages, network failures, or dead horses and drunken hunters. Because every data center is on equal footing within the system, a failure allows clients of Consus to transparently use a different data center without an operator on standby to hit the big red panic button.
Consus is a research project from Cornell's Systems and Networking Group that explores new ways to work with data in the wide area. The research contribution behind Consus is a commit protocol that can execute in three one-way message delays in the wide area. If your data centers were equidistant from each other, with a uniform round-trip time of 100ms between any pair of data centers, Consus could commit each transaction with 150ms of wide-area latency. This latency, while non-trivial, is close to the bare minimum necessary to receive any acknowledgement whatsoever that a transaction has safely reached more than one data center; that acknowledgement alone requires at least 100ms to complete a single round trip.
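To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch. This is not code from Consus; the function names and the assumption that a one-way delay is half the round-trip time are mine.

```python
# Back-of-the-envelope sketch (not part of Consus): estimate wide-area
# commit latency when every pair of data centers has the same RTT.

def consus_commit_latency_ms(round_trip_ms: float) -> float:
    """Consus commits in three one-way wide-area message delays."""
    one_way_ms = round_trip_ms / 2.0
    return 3 * one_way_ms

def minimum_ack_latency_ms(round_trip_ms: float) -> float:
    """Any protocol needs at least one full round trip before it can
    learn that a transaction has safely reached a second data center."""
    return round_trip_ms

if __name__ == "__main__":
    rtt = 100.0  # the uniform 100ms round trip from the example above
    print(consus_commit_latency_ms(rtt))   # 150.0
    print(minimum_ack_latency_ms(rtt))     # 100.0
```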
Beyond the commit protocol, Consus is an exercise in the judicious drawing of abstraction boundaries. Internally, the system uses four different variants of Paxos and three more variants of replication that are weaker than Paxos; yet the system as a whole can be understood without fully comprehending how the entire stack fits together. For example, Consus' commit protocol runs Paxos across multiple data centers, with a different member of the Paxos group in each. Ideally, these members would never fail unless the data center containing them fails. We could accomplish this by buying really expensive fault-tolerant hardware, but that would be wasteful; we already know how to build a fault-tolerant machine using Paxos. In the wide area, we need the abstraction of a machine that rarely fails. We can build our system on top of one of these (almost) "magical" machines without regard for its implementation, and then complete the illusion by building a fully fault-tolerant single "machine" without any magic. The system on top knows nothing of the illusion beneath it, and the illusion need not know anything about what sits on top. Yes, Consus puts Paxos in your Paxos so you can replicate while you replicate.
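The layering can be sketched with a pair of hypothetical interfaces. None of these class names come from the Consus sources, and the bodies are placeholders rather than real Paxos, but they show how a wide-area group can be written against a "machine that rarely fails" without knowing whether that machine is a single server or an intra-data-center Paxos group:

```python
# Hypothetical sketch of the abstraction boundary described above; the
# class names are illustrative and do not come from the Consus sources.

from abc import ABC, abstractmethod

class ReliableMember(ABC):
    """A 'machine that rarely fails': the abstraction the wide-area
    Paxos group is written against."""

    @abstractmethod
    def deliver(self, message: bytes) -> None:
        ...

class SingleServer(ReliableMember):
    """One physical server.  Fails whenever its hardware fails."""
    def deliver(self, message: bytes) -> None:
        pass  # hand the message to the local process

class IntraDCPaxosGroup(ReliableMember):
    """A Paxos group confined to one data center.  It presents the same
    interface as a single server, but keeps working unless the whole
    data center fails -- the (almost) 'magical' machine from the text."""
    def __init__(self, replicas: list[SingleServer]):
        self.replicas = replicas

    def deliver(self, message: bytes) -> None:
        # A real implementation would run a Paxos round among the
        # replicas; this is only a placeholder for the illusion.
        for replica in self.replicas:
            replica.deliver(message)

class WideAreaPaxosGroup:
    """Paxos across data centers.  It neither knows nor cares whether
    each member is a single box or a replicated group."""
    def __init__(self, members: list[ReliableMember]):
        self.members = members
```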
Open Source Release
Today marks the first public release of Consus outside of Cornell. The system has seen just under a year of development alongside a preliminary publication of the ideas embodied within the system. The system and the paper are both still evolving; anyone testing the code should consider Consus to be a fun experiment. The natural caveat for a system like this applies: At this time we don't believe Consus is production ready. I mean, it probably won't summon any demons or eat your first born, but it probably also isn't safe to babysit your data full time (yet).
You can find the paper online at arXiv. The paper is standard fare for an arXiv release. It will likely undergo revision between now and its final peer-reviewed publication. Critically, it is missing any evaluation or benchmarks of the system.
Along with the paper, we have updated the Consus homepage with references to the complete source code. Not only does today's release include the complete system source code, but it also includes the release engineering code used to test Consus on a variety of platforms and the sources for the Consus website itself. It is our hope that this openness will be the seed necessary to grow a successful community around Consus: not one in which a company with profit motives releases a half-open product, but one in which those interested in Consus can come together to build something better.
Discussion
My co-authors and I have decided to release Consus in order to gather more feedback about the system, and hopefully accelerate its development. We are currently in the process of testing and performance tuning the system so that we can collect reproducible experimental results for our paper.
Often the evaluation of a system is seen as the single biggest factor determining whether a particular piece of work is worthy of publication or becomes one of the many papers rejected from a conference. The evaluation is important, and gives the reader an idea of how best to understand the strengths of the system. With projects developed by small teams, the evaluation often becomes a race against the clock, and deadlines force a choice between a more thorough evaluation and a more thorough implementation. Not every system needs to be production ready, or even have production readiness as a future goal, but it is important that every system evaluation reflect the ideas tested in the academic work. Too often, individuals reading about research projects have to evaluate the artifact through a layer of paper indirection that obscures what the implementation actually achieves.
With Consus, we wish to change this. Because the source code is public, anyone can freely take the code and check the state of the system. Its capabilities and shortcomings can be freely made known even before the system's eventual peer-reviewed publication. Going outside the normal peer review process may help the small development team behind Consus more rapidly converge on a platform useful both for evaluation and as a production system.
Doing this is not without risk. With all our work in the open, it becomes possible for others to take Consus and build something better on top of what our team has done. I would like to make a personal request to other academics: do your best to take advantage of the opportunities afforded by the publication of Consus without limiting our potential to publish our work in the future. If we were not publishing Consus in its entirety today, the system would remain unavailable behind closed doors until much later in the publication process. This early release allows for quicker iterations of the research process, and allows those outside academic circles to get a peek at what is happening within Cornell. It is our hypothesis that doing so will ease the eventual formal peer review and publication process; it is our hope that doing so will serve as a case study in ways to improve the review process as a whole as artifact evaluation becomes more pertinent to our field.
Summary
Today we have released the Consus paper and accompanying source code. There are too many facets to cover in a single blog post (or even a 12-page paper); if there is one thing to take away from this post, it is that the commit protocol of Consus enables efficient wide-area transactions within 50% of the minimum latency required for any form of safe replication.
In future blog posts, we will explore more about the Consus system, as well as dive deep into individual components, implementation tricks, and core concepts at play.
Future Reading
- Consus Source Code This is the complete source code for Consus.
- Consus Paper This is the current revision of the Consus paper.
- HyperDex Consus is the logical successor to HyperDex. While many of HyperDex's features are not yet implemented in Consus, it takes the lessons learned from HyperDex to create a more robust system that is more amenable to today's geo-replicated environments.
- Spanner Spanner is Google's geo-replicated transactional key-value store. Consus is an independent project, but certainly related to Spanner in elements of its design. The critical difference between Spanner and Consus is that Spanner may require a transaction to incur multiple cross-data-center latencies during its execution and does require a transaction to execute two-phase commit to finish. This results in a minimum of four message delays in the wide area, and potentially many more; Consus incurs three in the common case (see the rough delay comparison sketched after this list).
- CockroachDB CockroachDB builds on top of Raft replication to provide a system similar to Spanner. Like Spanner's Paxos groups, CockroachDB's Raft groups can incur multiple round trips in the wide area during transaction execution and at commit time.
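For a rough sense of what those delay counts mean at wide-area latencies, here is a small hypothetical comparison. It assumes a uniform one-way delay and counts only the minimum delays quoted above, not the extra round trips a real execution might incur.

```python
# Hypothetical comparison of the wide-area message-delay counts quoted
# above; assumes a uniform one-way delay and uses only the stated
# minimums, not measurements of either system.

ONE_WAY_MS = 50.0  # half of a 100ms round trip between data centers

protocols = {
    "Consus commit (common case)": 3,  # three one-way delays
    "Two-phase commit minimum":    4,  # at least four one-way delays
}

for name, delays in protocols.items():
    print(f"{name}: {delays} delays -> {delays * ONE_WAY_MS:.0f}ms minimum")
```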