I have a slightly off-topic question for Diego and other people experienced with distributed systems.
Why are consensus algorithms always developed as systems, not as libraries? Zookeeper, etcd and LogCabin all operate as a cluster of processes which other nodes connect to over a client library.
I can imagine the distributed state-machine-replication mechanism of Raft or ZAB being implemented as a library where the user has to provide an implementation of the communication layer. Such a library could be used as a starting point for building other, more complex systems which aim to provide a low-friction install experience. For example, one good thing about both Cassandra and ElasticSearch is that they have homogeneous clusters where all nodes play the same role. Incidentally, from what I understand, they both embed a consensus implementation within.
Similarly, a membership service (and failure detector) over gossip protocols will also be very useful.
An installation guide which starts with "First, install Zookeeper cluster. Then, install a SWARM cluster. Then ..." is not very appealing. That being the case, I wonder why there is no mature OSS library which provides these services. What does HN think about this situation?
This question came up at the CoreOS Fest earlier today too. I think everyone wants libraries as well as services, and I'd like that too for LogCabin one day. It's just a bit harder when you start thinking about the details. It's not impossible though. As one example, CockroachDB is using etcd's Raft implementation as a library.
Part of the issue is that we want to have libraries with small simple interfaces, and a consensus library is probably going to be on the large side, as it has to interface with the disk, the network, and the state machine, plus membership changes, log compaction, who's the leader, configuration settings, debug logging, etc.
Another issue is that, as a library, it has to be in the language you're using. And if that language is C++ or similar, it has to be compatible/convenient with whatever threading approach you're using.
Then there's performance/memory. Some Raft implementations are designed to keep relatively small amounts of data in memory (like LogCabin or etcd), and others are meant to store large amounts on disk (like HydraBase). Some are optimized more for performance, others for safety.
I think we'll get libraries eventually. Keep in mind Raft is still very young. Paxos is around 26 years old; Raft is only 0.5 to 3 years old (depending on when you start counting). I like to think that Raft lowered the cost of developing a consensus system/library significantly, but it still takes time to develop mature implementations. Right now we have a lot of implementations in early stages; I wonder if some of these efforts will consolidate over time into really flexible libraries.
Thanks Diego! From your comment and the sibling comments, what I gather is that the idea of a library is not fundamentally flawed; it's just that the state of the art has not matured to that level.
In my experience, consensus algorithms are developed as algorithms first, (reference) systems second, and libraries third. The reasons are simply that the algorithm is the most important part, and good libraries are hard. Diego has rightly focused on the algorithm, which has reached maturity before this LogCabin release.
I work on CockroachDB, which uses etcd's Raft implementation, so we are an existence proof that consensus libraries are possible. However, the library we share was the second attempt by the authors (Xiang Li and Yicheng Qin) to create a Raft implementation. The first attempt to develop a "library form" of a consensus algorithm is likely to reach a dead end, but over time it is possible to develop a library-style implementation.
A "library" that wants to have a process of some sort continuously running hits a tension: the more cross-language generic you try to make the library, the less possible it is to work out a threading model that everyone can use; the less language-generic you make it, the smaller your target audience, which has non-linear effects on the usage and development resources it can attract. Languages that include a "blessed" runtime model like Node or Go can easily ship such "daemonized libraries", but those are generally impossible to bind to for anybody else. Using POSIX threads, by contrast, will lock out a lot of languages that have very sophisticated runtimes, be very difficult to use in others, and, even in C, when you get right down to it, there's no guarantee they'll play nicely with the rest of the C program.
Distributing a service invokes the sort of lowest-common-denominator solution to this problem, which is "OS process". Everyone can talk to an OS process.
(This is an explanation, not advocacy or celebration.)
Well etcd started out using the goraft library, so this isn't always true, but I can think of some reasons I'd prefer to have the quorum as a separate system:
Writing it as a service lets you implement it once in a high-level language like Java or Go and use it from all languages. To get the same portability in a library you'd have to write it in C. It's hard enough to maintain consistent data without worrying about memory corruption bugs.
The protocol implementation is only part of durable consensus; you also need durable storage, sensible network timeouts, shutdown handling, etc.
You usually want different configurations for your application and your quorum. For the quorum you usually want 3, 5 or 7 members, while your application could be anywhere from one to hundreds of instances. Your quorum members must always know how many other members are supposed to be in the quorum, while your application could remove or add instances on the fly. For the quorum you need low latency, while you may want to optimise your application for throughput. (e.g. disks, garbage collection, swapping)
Service Fabric does this. You provide the stream of operations and Fabric replicates them using distributed consensus.
Databases can be hosted on the framework. DocumentDB is, for example. While at MS, I wrote a near-real-time metrics system using Fabric and worked on a tunable-consistency distributed cache built on Fabric.
The Rust Raft library[0] (not my project) is a library; you must specify how the store[1]/statemachine[2] works under the hood, though it comes with default implementations for both. I'd like for it to abstract over communication channels too, but currently it's tightly wired to an RPC and IO library and I don't have the time to try and fix that.
I'm not a C++ programmer, and work mainly with Go. I'm curious to know if it is usual for C++ developers to implement their own event loops for network transports, as Diego has done here [0]. The other example I know is Replicant [1], which is used by HyperDex, and it uses a custom event loop too [2].
LogCabin uses its event loop for network operations but then hands requests off to threads to process. I started out with libevent2, but the problem is it doesn't deal with having multiple threads very well (error-prone and inefficient). It's also not as well-documented as the man pages for epoll, so I ended up using epoll directly instead. What was really lost in the libevent2 -> epoll conversion was platform independence, but I think it might be better to get that back through well-placed #ifdefs or relying on some other library; I wouldn't go back to libevent2.
We're using libev (actually libev++ which is cross-platform, has a decent underlying C impl, and a C++ bridge API). It seems to work well and is fast and cross platform. It can be a little tricky to figure out what is going on with the C++ API sometimes due to how the author tacked it on using macros.
Yes -- it is very straightforward to use and "just works". (It is also used in a similar way as found in Node in the luvit project: https://github.com/luvit/luvit)
This is fairly common. I've done professional work with Go and C++ and this is one of the biggest reasons I like Go over C++. C++'s concurrency primitives aren't quite good enough so you always need to build something on top of them. Using a library in C++ often requires grokking how it does concurrency and reconciling that with the way your code does concurrency. In Go everyone uses goroutines so you can very quickly understand how to use libraries.
C++ developers don't have to make their own event loop abstraction; they can use epoll, which is provided by the kernel. That is still an area where C/C++ (and Rust) have an edge over Go: they can use syscalls directly without requiring the goodwill of one of the language developers.
1) Ah, cool, creator of the Raft algorithm, so some of the 'obvious' mistakes in an implementation should've been resolved by now (though if people weren't trying to use it in production... who knows).
2) Great, C++, it should be efficient and fast with consistent RAM usage (Go's GC is a bit.... eh... still).
3) Oh, you need a C++ client library. :(
I would love to say that APIs don't matter, but they do. So, so, so much. If they didn't, etcd would never have had a chance against Zookeeper. The Zookeeper folks are looking at adding RESTful APIs to allow functionality à la etcd, because it's obvious a convenient API is a huge win. Any distributed system solution attempting to gain steam should consider this from the beginning now, as the bar has been set.
I think I'd have to agree with you now: REST APIs seem to help with adoption. LogCabin was initially created for use with RAMCloud ( http://ramcloud.stanford.edu ), which mostly hand-rolls its RPC serialization to achieve its extreme performance goals (it budgets about 1 microsecond in software overhead per RPC). I thought I was being user-friendly in LogCabin by using protobufs, and at the time, something as embarrassingly slow as HTTP+JSON was unthinkable. I don't think it's too late to add a REST API to LogCabin, and it'd be pretty easy to make a REST proxy. Maybe that's worth doing for easier adoption from other languages. I also think the CLI client makes the barrier to entry pretty low for people that just want to test it out.
Tons of people are super comfortable with HTTP and have tooling built around testing and debugging it, even if the performance is crappier.
It's an ugly cousin of premature optimization that you have to deal with if you're producing APIs for your software. If the API requires anything that isn't braindead simple, you are going to lose out to competitors due to the learning curve.
diego, thanks so much for raft! i'm a student in the brown class you shout out, and i can testify as to its relative simplicity and the clarity with which you guys communicate the ins and outs.
i have a question for you, though. why is raft not concerned with byzantine failure? the focus on byzantine fault tolerance from the paxos family of algos (and a lot of the literature/educational material on distributed consensus) makes me feel like it's important, but your approach suggests it perhaps isn't. do you think this focus is a side-effect of the ubiquity of paxos, which is disproportionately concerned with this due to its roots in academia?
It's a good question, and I don't really know where the community as a whole sits on Byzantine vs non-Byzantine. A few thoughts:
Byzantine is more complex, and most people in industry aren't doing it: there are a lot of Byzantine papers out there but few real-world implementations. I think Byzantine is important for uses where the nodes really can't be trusted for security reasons, and maybe there are easier fault-tolerance payoffs elsewhere when the entire system is within one trust domain such as a single company.
Byzantine consensus is slower and requires more servers: tolerating f Byzantine faults takes 3f+1 servers, versus 2f+1 for crash faults.
If you don't have independent implementations running on each of your servers, the same software bug could still take out your entire cluster. You get some benefit if the hardware fails independently, but you don't get protection from correlated software outages. Maybe the difficulty in characterizing which faults a particular deployment can handle makes it harder to sell to management.
With Raft, we were just trying to solve non-Byzantine consensus in a way people could understand, and we think it's still a useful thing to study even if your ultimate goal is Byzantine consensus. You might be interested in Tangaroa, from the CS244b class at Stanford, where Christopher Copeland and Hongxia Zhong did some work towards a Byzantine version of Raft [1][2] and Heidi Howard's blog post on it [3]. But really, Castro and Liskov's PBFT is a must read here [4].
You end up needing consensus for a lot of fault-tolerant systems that need to provide consistent results. For example, if your system allows users to choose their own usernames, and you're trying to guarantee that all usernames are unique, and your system needs to automatically deal with server failures, then I think you also need consensus.
Another way to think about it is that consensus gets you the equivalent of a compare-and-swap operation in a distributed setting. Just as compare-and-swap is useful for building synchronization primitives with shared memory, consensus is useful for building synchronization primitives across a network.
When you are building distributed systems often you need a way to coordinate between nodes. You can use a single node to do it, but then you have a single point of failure.
I come from an academic lineage of log-based projects, from log-structured filesystems [1] which structure disks as a log, to RAMCloud [2][3][4] whose durability/recovery aspects are a distributed and partially in-memory extension of that, to Raft and LogCabin that are built around the concept of a replicated log for consensus.
LogCabin used to export a log-oriented data model, by the way, where the name made a bit more sense even. There was some talk of renaming it to TreeHouse now that it exports a key-value tree, but that one didn't really catch on.
Maybe, but I don't think it matters much. The comment there explains the issue, and I vaguely remember my measurements showing that this wasn't a big deal.
I do think it's an interesting (dare I say) limitation of epoll, and something I'd think about if I was redesigning epoll.
Please file an issue on github if you have ideas on how to improve this without adding significant complexity.
Yes, good point. I welcome Mr. McCaffrey to give it a spin, if he's so willing. And though there may be critical bugs left to find, I don't feel like I've misrepresented the current state of LogCabin.
Nice work! A Curator-like (http://curator.apache.org/) framework on top of LogCabin will make it immensely more useful for everyday production systems though.
In part that's because I don't see other Raft implementations as competitors. A major goal in Raft was to enable many implementations. It'd be a huge fail if everyone switched to LogCabin and abandoned all the other Raft implementations.
Really though, it's probably not a feature list that'll make people want to use LogCabin. In Scale Computing's case, it's that they have a C++ code base that they could integrate LogCabin well with, and they know they can work with and maintain the code if they need to.
Just the other day, I was looking for a Raft consensus implementation in C++ and was disappointed at not finding any that were production-ready (as opposed to Java and Go). This fits the need perfectly. Also, RAMCloud says it requires ZooKeeper. Do you have any idea of the timeframe for when RAMCloud will use LogCabin instead?
Good timing then, and I'd be happy to talk more about whether LogCabin is a good fit for your use.
RAMCloud used to depend on an earlier version of LogCabin (before the data model was a key-value tree), then John rewrote the RAMCloud end of that and switched it to using ZooKeeper, but he made it pluggable on the RAMCloud side. So we just need an "ExternalStorage" implementation for LogCabin in RAMCloud to make that work again. It should be relatively straightforward and I'm confident they'd welcome a well-written and tested patch, but no one has volunteered yet.
In raft how can a client know it is receiving stale, yet agreed upon, data? I'm talking about the edge case of having a network partition or some other condition like it.
Raft implementations can choose to implement reads in various ways. In LogCabin reads are linearizable, meaning that the results are current as of sometime after the read request was initiated, and it doesn't rely on bounded clock drift to make this guarantee. That's about the best you can do in terms of freshness in a distributed system. Other implementations might offer weaker read semantics or give the client the option. Check out 6.3-6.4 in my dissertation for a ton more detail: https://github.com/ongardie/dissertation#readme
I still don't understand what happens when a network partition creates a two-leader scenario. It is in a slide here http://thesecretlivesofdata.com/raft/ under "log replication" after creating a network partition that separates CDE from AB. Doesn't that mean a client of node B could read stale data? Or maybe an implementation like etcd would let the client choose to read or not in that case?
If the network is split into AB and CDE, then only CDE will be able to reach a quorum. If a client is on the AB side of that split, no consensus algorithm could guarantee freshness on a read. LogCabin clients will continually retry until they can talk to a functioning leader (and clients can put a timeout on that) so that you get a guaranteed fresh read every time; other systems could choose to return a stale result right away instead. It's up to the implementation, but I strongly recommend people not do stale reads by default.
Compaction in LogCabin uses a snapshotting approach. It writes a header to a snapshot file, then forks off a child process to write the data into a snapshot. The data is just each node in the tree serialized as a protobuf using a pre-order traversal, IIRC. Much more detail on compaction [1] mentioning LogCabin explicitly.
LogCabin does currently keep all data in memory, so it never touches disk for read requests. It also keeps a copy of the entire log in memory for now, but I hope to address that eventually [2].