diego, thanks so much for raft! i'm a student in the brown class you shout out, ...

ongardie · on May 5, 2015

It's a good question, and I don't really know where the community as a whole sits on Byzantine vs non-Byzantine. A few thoughts:

Byzantine is more complex, and most people in industry aren't doing it: there are a lot of Byzantine papers out there but few real-world implementations. I think Byzantine is important for uses where the nodes really can't be trusted for security reasons, and maybe there are easier fault-tolerance payoffs elsewhere when the entire system is within one trust domain, such as a single company.

Byzantine consensus is slower and requires more servers: at least 3f+1 to tolerate f faulty servers, versus 2f+1 for crash faults.
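To put rough numbers on the server-count gap, here is a minimal sketch that assumes the standard bounds: 2f+1 servers for a crash-fault protocol such as Raft and 3f+1 for a PBFT-style Byzantine protocol, where f is the number of faulty servers tolerated. The function names are hypothetical and chosen only for illustration; they don't come from any Raft or PBFT implementation.

```python
def min_servers_crash(f: int) -> int:
    """Smallest cluster that tolerates f crash (non-Byzantine) faults: 2f + 1."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1


def min_servers_byzantine(f: int) -> int:
    """Smallest cluster that tolerates f Byzantine faults (PBFT-style): 3f + 1."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def extra_servers_for_byzantine(f: int) -> int:
    """Additional servers needed to move from crash to Byzantine tolerance."""
    return min_servers_byzantine(f) - min_servers_crash(f)


if __name__ == "__main__":
    # Compare cluster sizes for small values of f.
    for f in range(1, 4):
        print(f, min_servers_crash(f), min_servers_byzantine(f))
```

Under these assumptions the gap is exactly f extra servers, so it grows with the number of faults you want to survive. This only covers the server count; it says nothing about the extra message rounds and cryptography that make Byzantine protocols slower.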

If you don't have independent implementations running on each of your servers, the same software bug could still take out your entire cluster. You get some benefit if the hardware fails independently, but you don't get protection from correlated software outages. Maybe the difficulty in characterizing which faults a particular deployment can handle makes it harder to sell to management.

With Raft, we were just trying to solve non-Byzantine consensus in a way people could understand, and we think it's still a useful thing to study even if your ultimate goal is Byzantine consensus. You might be interested in Tangaroa, from the CS244b class at Stanford, where Christopher Copeland and Hongxia Zhong did some work towards a Byzantine version of Raft [1][2], and in Heidi Howard's blog post on it [3]. But really, Castro and Liskov's PBFT is a must-read here [4].

[1] http://www.scs.stanford.edu/14au-cs244b/labs/projects/copela...

[2] https://github.com/chrisnc/tangaroa

[3] http://hh360.user.srcf.net/blog/2015/04/conservative-electio...

[4] http://pmg.csail.mit.edu/papers/osdi99.pdf