New ScyllaDB Go Driver: Faster Than GoCQL and Its Rust Counterpart

insanitybit · on Oct 13, 2022

Are there simple benchmarks that I can run for the Rust counterpart? I've worked a bit on the scylla rust code and I see plenty of room for improving efficiency (there's a lot of unnecessary allocation imo, and the hashing algorithm is 10x slower than it needs to be), but I don't want to make a PR for improvements without evidence.

> The big difference between our Rust and Go drivers comes from coalescing; however, even with this optimization disabled in the Go driver, it’s still a bit faster.

For anyone who's wondering, the Rust driver has coalescing support as of 9 days ago.

dorlaor · on Oct 13, 2022

My understanding is that the new Rust coalescing will make the situation on-par with the new go driver. However, in the second part of the blog there is a no-coalescing test where go is still faster and allocates less memory. I'm sure that the Rust driver can get there too

enedil · on Oct 13, 2022

I wonder why you got downvoted, the comments are on point.

Disclaimer: I work for ScyllaDB, although not on drivers. I can forward your question to relevant people.

insanitybit · on Oct 13, 2022

FWIW my company is a customer so we've already got a shared Slack/ account reps :P Feel free to reach out to colin@graplsecurity.com though (me) if you want to chat about it.

yitr · on Oct 13, 2022

fyi social links at the bottom of https://www.graplsecurity.com/ map to the wrong things (linkedin to discord, github to linkedin)

insanitybit · on Oct 14, 2022

Thanks

mvelbaum · on Oct 13, 2022

I wonder if Tokio is also a reason for worse performance compared to Go's concurrency runtime.

insanitybit · on Oct 13, 2022

There are probably a bunch of reasons, which is why I want an easy "run benchmarks" command that I can use. I'd even be fine using infra so long as I had pulumi/terraform to set it all up for me.

I just don't want to spin up EC2 instances manually, get the connections all working, make sure I can reset state, etc.

I already have a fork of Scylla where I removed a lot of unnecessary cloning of `String` but no way I'm gonna PR it without a benchmark.

I also opened a PR to replace the hash algorithm used in their PreparedStatement cache, which gets hit for every query, but they wanted benchmarks before accepting (completely fair) and I have none. `ahash` is extremely fast compared to Rust's default - https://github.com/tkaitchuck/ahash and with the `comptime` randomness (more than sufficient for the scylla use case) you can avoid a system call when creating the HashMap.

There are also some performance improvements I have in mind for the response parsing, among other things.

insanitybit · on Oct 13, 2022

So, I got sniped hard by this.

1. I've re-opened by hashing PR and I'm going to suggest that they adopt ahash as the default hasher in the future.

2. I've re-written my "reduce allocations" work as a POC. Another dev has done similar work to reduce allocations, we took different approaches to the same area of code. I'm going to try to push the conversation forward until we have a PR'able plan.

3. I'm going to push for a change that will remove multiple large allocations (of PreparedStatement) out of the query path.

4. Another two devs have started work on the response deserialization optimizations, which is awesome and means I don't have to even think about it.

I think we'll see really significant performance gains if all of these changes come in.

PoignardAzur · on Oct 13, 2022

> I just don't want to spin up EC2 instances manually, get the connections all working, make sure I can reset state, etc.

I've been thinking about this lately.

I wonder if we could standardize a benchmark format so that you could automatically do the steps of downloading the code, setting up a container (on your computer or in the cloud), running the benchmarks, producing an output file, and making a PR with the output.

So developers would go "here's my benchmark suite, but I've only tested it on my machine", and users would call "cargo bench --submit-results-in-pr" or whatever, and thus the benchmark would quickly get more samples.

(With graphs being auto-generated as more samples come in, based on some config files plus the bench samples)

insanitybit · on Oct 13, 2022

Interesting idea. I could imagine something like that but it's a bit tough.

ianpurton · on Oct 13, 2022

So would the ideal solution be if ScyllaDB had a github action to run benchmarks against PR's?

Not sure how decent a benchmark would be without running up servers in the cloud. So I guess provisioning infra would be a requirement?

So perhaps this could be run manually. But certainly possible

- Pulumi up infra - Run benchmarks - Collect results - Attach to PR.

insanitybit · on Oct 13, 2022

I'd be happy with a few things:

1. Benchmarks of "pure" code like the response parser, which I could `cargo bench`. I may actually work on contributing this.

2. Some way to run benchmarks against a deployed server. I wouldn't recommend a Github action necessarily, a nightly job or manual job would probably be a better use of money/resources. If I could plug in some AWS creds and have it do the deployment and spit out a bunch of metrics for me that'd be wonderful.

indiv0 · on Oct 13, 2022

I just did a comparison between almost every hashing algorithm I could find on crates.io. On my machine t1ha2 (under the t1ha crate) beat the pants off of every other algorithm. By like an order of magnitude. Others in the lead were blake3 (from the blake3 crate) and metrohash. Worth taking a look at those if you’re going for hash speed.

I don’t have the exact numbers on me right now but I can share them tomorrow (along with the benchmark code) if you’re interested.

insanitybit · on Oct 13, 2022

The PR I have lets you provide the algorithm as the caller, although I did benchmark against fxhash and I think it would be a good idea to suggest `ahash`. I'm certainly interested.

`ahash` has some good benchmarks here: https://github.com/tkaitchuck/aHash/blob/master/FAQ.md

virtualritz · on Oct 13, 2022

aHash claims it is faster than t1ha[1].

The t1ha crate also hasn't been updated in over three years so the benchmark in this link should be current.

[1] https://github.com/tkaitchuck/aHash/blob/master/compare/read...

Edit: if you really think tha1 is faster I would open an issue on the aHash repo to update their benchmark.

ComputerGuru · on Oct 13, 2022

FYI Small hashes beat better quality hashes for hash table purposes.

QuadDamaged · on Oct 13, 2022

Hi, do you know if there's a recent hash benchmark I can look into? I am using `FnvHash` as my go-to non-crypto-secure hash for performance reason, didn't realise there could be faster contenders.

Thanks!

llimllib · on Oct 13, 2022

This is the best, most comprehensive hash test suite I know of: https://github.com/rurban/smhasher/

you might want to particularly look into murmur, spooky, and metrohash. I'm not exactly sure of what the tradeoffs involved are, or what your need is, but that site should serve as a good starting point at least.

insanitybit · on Oct 13, 2022

These seem decent: https://github.com/tkaitchuck/aHash/blob/master/FAQ.md

raverbashing · on Oct 13, 2022

I wonder how many people are using 'async' just for the sake of it without a real need for it and shooting themselves in the foot while at it

pkolaczk · on Oct 13, 2022

Actually in this case async is the only way to get sane performance and both drivers deliver excellent performance thanks to async. I've been using Scylla Rust driver in my C* benchmarking project and it is an order of magnitude faster than the tools which use threads.

https://github.com/pkolaczk/latte

raverbashing · on Oct 13, 2022

Cool, good to know. I know threads are a limiting factor, but sometimes people jump into async while the problem is somewhere else

pkolaczk · on Oct 13, 2022

In this case each request is very tiny amount of work on the client, so waking up a thread to do that work just to immediately block waiting on the response from the server is very wasteful. With async you can send hundreds of requests in a simple loop, on a single thread. It's not only more efficient but also actually easier to write.

PeterCorless · on Oct 13, 2022

You might want to check out ScyllaDB Stress Orchestrator. Not sure of the current state of the code, but it's meant to do what you are talking about:

https://github.com/scylladb/scylla-stress-orchestrator/wiki/...

insanitybit · on Oct 14, 2022

Thanks, this looks like it could form the base of what I'd like.

lukeqsee · on Oct 13, 2022

ScyllaDB's obsession with performance by working closely with deep understanding of hardware and software and not simply adding more machines is really impressive.

They consistently demonstrate that we are under using our CPUs compared to potential.

nvarsj · on Oct 13, 2022

I think it’s a common principle of modern computing. We trade productivity for performance all the time. The idea being machines are fast enough it doesn’t matter. There are order of magnitude gains to be made at most levels in the modern stack - it’s just the effort required is immense.

rob74 · on Oct 13, 2022

The effort doesn't have to be immense though. I bet there are plenty of "low-hanging performance fruit" in most codebases, it's just that there's no real reward to pick them...

jerf · on Oct 13, 2022

My own experience backs this. I don't sit here obsessing about performance. In a database context that is appropriate but I'm not implementing databases. Nor do I prematurely optimize. I just give some thought to the matter occasionally, especially at the architecture level, and as a result I tend to produce systems that often surprise my fellow programmers at its performance. And I strongly assert that is not because I'm some sort of design genius... ask me about my design mistakes! I've got 'em. It's just that I try a little. My peer's intuitions are visibly tuned for systems where nobody even tried.

I'm not sure how much "low hanging" fruit there is, though. A lot of modern slowdown is architectural. Nobody sat down and thought about the flow through the system holistically at the design phase, and the system rolled out with a design that intrinsically depends on a synchronous network transaction every time the user types a key, or the code passes back and forth between three layers of architecture internally getting wrapped and unwrapped in intermediate objects a billion times per second (... loops are terrible magnifiers of architecture failures, a minor "oops" becomes a performance disaster when done a billion times...) when a better design could have just done the thing in one shot, etc. I think a lot of the things we have fundamental performance issues with are actually so hard to fix they all but require new programs to be written in a lot of cases.

Then again, there is also visibly a lot of code in the world that has simply never been run through a profiler, not even for fun (and it is so much fun to profile a code base that has never been profiled before, I highly recommend it, no sarcasm), and it's hard to get a statistically-significant sense of how much of the performance issues we face are low-hanging fruit and how much are bad architecture.

pornel · on Oct 13, 2022

This is a dare to get people rewrite their Rust driver for performance.

avgcorrection · on Oct 13, 2022

Go/Rust rivalry is a meme.

_wldu · on Oct 13, 2022

Every language dreams of being faster and safer than C and C++. Can't have it both ways.

yencabulator · on Oct 13, 2022

A different language can enable easy expression of designs that would be nightmare to maintain in C/C++: https://go.dev/talks/2013/oscon-dl.slide

psaux · on Oct 13, 2022

One of my fav companies! Dor is an amazing person, highly recommend working with them. We started to move to ScyllaDB at my last engineering job. Doing 100TB’s a day in IoT Data.

snapetom · on Oct 13, 2022

My last job was far less data than yours (more like 100's of GB/day), and it wasn't surprising it handled it with ease. I love that DB so much. So performant, and far easier to admin than Cassandra.

ZephyrBlu · on Oct 13, 2022

What is that amount of data used for?

Is it along the lines of “we want to collect all the data we can in case we want to use or analyze it at some point”, or are there real use cases?

PeterCorless · on Oct 13, 2022

For IIoT, there's digital twins, machine health, sensor data aggregation, and a lot more. While a lot of it can be classified as "write once, read hardly ever," it is vital for triggers, alarms, real-time failure and security alerts, and, eventually being forwarded over to analytics systems for longer-term trends.

habibur · on Oct 13, 2022

> We also paid close attention to proper memory management, producing as little garbage as possible.

The key.

And I was wondering how can a tracing GC outperform a non-tracing-GC memory manager.

chrisseaton · on Oct 13, 2022

Manual memory management means your memory management is tightly interleaved with application code, with both competing for the same limited cache space. A tracing GC batches memory management, using cache better and not as frequently evicting your application’s data out of cache.

A GC can also let you use more efficient concurrent data structures - many sophisticated concurrent objects require a tracing GC for implementing correctly - which can improve the performance of your application code.

tialaramex · on Oct 13, 2022

> many sophisticated concurrent objects require a tracing GC for implementing correctly

What sort of "sophisticated concurrent objects" are you thinking of?

chrisseaton · on Oct 13, 2022

Implementing many lock-free data structures without a GC is still a research topic, for example requiring conversion to use epoch allocation.

tialaramex · on Oct 13, 2022

This still seems very vague, I was looking for concrete examples where quite different arrangements like hazard pointers are for some reason impossible.

chrisseaton · on Oct 14, 2022

Hazard pointers, epoch reclamation yeah are both examples of tools to work around the lack of GC, but they require new work to figure out how to apply them and they harm your throughput which is probably why you wanted these data structures in the first place.

https://medium.com/@tylerneely/fear-and-loathing-in-lock-fre...

ruuda · on Oct 13, 2022

Here's one design that's nice to do for servers in GC'd functional languages, and hard to achieve without GC. Have the state of your application be a persistent data structure (one that you can efficiently update such that both the old and new version remain available, as opposed to in-place mutation). Then hold a "current state" pointer, that gets updated atomically. Endpoints that only read the state can read the pointer and then complete the request with a consistent view of the state, while writes can be serialized, build the new state in the background, and when it's done, swap the pointer to "publish" the new state. This way reads get a consistent view without blocking writes, and writes do not block reads. (Unlike an implementation where you mutate the state in-place, and you would need to protect it with a mutex.) It is possible to do this in non-gc'd languages too, but then persistent data structures are unwieldy, and cloning the full state for every update may be prohibitively expensive.

Lvl999Noob · on Oct 14, 2022

Do you mean this? https://github.com/jonhoo/left-right

I am not sure of the performance or implementation difficulty but the data structure seems to be what you are talking about.

ruuda · on Oct 14, 2022

Not exactly, but it looks similar. With what I was talking about you can have as many versions around as you need.

An example would be using (TVar (HashMap k v)) in Haskell.

yencabulator · on Oct 13, 2022

Even just read-copy-update without refcounts is very hard without GC. Linux kernel can do RCU without refcounts largely because it's in complete control of CPU cores and scheduling; userspace can't pull of the same tricks. Meanwhile, with GC, it's just https://pkg.go.dev/sync/atomic#Value

https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html

tialaramex · on Oct 13, 2022

> Even just read-copy-update without refcounts is very hard without GC

To me "it could be more difficult without" and "requires" are quite different claims, especially in the context of what's possible and why.

throwaway81523 · on Oct 13, 2022

> And I was wondering how can a tracing GC outperform a non-tracing GC memory manager.

The cliché is that malloc/free style memory management has to touch all the garbage in order to free it, while a semispace GC only has to copy the live data once in a while. The garbage is ignored.

pkolaczk · on Oct 13, 2022

However, when a tracing GC runs, it has to touch massive amounts of cold data and pushes hot data out of cache. Traditional malloc/free touches data in small chunks and freeing happens close to the last use, so when most of the data is still hot in cache. Stable, predictable performance is often more important than the peak performance.

masklinn · on Oct 13, 2022

Afaik Go uses a non-moving GC, so it can’t be a semi space.

forrestthewoods · on Oct 13, 2022

The key to garbage collection is… to go super far out of your way to avoid allocating memory. This is not ideal.

ekidd · on Oct 13, 2022

Speaking as someone who has spent time optimizing C++ and Rust, memory allocation in hot loops is often where performance goes to die. GC or not, if you want to fast, reducing allocation is one of the first things to benchmark.

(One fast way to manage allocations is to use an arena allocator that allocates memory by incrementing a pointer, and frees memory all at once. This is pretty effective for simple, short-lived requests.)

forrestthewoods · on Oct 13, 2022

Yes, minimize malloc in all cases. The difference is that GC languages are fundamentally designed around the concept that it’s cheap and easy to malloc/free. Avoiding allocations can be excruciatingly difficult.

In C++ you also need to minimize allocations, but it’s radically easier to do in C++ than in C#.

girvo · on Oct 13, 2022

I'm probably biased because I live in firmware development these days, but that's true even for non-garbage collected languages when it comes to making sure things are fast

metadat · on Oct 13, 2022

Clever ideas to optimize this baby. Nice work.

Hadn't heard of the pre-coalesce millisecond pile up technique.

Favorited, thank you, sincerely!

gpderetta · on Oct 13, 2022

> Hadn't heard of the pre-coalesce millisecond pile up technique.

This is basically Nagling and/or TCP_CORK right?

loosescrews · on Oct 13, 2022

This is similar to Nagle's algorithm (controllable on Linux with TCPNODELAY).

wejick · on Oct 13, 2022

Will this be compatible with other DB using CQL? Like Cassandra itself or Yugabyte for example.

PeterCorless · on Oct 13, 2022

Yes. ScyllaDB writes all of its drivers to be backward/generically compatible with other CQL-based databases like Cassandra, etc.

There are some specific features like shard-aware queries and shard-aware ports that naturally won't apply. But they will work.

usrusr · on Oct 13, 2022

Almost reads like a case where an initial implementation in rust forced a clear mental image of ownership that could then be transferred into another language much easier than it would have been to reach the same clarity outside of pedantic reign of the rust compiler.

philosopher1234 · on Oct 13, 2022

I think this is possible but I see no evidence in the article that would support this interpretation

bborud · on Oct 13, 2022

Somewhat unrelated observation: I have never looked at SchyllaDB so I went to the web page. In the most prominent space they take a dump on the competition. Normally that would be a red flag for me, but in this case it made me curious.

Now I want to know more. :-)

pjmlp · on Oct 13, 2022

Very good write up how performance can be improved without the typical rewrite in X.

erichocean · on Oct 13, 2022

And yet…ScyllaDB is famous for being a 10x faster rewrite of Cassandra (written in Java) in C++.

Your general comment is correct. I see it often with GPU algorithms which, no surprise, are also much faster on CPUs (using something like ISPC to compile them).

pjmlp · on Oct 13, 2022

A performance improvement that could have been obtained by only rewriting in C++ the critical paths and integrate them via JNI, instead of rewriting the world.

An approach that tends to be ignored by those rewrite X in Y.

seastarer · on Oct 13, 2022

ScyllaDB is a rearchitecting of Cassandra, not just a rewrite.

pjmlp · on Oct 13, 2022

The point stands.

hintymad · on Oct 13, 2022

Is DB driver a bottleneck in applications? Somehow I usually see bottlenecks in other places in a service, and the db bottlenecks are usually on the database-side instead on the driver side.

pengaru · on Oct 13, 2022

Sounds like GoCQL and its Rust counterpart are poorly implemented.

Sin2x · on Oct 13, 2022

How does it compare to Clickhouse regarding speed?

PeterCorless · on Oct 13, 2022

Clickhouse is a column store designed for analytics [OLAP] workloads. It would compete with, say, Apache Druid or Apache Pinot.

ScyllaDB is a wide column store which is, in fact, a row store; you can call it a "key-key-value," since it had a partitioning key and a clustering [or "sort" key]. Which is more for transactional workloads [OLTP]. So it is more comparable with Cassandra or DynamoDB.

So they are really designed for different sorts of things.

That being said, ScyllaDB has some features, like Workload Prioritization, so you can run analytics, like range or full table scans against it without hammering your incoming transactions. But it wasn't designed specifically for that.

Sin2x · on Oct 13, 2022

Thanks, for some reason I thought they aim at comparable usecases.

midislack · on Oct 13, 2022

Go d-driver? What the heck? What's this even mean?

neoyagami · on Oct 13, 2022

Is anyone using scyllaDB in production nowdays? I tried some years ago with dataloss and kinda spooked the bejesus out of me(and went to cassandra)

PeterCorless · on Oct 13, 2022

400+ companies. https://www.scylladb.com/users/

henrydark · on Oct 13, 2022

With all this work to remove allocations I get the feeling that what really is needed is C++ with Go's concurrency syntax and runtime

pjmlp · on Oct 13, 2022

It is a myth that C and C++ are free of such issues, it is always a matter of how much one cares about performance.

https://groups.google.com/a/chromium.org/g/chromium-dev/c/EU...

avgcorrection · on Oct 13, 2022

Why C++?