The author is referring to false sharing (https://en.wikipedia.org/wiki/False_sharing). CPU caches operate at cache line granularity (typically 64 bytes) so writes to one part of the cache line can require synchronization with writes to non-overlapping parts of the same cache line. This can dramatically reduce performance when there are a large number of cores operating on the same cache line.
If you remove the 64 byte alignment (which forces each counter variable onto a separate cache line) from hitcounter-shard.c you ought to be able to see the performance difference for yourself.
They very well can be. The CPU mutexes use some form of cache-coherency mechanism based on the MESI protocol - essentially each CPU keeps track of which other cores have a certain piece of memory in their cache, and if any of them modified it.
You can do better than that, for example, I remember reading an article about some game engine, which had a static scheduling of tasks on each frame, so, for example there was no overlap in time between thread A and thread B referencing the same piece of memory. The whole scheme worked the exact same every frame, it was truly free, and there was a zero percent chance of memory inconsistencies.
The "CPU mutex" is just the cache coherency mechanism. If you shard your data to avoid triggering it as suggested, then yes, it's much faster.
EDIT: or maybe you're asking if introducing an explicit userspace mutex is better than a lockless algorithm with false sharing issues. The answer is that it's workload dependent but it definitely can be.
OP > The issue is this will likely go just as slow if not slower. The mere act of sharing the same 64-byte region of memory (a.k.a. cacheline) between multiple cores, causes the CPU internally to basically use a mutex, and chances are the CPU's internal mutexes aren't as good as the ones you've implemented in userspace.
The claim by OP is that "chances are" that userspace mutexes are better than CPU's internal mutexes. So either h/w guys are (for a first) lagging s/w folks and using outdated approaches to creating a mutex in hardware, OR, we somehow must use an inferior approach when implementing a mutex in a CPU, OR, ..
How is it possible that a hardware implementation of an algorithm could be slower than its software variant, and that in "userspace" and not even the kernel.
If you have two threads on different cores that write to the same cacheline, the CPU has to enforce write ordering. The way it does this is for one of the cores to acquire a write lock on the cacheline. The actual implementation of this is via some variant of the MESI protocol that results in the cacheline on one core going to an Exclusive state, while copies of the cacheline on every other core becomes Invalid.
MESI takes a non-trivial amount of time to run. It's usually mediated at the L3 cache, which is frequently clocked slower than the main core, so it's a non-trivial number of cycles for a MESI state transition to happen (think at least ~40 cycles).
Compare that with a cacheline that's already Exclusive on a core: Writing to this cacheline is a pure L1 cache write at ~1 cycle, no MESI involved, no L3 cache involved.
Note that the typical userspace space mutex (some memory location that's modified by compare-exchange) is implicitly relying on MESI whenever two cores race for the lock, or whenever a lock is release on one core and then taken on another, etc etc.
So if your userspace coherency can be handled via some sharding that avoids needing MESI to run, it can go much faster than relying on the CPU's "internal mutex" aka MESI.
To directly address "is it possible that a hardware implementation of an algorithm could be slower than its software variant": The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI. Instead of having memory that's accessed for a bunch of different cores, there's a chunk of memory per core, and each core almost always writes only to it's allocated chunk of memory.
> The point is that with RSEQ, you can write a _different_ algorithm that's faster than an implementation relying on hardware mediated MESI.
Even without rseq(2) you're looking at a "different algorithm". For one, a proper userspace mutex allows the excluded threads to be descheduled and the core can be repurposed to run other tasks, rather than hammering away at the contended cache line.
The cache coherency protocols that sit between the CPUs and DRAM always essentially "use a mutex": when a cpu wants to write it broadcasts to all the other CPUs and either gets the latest copy from whoever wrote it last and shoots down any read-only copies in other CPUs or reads it from DRAM (or converts a read only copy to be writeable)
This happens on every memory access, so the thing you want to avoid is ping ponging writeable cache lines between CPUs (especially before you have a chance to actually write it) - LL/SC instructions sit on top of these protocols and allow instructions to tell is a cache line had been "stolen" before you have a chance to write it
I'm not a systems guy, but you're probably right. A mutex requires an atomic variable which triggers the same cache coherency mechanism under contention.
What Justine probably meant was an apples to oranges comparison: in userspace, you can sometimes add constraints that allow for a faster algorithm whereas the cpu generally has to assume the worst.
That being said, if you look through this thread you'll find similar issues with OP's phrasing. A fine engineer / hacker, but too much 4chan troll in my book.
Anyone with an informed opinion on this statement? It's seems counter intuitive (npi).