This would be great, and is greatly needed. But I have a question for the performance junkies here:
If the cost of atomically setting bits in the mark bitmap turns out to be
high, we could instead dedicate a byte per object for the mark. This idea is
mentioned in GC literature [Garner, 2007]. Obviously, this involves an 8×
increase in memory overhead. It’s likely that on modern hardware, the
cost of the atomic bit operation is small, while the cost of increasing
the cache footprint of the mark structure is probably large.
I have been trained to think of an atomic operation at the CPU level as flushing the cache line, possibly out of L1 and beyond, impacting all cores sharing that cache. This statement implies the opposite. Can anyone comment on this vis-à-vis a modern Intel chip?
This sounds wrong to me, too. For the single-core case, there's no such thing as an atomic byte store: the underlying mechanism reads a cache line and modifies it. The same goes for a bit operation (possibly needing more instructions).
For the multi-core case, there's always going to be cache-line thrashing regardless of mark size. Cache lines will evict and transfer between cores as each accesses them.
Probably best to go with a single bit, unless it can be shown that the overhead of generating a bit mask and doing the read-modify-write is far more expensive.
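To make the trade-off concrete, here's a minimal sketch of the two mark schemes being debated. The type and method names are hypothetical, not from any real runtime: the bitmap path needs index math, a mask, and a CAS loop to make the read-modify-write safe against concurrent markers, while the bytemap path is a single store (which is atomic at the hardware level on x86/ARM, though a plain store like this would still be a data race under Go's memory model if raced).

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical mark structures, for illustration only.
type bitmap []uint32
type bytemap []uint8

// markBit sets bit i atomically: compute word and mask, then
// CAS-loop the read-modify-write until it sticks.
func (b bitmap) markBit(i uint) {
	word := &b[i/32]
	mask := uint32(1) << (i % 32)
	for {
		old := atomic.LoadUint32(word)
		if old&mask != 0 || atomic.CompareAndSwapUint32(word, old, old|mask) {
			return
		}
	}
}

// markByte is one plain store: no mask generation, no CAS loop,
// at the cost of 8x the space.
func (m bytemap) markByte(i uint) {
	m[i] = 1
}

func main() {
	b := make(bitmap, 2)
	b.markBit(5)
	b.markBit(37)
	m := make(bytemap, 64)
	m.markByte(5)
	fmt.Printf("%#x %#x %d\n", b[0], b[1], m[5]) // prints 0x20 0x20 1
}
```

Either way, both paths end up doing a full cache-line read and write at the hardware level; the bitmap just adds the mask bookkeeping and the retry loop on top.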
An alternative to side bitmaps is a bytemap, a byte-grained data structure on the side. Bytemaps trade an 8× space overhead to avoid synchronizing on each set operation (if we assume support for atomic byte-grained stores).
I think the idea is that, since a bytemap is less dense, two cores less often touch the same cache lines, leading to fewer invalidations. Though I wonder if it pays off, since it also means less locality.
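The density intuition can be put in numbers. Assuming a 64-byte cache line and 16-byte objects (both just illustrative figures), one cache line of bitmap holds the marks for 512 objects, so it covers 8 KiB of heap; one cache line of bytemap holds only 64 marks, covering 1 KiB. Two cores contend on a mark line only when they mark objects within the same covered region, so the bytemap shrinks that sharing window 8-fold:

```go
package main

import "fmt"

func main() {
	const cacheLine = 64 // bytes per cache line, typical on x86
	const objSize = 16   // assumed small-object size in bytes

	bitmapMarks := cacheLine * 8 // one mark *bit* per object
	bytemapMarks := cacheLine    // one mark *byte* per object

	// Heap region covered by a single cache line of mark metadata:
	fmt.Println(bitmapMarks * objSize)  // prints 8192 (8 KiB per line)
	fmt.Println(bytemapMarks * objSize) // prints 1024 (1 KiB per line)
}
```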
No, it is referring to the fact that the smallest unit that can be written to memory isn't a bit but a byte. To write a bit, the whole byte must be read, modified, and then written back.
Isn't the atomic part of that sentence referring to the fact that they are changing bits separately? That's a bit inefficient, since most architectures simply don't allow it: to set a single bit, the writer must read the whole byte, modify it, and write the whole byte back. This is slower than simply writing the whole byte.
That is, I don't think they are referring to the same concept of atomicity as you are.
The point is that atomicity doesn't work like this on x86, ARM, and most other architectures. It's the right conclusion from the wrong analysis: the difference in speed will be the time spent managing a packed bitmap, not the atomicity itself being any faster.
From the diagram, it looks like the spans point to arbitrary offsets within the mark bitmaps. Does that mean there's a lookup table to map from a span to its mark bitmap/offset?