
The advantage of tombstones is that it is dead simple to implement. Backshift removals can cause a lot of memory shuffling if you are removing an element from a very congested part of the hash table.
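To illustrate how simple tombstones are, here is a minimal sketch of an open-addressing table with tombstone deletion (integer keys, toy fixed capacity; all names and the layout are illustrative assumptions, not any particular library's code):

```c
#include <assert.h>

/* Slot states for open addressing with tombstones (illustrative toy). */
enum { EMPTY, USED, DELETED };

#define CAP 8  /* power of two for cheap masking; toy size */
static struct { int key; int state; } tab[CAP];

static unsigned home(int key) { return (unsigned)key & (CAP - 1); }

/* Probing must skip DELETED slots but may stop at EMPTY ones. */
static int find(int key) {
    unsigned i = home(key);
    while (tab[i].state != EMPTY) {
        if (tab[i].state == USED && tab[i].key == key) return (int)i;
        i = (i + 1) & (CAP - 1);
    }
    return -1;
}

static void insert(int key) {
    unsigned i = home(key);
    while (tab[i].state == USED) i = (i + 1) & (CAP - 1);
    tab[i].key = key;
    tab[i].state = USED;
}

/* The entire deletion: just flip the state. */
static void erase(int key) {
    int i = find(key);
    if (i >= 0) tab[i].state = DELETED;
}
```

Note that `erase` never moves any entry; the cost is that every later search must probe past the dead slot.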


Backshift has two potential disadvantages: 1) copying objects may be slower than testing equality (case dependent); 2) each deletion involves the entire probe chain (with tombstones, you inspect half of the probe chain on average). On the other hand, with tombstones, each probe chain is longer. This slows down searching and insertion as well as deletion. Tombstones also increase the frequency of rehashing.

> The advantage of tombstones is that it is dead simple to implement.

Backshift is harder to understand, but it actually takes fewer LOC to implement in my hands. That is because insertion and searching become simpler (and slightly faster) with backshift.

> Backshift removals can cause a lot of memory shuffling

Not sure what you mean by memory shuffling. Backshift is applied to linear probing only. You don't jump around except when you wrap around the end of the bucket array, which is rare.
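For concreteness, a minimal sketch of backshift deletion under linear probing (integer keys, toy fixed capacity; names are made up, and this is a generic illustration rather than the code discussed above):

```c
#include <assert.h>
#include <stdbool.h>

#define CAP 8  /* power of two for cheap masking; toy size */

typedef struct { int key; bool used; } Slot;
static Slot table[CAP];

static unsigned home(int key) { return (unsigned)key & (CAP - 1); }

/* Linear-probe insert; assumes the table never fills up. */
static void insert(int key) {
    unsigned i = home(key);
    while (table[i].used) i = (i + 1) & (CAP - 1);
    table[i].key = key;
    table[i].used = true;
}

static int find(int key) {  /* returns slot index or -1 */
    unsigned i = home(key);
    while (table[i].used) {
        if (table[i].key == key) return (int)i;
        i = (i + 1) & (CAP - 1);
    }
    return -1;
}

/* Backshift deletion: fill the hole by sliding later entries of the
 * probe chain backward when their home slot permits, so no tombstone
 * is ever left behind. */
static void erase(int key) {
    int s = find(key);
    if (s < 0) return;
    unsigned i = (unsigned)s, j = i;
    for (;;) {
        j = (j + 1) & (CAP - 1);
        if (!table[j].used) break;        /* end of probe chain */
        unsigned k = home(table[j].key);  /* home slot of the candidate */
        /* Move table[j] into the hole unless its home lies cyclically
         * in (i, j], in which case shifting it would break its chain. */
        if (i <= j ? (k <= i || k > j) : (k <= i && k > j)) {
            table[i] = table[j];
            i = j;
        }
    }
    table[i].used = false;
}
```

As the thread says, the moves only walk forward through adjacent slots, wrapping around the array at most once.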


> Backshift has two potential disadvantages: 1) copying objects may be slower than testing equality (case dependent);

I would wager that is true in most cases.

> This slows down searching and insertion as well as deletion. Tombstones also increase the frequency of rehashing.

Yes, but you could argue that is an advantage; the hash table reaches its "natural size" faster.

> Backshift is harder to understand, but it actually takes fewer LOC to implement in my hands. That is because insertion and searching become simpler (and slightly faster) with backshift.

I looked at your code and, afaict, it only supports linear probing. Backshift removals would be much harder or maybe even impossible to implement with quadratic probing or double hashing.


> I would wager that is true in most cases.

Strings. In C, comparing two strings involves a function call, a loop and a potential cache miss. Copying a string just moves a pointer, which is much cheaper. The additional copying costs little for strings.

> the hash table reaches its "natural size" faster.

Not sure what you mean by this. The "natural size" is the size without tombstones. Clearly backshift is the better one.

> it only supports linear probing

Yes, backshift only supports linear probing so far as I know. Most high-performance hash table libraries use linear probing these days. A few use quadratic probing. I rarely see double hashing.

Anyway, backshift always uses less memory and is always faster for insertion and searching. Under certain deletion loads, backshift may be slower due to 1) and 2), but the difference is fairly small considering the savings in memory.


> Strings. In C, comparing two strings involves a function call, a loop and a potential cache miss. Copying a string just moves a pointer, which is much cheaper. The additional copying costs little for strings.

A better approach would be to store the string's hashcode somewhere and only perform the element-wise comparison when the hashcodes match. For most distinct strings, the hashcodes are unlikely to match.
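A sketch of that idea: keep the hash next to the pointer and compare hashes before falling back to `strcmp`. All names here are illustrative, and `djb2` is just one simple hash chosen for the example:

```c
#include <string.h>

/* Store each string's hash alongside the pointer; compare hashes
 * first and call strcmp only on a hash match. */
typedef struct { const char *s; unsigned long h; } HStr;

static unsigned long djb2(const char *s) {  /* simple illustrative hash */
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static HStr hstr(const char *s) { HStr x = { s, djb2(s) }; return x; }

/* Equal only if the cached hashes match AND the bytes match; unequal
 * strings almost always fail the cheap hash test first. */
static int hstr_eq(HStr a, HStr b) {
    return a.h == b.h && strcmp(a.s, b.s) == 0;
}
```

The hash is typically already computed during probing, so caching it per entry costs one extra word per slot.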

> Not sure what you mean by this. The "natural size" is the size without tombstones. Clearly backshift is the better one.

You can count the fraction of the table consumed by tombstones and grow it a little earlier if that fraction is significant.
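One way to express that rule: count tombstones toward the load factor, so a tombstone-heavy table rehashes (or compacts) earlier than its live count alone would require. The 75% threshold here is an illustrative assumption, not a universal constant:

```c
#include <stddef.h>

/* Trigger a rehash once live entries plus tombstones exceed 3/4 of
 * capacity. On rehash, a table whose live count is still small can be
 * rebuilt at the same capacity, which simply sweeps the tombstones out. */
static int should_rehash(size_t live, size_t tombstones, size_t cap) {
    return (live + tombstones) * 4 > cap * 3;
}
```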

> Yes, backshift only supports linear probing so far as I know. Most high-performance hash table libraries use linear probing these days. A few use quadratic probing. I rarely see double hashing.

Yes, but you are trading amazing performance in the average case against awful performance in the degenerate case. Poor hash functions or adversarial input could cause congestion in parts of the hash table.


> Yes, but you are trading amazing performance in the average case against awful performance in the degenerate case. Poor hash functions or adversarial input could cause congestion in parts of the hash table.

It is not just me. As I said, most high-performance hash tables these days (absl, F14, and all Robin Hood implementations) use linear probing. Providing a proper default hash function avoids most worst cases in practice. In addition, the hash library can use a secondary hash function [1], which makes the library much more robust to bad hash functions such as the identity hash function.

[1] https://attractivechaos.wordpress.com/2018/10/01/advanced-te...
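One common form of such a secondary hash is a multiplicative mix applied on top of the user-supplied hash, so that even a weak hash (e.g. the identity function) spreads keys across the table. The sketch below uses the 64-bit Fibonacci constant (2^64 divided by the golden ratio); this is a generic illustration, not necessarily the exact scheme in the linked post or in any particular library:

```c
#include <stdint.h>

/* Scramble the user hash so that consecutive or low-entropy hashes
 * still map to well-spread slots. The multiplier is 2^64/phi
 * (Fibonacci hashing); the xor-shift folds high bits into low bits. */
static uint64_t scramble(uint64_t h) {
    h *= 11400714819323198485ull;
    return h ^ (h >> 31);
}

/* Map the scrambled hash to a power-of-two table of size 1 << bits,
 * taking the top bits, which the multiplication mixes best. */
static uint64_t slot_of(uint64_t h, unsigned bits) {
    return scramble(h) >> (64 - bits);
}
```

With this in place, an identity hash over consecutive integer keys no longer produces one long probe chain.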


Since you need to do collision counting for security reasons anyway, it's trivial to check the backshift threshold while searching for your key. If you have too many tombstones in your chain, backshift.


If you are using a load factor around 75%, it means you'll do at most 3 shifts. On average it is less than that, since you are ordering your map on each removal, and the only case where you'll do backshifts is when your items are not ordered according to the hash function.

Tombstones are silly: if you add and remove 75% of your map's capacity, you then have to rehash your map even though it is empty. I can't think of a single scenario where tombstones perform better. They cause much more probing and many more cache misses than backshift deletion.



