Hacker News

If you had to implement the hash table every time you wanted to use one, you'd probably realize that there is a better alternative data structure. Quite often your data is not as randomly distributed as you may initially think it is.


One of the cool nuances of hash tables is that you can exploit the nonrandomness of your data to get a more uniform distribution of keys over hash buckets than you would get with an ideal random-oracle hash function. For example, it's very common to use hash functions which will produce four consecutive output values (possibly shuffled) for string keys "r0", "r1", "r2", and "r3", which guarantees that those keys will not collide with each other. Hashpjw is one such widely-used hash function; FNV is another. (Better not use linear probing with such a hash function, though! And it's probably better not to use them in cases where the keys might be chosen adversarially to produce a DoS.)


> Quite often your data is not as randomly distributed as you may initially think it is.

That's why it's important to pick the right hash function. A hash table usually goes a long way before it becomes a performance issue.


What?


I think the parent comment is highlighting that a hash table is just one implementation of a "Map" abstract data type. Depending on the use case, other implementations (such as a balanced search tree, which is what std::map in the C++ STL uses) may have better performance.


True, but as far as I know it's rarely the worst choice you can make, and it's plenty fast for a lot of applications. I don't know that there's a single data structure that's always better, which is what the GP seems to be implying.


The STL is always the wrong solution for performance.

set: the worst. unordered_set: the worst. vector: no small vector optimizations. string: horrible. array: use small_vector instead. deque: slow. list and forward_list: nobody uses that for performance.


Or for the example of word frequencies in the article, a prefix trie.


I tried a trie with data that I thought would work perfectly with it. There were a good number of shared prefixes. But it ended up performing worse than a hash map. I don't remember all the details now, but after seeing that I'll just keep reaching for hash maps first (unless I'm sure that there's only going to be a very small number of elements, in which case I'll just iterate in order over a vector).





