Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I posted the following as a comment to the blog, I'll duplicate it here in case someone wants to discuss:

This program is so easy on the CPU that it should be entirely limited by memory bandwidth and the CPU should be pretty much idle. The theoretical upper limit ("speed of light") should be around 50 gigabytes per second for modern CPU and memory.

In order to get closer to the SOL figure, try adding hints for prefetching the data closer to the CPU. Use mmap and give the operating system hints to load the data from disk to memory using madvise and/or posix_fadvise. This should probably be done once per a big chunk (several megabytes) because the system calls are so expensive.

Then try to make sure that the data is as close to the CPU as possible, preferably in the first level of the cache hierarchy. This is done with prefetching instructions (the "streaming" part of SSE that everyone always forgets). For GCC/Clang, you could use __builtin_prefetch. This should be done for several cache lines ahead because the time to actually process the data should be next to nothing compared to fetching stuff from the caches.

Because this is limited on memory bandwidth, it should be possible to do some more computation for the same price. So while you're at it, you can compute the sum, the product, a CRC sum, a hash value (perhaps with several hash functions) at the same cost (if you count only time and exclude the power consumption of the CPU).



Yup, __builtin_prefetch certainly, if one wants to get it faster. (Compilers can't do this themselves, because they don't know how much they should prefetch ahead, and if you get that wrong it's worse than no prefetching.)

I am less sure about madvise. It seems to me default heuristics should work fine for this case, and as you said, system calls are expensive.


> I am less sure about madvise. It seems to me default heuristics should work fine for this case, and as you said, system calls are expensive.

System calls are expensive but so are page faults or having to access the disk. If you can avoid page faults by using madvise to prefetch from disk to memory, it should be worth it. In particular, the first run with cold caches should be faster.

However, the operating system may be smart and realize that we're doing a sequential access and may speculatively read ahead and madvise calls would be time wasted.

The same happens with CPU caches too, the CPU internal prefetcher is pretty good in recognizing a sequential access and grabbing the next cache line in advance. A few naively placed __builtin_prefetches doesn't seem to help here (I just tried this out).

Prefetching hints work a lot better in non-sequential access patterns (linked lists, etc).


Also, TLB misses. Also the reason why mmap(2) doesn't always beat read(2).


You know you're using a machine properly when you don't just blow the TLB. You blow the TLB for the TLB. I was pretty skeptical when my coworkers insisted my code did this, until I collected some great Intel performance counter data and, indeed, I blew the 2nd level TLB. Good read: http://www.realworldtech.com/haswell-cpu




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: