I posted the following as a comment on the blog; I'll duplicate it here in case ...

sanxiyn · on May 13, 2014

Yup, __builtin_prefetch certainly, if one wants to get it faster. (Compilers can't do this themselves, because they don't know how much they should prefetch ahead, and if you get that wrong it's worse than no prefetching.)

I am less sure about madvise. It seems to me default heuristics should work fine for this case, and as you said, system calls are expensive.

exDM69 · on May 13, 2014

> I am less sure about madvise. It seems to me default heuristics should work fine for this case, and as you said, system calls are expensive.

System calls are expensive but so are page faults or having to access the disk. If you can avoid page faults by using madvise to prefetch from disk to memory, it should be worth it. In particular, the first run with cold caches should be faster.

However, the operating system may be smart enough to recognize the sequential access pattern and read ahead speculatively, in which case the madvise calls would be wasted time.
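To make the madvise idea concrete, here is a minimal sketch, assuming Linux (or another POSIX system that provides madvise) and a file that is read once through mmap. The helper name sum_file_with_madvise is hypothetical and the byte sum is just a stand-in for real work. The hints are advisory only, so whether they beat the kernel's default readahead is something to measure, not assume.

```c
#define _DEFAULT_SOURCE /* exposes madvise() under strict -std= modes on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file read-only, hint the kernel about the
 * access pattern, and sum its bytes. Returns 0 on success, -1 on failure. */
int sum_file_with_madvise(const char *path, uint64_t *out_sum)
{
    assert(path != NULL && out_sum != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;

    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;

    /* Advisory hints: failures are ignored because the loop below is
     * correct either way. This costs two system calls per mapping. */
    (void)madvise(p, len, MADV_SEQUENTIAL); /* expect a front-to-back scan */
    (void)madvise(p, len, MADV_WILLNEED);   /* ask for the pages to be read in soon */

    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];

    munmap(p, len);
    *out_sum = sum;
    return 0;
}
```

To compare fairly, time it with and without the two madvise lines, on a cold page cache as well as a warm one.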

The same happens with CPU caches: the CPU's hardware prefetcher is pretty good at recognizing sequential access and fetching the next cache line in advance. A few naively placed __builtin_prefetch calls don't seem to help here (I just tried this out).

Prefetching hints work a lot better in non-sequential access patterns (linked lists, etc).
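As a rough illustration of the non-sequential case, here is a sketch that uses an index array instead of a linked list, since there the future addresses are known a few iterations ahead. It assumes GCC or Clang for __builtin_prefetch. The name gather_sum and the PREFETCH_DISTANCE of 16 are made up for the example; the right distance depends on the machine and the per-element work, and picking it badly can be worse than no prefetching at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed look-ahead distance: a guess for illustration, tune by measurement. */
#define PREFETCH_DISTANCE 16

/* Hypothetical helper: sum table[idx[i]] for i in [0, n). The loads from
 * table jump around, so the hardware prefetcher cannot predict them, but
 * idx itself is read sequentially and tells us the addresses in advance. */
uint64_t gather_sum(const uint32_t *table, const size_t *idx, size_t n)
{
    assert(table != NULL && idx != NULL);

    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0 = prefetch for read,
             * third argument 0 = no expected temporal locality. */
            __builtin_prefetch(&table[idx[i + PREFETCH_DISTANCE]], 0, 0);
        }
        sum += table[idx[i]];
    }
    return sum;
}
```

With a true linked list the next address is only known once the current node has been loaded, so a hint there can only overlap with whatever work is done per node.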

fulafel · on May 13, 2014

Also, TLB misses. They are also the reason why mmap(2) doesn't always beat read(2).

dekhn · on May 13, 2014

You know you're using a machine properly when you don't just blow the TLB. You blow the TLB for the TLB. I was pretty skeptical when my coworkers insisted my code did this, until I collected some great Intel performance counter data and, indeed, I blew the 2nd level TLB. Good read: http://www.realworldtech.com/haswell-cpu