If you follow the email thread, you can see that the big math differences are based on libraries, not part of the compiler per se. Not to say they do not need improving.
It's not just the math libraries (although these are indeed much better - especially under Linux); its loop unrolling and vectorization using intrinsics are also much better than other compilers', in my experience.
However, no compiler I've found can get close to hand-crafted intrinsics in non-trivial cases.
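To make "hand-crafted intrinsics" concrete, here is a minimal sketch of the same dot product written as a plain loop and with SSE intrinsics. The function names `dot_scalar` and `dot_sse` are made up for illustration, and a case this trivial is exactly the kind a good compiler may vectorize on its own; the gap tends to show up in messier kernels.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE/SSE2 intrinsics */
#endif

/* Plain scalar loop: the compiler is free to unroll or auto-vectorize it. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-written version (hypothetical name): four floats per iteration,
 * then a scalar loop for the leftover elements. Falls back to the
 * scalar loop when SSE2 is not available at compile time. */
float dot_sse(const float *a, const float *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
#if defined(__SSE2__)
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned loads */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    /* Horizontal sum of the four accumulator lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
    for (; i < n; i++)                     /* scalar tail */
        sum += a[i] * b[i];
    return sum;
#else
    return dot_scalar(a, b, n);
#endif
}
```

Note that the two versions add the products in a different order, so on general floating-point input the results can differ in the last bits.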
I've heard Intel's compiler makes optimisations that best favour the specifics of Intel CPUs. All other compilers are likely to make CPU-neutral optimisations. Do you think the results would be much different if run on an AMD chip?
Last I heard (according to Agner), Intel's compiler still intentionally incorrectly detects AMD's CPUs and throws them to the CPU-generic code. x264 has a hacked loader function (borrowed from Agner) to avoid this, though I don't know if the test used it.
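For anyone unfamiliar with the distinction, here is a rough sketch of vendor-gated versus feature-gated dispatch. I don't know how Intel's dispatcher is actually implemented, and this is not x264's loader; the function names are hypothetical, and the code assumes GCC or Clang on x86 (it uses `__get_cpuid` from `<cpuid.h>`).

```c
#include <assert.h>
#include <string.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>      /* GCC/Clang helper for the CPUID instruction */
#define HAVE_CPUID 1
#else
#define HAVE_CPUID 0
#endif

/* Copy the 12-character vendor string (e.g. "GenuineIntel",
 * "AuthenticAMD") into out. Returns 1 on success, 0 otherwise. */
int cpu_vendor(char out[13])
{
    assert(out != NULL);
    out[0] = '\0';
#if HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    /* CPUID leaf 0 returns the vendor string in EBX, EDX, ECX order. */
    memcpy(out + 0, &ebx, 4);
    memcpy(out + 4, &edx, 4);
    memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
    return 1;
#else
    return 0;
#endif
}

/* CPUID leaf 1, EDX bit 26 reports SSE2 support. */
int cpu_has_sse2(void)
{
#if HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 26) & 1u);
#else
    return 0;
#endif
}

/* Vendor-gated: a non-Intel CPU takes the generic path even if it
 * supports SSE2. This is the behaviour being complained about. */
int use_fast_path_by_vendor(void)
{
    char v[13];
    return cpu_vendor(v) && strcmp(v, "GenuineIntel") == 0 && cpu_has_sse2();
}

/* Feature-gated: any CPU that reports SSE2 takes the fast path. */
int use_fast_path_by_feature(void)
{
    return cpu_has_sse2();
}
```

On an AMD chip with SSE2, the first check would return 0 and the second 1, which is the whole difference.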
Now, I'm not sure how much this would actually affect. The autovectorization in Intel's compiler is weak and, at least on Win64, SSE2 is allowed in normal code without CPU dispatching. It does affect library functions like math and memcpy, but that'll only matter if your program spends a ton of time in them.
Wasn't there some legal ruling about that? I found some discussions [1] about it. The section about CPU vendor string manipulation is very interesting.
He should have disabled the "Turbo" setting on the processor; otherwise the results will be somewhat randomized. It can take a while for the processor to switch into turbo mode, and the decision to do so can be based on other work being done. Additionally, prior thermal load can result in turbo being disabled until temperatures return to normal levels.
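If turbo can't be switched off, one partial workaround is to do a few untimed warm-up runs and then report the spread over several timed runs instead of a single number. The sketch below is only an illustration: `bench`, `summarize` and `bench_stats` are made-up names, and it uses the standard C `clock()`, which measures CPU time rather than wall time.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

typedef struct { double min, median, max; } bench_stats;

static int cmp_double(const void *x, const void *y)
{
    double a = *(const double *)x, b = *(const double *)y;
    return (a > b) - (a < b);
}

/* Reduce per-run timings to min/median/max. Sorts `secs` in place. */
bench_stats summarize(double *secs, size_t runs)
{
    assert(secs != NULL && runs > 0);
    qsort(secs, runs, sizeof secs[0], cmp_double);
    bench_stats s;
    s.min = secs[0];
    s.max = secs[runs - 1];
    s.median = (runs % 2) ? secs[runs / 2]
                          : 0.5 * (secs[runs / 2 - 1] + secs[runs / 2]);
    return s;
}

/* Call `work` `warmup` times untimed so the clock frequency has a
 * chance to settle, then time `runs` further calls into `secs`. */
bench_stats bench(void (*work)(void), size_t warmup,
                  double *secs, size_t runs)
{
    assert(work != NULL);
    for (size_t i = 0; i < warmup; i++)
        work();
    for (size_t i = 0; i < runs; i++) {
        clock_t t0 = clock();
        work();
        secs[i] = (double)(clock() - t0) / CLOCKS_PER_SEC;
    }
    return summarize(secs, runs);
}
```

A large gap between min and max is a hint that frequency scaling or thermal throttling got in the way. This does not replace disabling turbo; it only makes the noise visible.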
Additionally, x264 should probably be categorized under "no significant floating point calculations".