I got disappointed at the beginning. Not only folding string checking into a tree of letter by letter checks doesn't buy you performance with 4 letters, but also if you bring that as an example of performance improvement in a server you're just showing your ignorance.
Great speed improvement:
switch (*((uint32*)m)) {
case 0x54534f50: r->method = NGX_HTTP_POST;
Does it matter now if I spend `O(no_reqests^2)` time looking for the next connection to serve?
So, how long does it take to identify a verb on a 2.66GHz Core2 MacBook Pro…
FOURCC 4.9ns // The parent comment code
FOURCC-without-alignment-bug 6.1ns // ... safely assemble the word
NGINX 10.3ns // main ariticle's code
STRCMP 17.6ns // most commen verbs first
The FOURCC codes as in the parent only work if you can assume you know the character after the verb to be a space or other known character because GET and PUT are too short. They also require disambiguating code for the 5+ letter verbs but these are all relatively rare (probably).
My strings are all 4 byte aligned, so the FOURCC case may be getting an unfair boost, it never generates an unaligned read which possibly is slower.
Test data is a set of jumbled GET,HEAD,POST,PUT lookups that I made up to meet my idea of a web server.
There are some worst cases hiding out there, for instance all PROPPATCH verbs…
Feels like gcc sequentially tests case values. Make sure to put them in the order for common verbs to match first. The FOURCC codes get the most control here. NGINX is most consistent, probably from its two layered lookup nature.
So there you have it: nginx's algorithm is significantly better then strcmp, and fairly independent of verb distribution. A FOURCC lookup is faster, even if made safe for all architectures. And finally, strcmp isn't that bad (freebsd anyway, the GNU libc is about half as fast as seen in further testing).
(I am left with a nagging desire to check a perfect hashing function though.)
Edit: gperf is about 50% slower than the nginx method. No need to go that direction for perfect hashing.
I'm not sure if you were just curious, or were you trying to prove that nginx is right in optimising this. But since we have the figures - assuming strcmp takes 37ns, we could estimate that wikipedia spent whole 13 seconds parsing the request verb each day in 2008 (assuming 350M visits). Cutting it down by half would definitely be worth the programming time... not.
I was mostly just calbrating my opimization decision making.
I think only considering wikipedia visits is underestimating. A visit is many http transactions, and a webserver is targetted for many times the traffic of wikipedia. Still, it is likely no one ever felt a better user exprience from this one optimization.
But what if all the code is 33% faster than strcmp style code? That would be nice, but also seems unlikely.
My biggest two takeaways:
— beware your OS. The freebsd (mac) strcmp is about twice as fast as the GNU glibc.
— forget about gnu's gperf, it is extra complexity for no benefit. Maybe it would win on very long word lists, but I don't know.
So what about other architectures? How about a 1.6GHz Atom 330 processor. This doesn't have out of order execution and should more closely match ARM and MIPS type small machines. (Sorry, I can't get one of those booted from where I am now.)
Not bad for the little guy. (Saved $100 on the cpu, plus didn't need to add a video card with some goofey driver, and its system wide low power state is almost as good at the Atom's.)
Careful with casting strings to ints like that, it'll break on architectures that require aligned accesses, i.e. if you try to load a uint32, the address of the memory location must be a multiple of 4. x86 happens to be fine with non-aligned access, but many other architectures aren't.
If you look at the nginx source, that #define is guarded under NGX_HAVE_NONALIGNED and there is a fallback that does straight byte by byte comparison.
> fallback that does straight byte by byte comparison
That's not even funny anymore... I wonder how many iterations will it take, until they discover that strncmp compiles down to "repz cmpsb" (or equivalent) and you really cannot do better than that with if-s on each character.
The CPU has an instruction cache (it reaches out to memory for the series of instructions and grabs a whole bunch at once to be more efficient). When there's a branch (logically, an if statement, a loop, or a goto, or a switch) it has to make a decision about whether or not it should cache the instructions in the if statement or the instructions following the if statement.
Within the linux kernel the macros likely and unlikely are defined so that you can have code of the style
if(likely(x > 10)) {
/* These instructions, which probably will be executed,
* will be cached
*/
}
if(unlikely(ptr == NULL)) {
/* These instructions, which probably won't need to be
* executed, will not be cached
*/
}
Great speed improvement:
Does it matter now if I spend `O(no_reqests^2)` time looking for the next connection to serve?