Emiller's Balls-Out Guide to Nginx Module Development

viraptor · on March 27, 2010

I got disappointed at the beginning. Not only folding string checking into a tree of letter by letter checks doesn't buy you performance with 4 letters, but also if you bring that as an example of performance improvement in a server you're just showing your ignorance.

Great speed improvement:

    switch (*((uint32*)m)) {
      case 0x54534f50: r->method = NGX_HTTP_POST;

Does it matter now if I spend `O(no_reqests^2)` time looking for the next connection to serve?

jws · on March 27, 2010

So, how long does it take to identify a verb on a 2.66GHz Core2 MacBook Pro…

  FOURCC                         4.9ns   // The parent comment code
  FOURCC-without-alignment-bug   6.1ns   // ... safely assemble the word
  NGINX                         10.3ns   // main ariticle's code
  STRCMP                        17.6ns   // most commen verbs first

The FOURCC codes as in the parent only work if you can assume you know the character after the verb to be a space or other known character because GET and PUT are too short. They also require disambiguating code for the 5+ letter verbs but these are all relatively rare (probably).

My strings are all 4 byte aligned, so the FOURCC case may be getting an unfair boost, it never generates an unaligned read which possibly is slower.

Test data is a set of jumbled GET,HEAD,POST,PUT lookups that I made up to meet my idea of a web server.

There are some worst cases hiding out there, for instance all PROPPATCH verbs…

  FOURCC                        14.7ns
  FOURCC-without-alignment-bug  17.7ns
  NGINX                         19.8ns
  STRCMP                        37.4ns

Feels like gcc sequentially tests case values. Make sure to put them in the order for common verbs to match first. The FOURCC codes get the most control here. NGINX is most consistent, probably from its two layered lookup nature.

So there you have it: nginx's algorithm is significantly better then strcmp, and fairly independent of verb distribution. A FOURCC lookup is faster, even if made safe for all architectures. And finally, strcmp isn't that bad (freebsd anyway, the GNU libc is about half as fast as seen in further testing).

(I am left with a nagging desire to check a perfect hashing function though.)

Edit: gperf is about 50% slower than the nginx method. No need to go that direction for perfect hashing.

viraptor · on March 27, 2010

I'm not sure if you were just curious, or were you trying to prove that nginx is right in optimising this. But since we have the figures - assuming strcmp takes 37ns, we could estimate that wikipedia spent whole 13 seconds parsing the request verb each day in 2008 (assuming 350M visits). Cutting it down by half would definitely be worth the programming time... not.

jws · on March 28, 2010

I was mostly just calbrating my opimization decision making.

I think only considering wikipedia visits is underestimating. A visit is many http transactions, and a webserver is targetted for many times the traffic of wikipedia. Still, it is likely no one ever felt a better user exprience from this one optimization.

But what if all the code is 33% faster than strcmp style code? That would be nice, but also seems unlikely.

My biggest two takeaways:

— beware your OS. The freebsd (mac) strcmp is about twice as fast as the GNU glibc.

— forget about gnu's gperf, it is extra complexity for no benefit. Maybe it would win on very long word lists, but I don't know.

jws · on March 27, 2010

So what about other architectures? How about a 1.6GHz Atom 330 processor. This doesn't have out of order execution and should more closely match ARM and MIPS type small machines. (Sorry, I can't get one of those booted from where I am now.)

  FOURCC                        23.3ns
  FOURCC-without-alignment-bug  27.6ns
  NGINX                         29.8ns
  STRCMP                        70.5ns  (glibc)

The difference between FOURCC and NGINX is vanishing here.

How about a 3GHz core-i7 840

  FOURCC                         4.8ns
  FOURCC-without-alignment-bug   5.6ns
  NGINX                          8.3ns
  STRCMP                        13.9ns  (freebsd)

Not a lot faster than a Core-2, but some. Of course it has twice the cores plus hyperthreading ready to go.

How about if you are cheap (like me) and got a 2.93GHz core i3 for your server:

  FOURCC                         5.3ns
  FOURCC-without-alignment-bug   5.8ns
  NGINX                          9.9ns
  STRCMP                        33.6ns  (glibc)

Not bad for the little guy. (Saved $100 on the cpu, plus didn't need to add a video card with some goofey driver, and its system wide low power state is almost as good at the Atom's.)

Dav3xor · on March 28, 2010

Most ARM's will be significantly slower on non-aligned data.

Un^H^Hfortunately I'm not at work to test. :)

jared314 · on March 27, 2010

That snippet appears to be out of date. Version 0.8.34 does something similar to what you suggest.

  #define ngx_str3Ocmp(m, c0, c1, c2, c3)                                       \
    *(uint32_t *) m == ((c3 << 24) | (c2 << 16) | (c1 << 8) | c0)

yosh · on March 27, 2010

Careful with casting strings to ints like that, it'll break on architectures that require aligned accesses, i.e. if you try to load a uint32, the address of the memory location must be a multiple of 4. x86 happens to be fine with non-aligned access, but many other architectures aren't.

If you look at the nginx source, that #define is guarded under NGX_HAVE_NONALIGNED and there is a fallback that does straight byte by byte comparison.

viraptor · on March 27, 2010

> fallback that does straight byte by byte comparison

That's not even funny anymore... I wonder how many iterations will it take, until they discover that strncmp compiles down to "repz cmpsb" (or equivalent) and you really cannot do better than that with if-s on each character.

oldgregg · on March 27, 2010

I like to think I'm a decent programmer.

Then I read threads like this. :o) Keep learning...

phoboslab · on March 27, 2010

Is this first code snippet, where the HTTP parser checks for "POST" or "COPY" real? It certainly reads like a joke on how not to optimize your code.

Also, wouldn't it be faster to check all four characters at once with a 32bit compare? And what about branch prediction of the CPU?

ccarpenterg · on March 27, 2010

What is a 'branch prediction of the CPU'?

alexgartrell · on March 27, 2010

The CPU has an instruction cache (it reaches out to memory for the series of instructions and grabs a whole bunch at once to be more efficient). When there's a branch (logically, an if statement, a loop, or a goto, or a switch) it has to make a decision about whether or not it should cache the instructions in the if statement or the instructions following the if statement.

Within the linux kernel the macros likely and unlikely are defined so that you can have code of the style

  if(likely(x > 10)) {
    /* These instructions, which probably will be executed,
     * will be cached 
     */
  }

  if(unlikely(ptr == NULL)) {
    /* These instructions, which probably won't need to be 
     * executed, will not be cached
     */
  }

vier · on March 27, 2010

that's exactly what it's doing

_pius · on March 27, 2010

The author later wrote a piece on how this guide allowed him to begin making money as an open source contractor: http://www.evanmiller.org/contracting-advice.html

TimothyFitz · on March 27, 2010

After his contracting he worked with me at IMVU; I learned a lot from him. He left IMVU for the University of Chicago Economics PhD program.

I can't wait to see what he'll do next.

oski · on March 27, 2010

The final version of the guide is here: http://www.evanmiller.org/nginx-modules-guide.html

rmk · on March 27, 2010

The lxr link to the nginx source code is not working... Where could I find it?

tdmackey · on March 27, 2010

google.com

http://nginx.org/en/download.html