How We Designed for Performance and Scale (nginx.com)
293 points by fcambus on June 10, 2015 | hide | past | favorite | 62 comments


Btw, there are new kernel features to help avoid the shared-memory accept mutex: EPOLLEXCLUSIVE and EPOLLROUNDROBIN.

These should round-robin accepts in the kernel, instead of waking up all the epoll listeners.

https://lwn.net/Articles/632590/


Isn't this exactly what EPOLLET (edge triggering) does?


Edge vs level triggered refers to how a change in an fd's state is reflected in userspace: level triggering will cause repeated wakeups while the condition remains true, whereas edge triggering will cause exactly one, until the monitored state flips from true to false and back again.

The new flags relate to what happens when multiple threads are waiting on the same set of file descriptors (i.e. sleeping on the same epoll FD passed as the first parameter to epoll_wait(), or having the same client FD in multiple epoll sets -- sorry I'm not sure which way around it is).

Previously the kernel had no support for waking exactly one thread to handle one event, so if there was a single set shared among a bunch of sleeping tasks, all tasks would be scheduled, causing (presumably) synchronization contention on the kernel's internal structures. The new mode ensures only a single task is woken up for an event on a single FD, even when multiple tasks are waiting on it.


This is a lovely article, but:

> The fundamental basis of any Unix application is the thread or process. (From the Linux OS perspective, threads and processes are mostly identical; the major difference is the degree to which they share memory.)

It's better to be specific in performance discussions, rather than use 'thread' and 'process' interchangeably.

Beyond the memory sharing the article mentions, threads (called Lightweight Processes, or LWPs, in Linux 'ps') are the granular unit:

    ps -eLf
NLWP in the command above is 'number of lightweight processes', i.e. the number of threads.

Processes are not granular: they're one or many threads. IIRC it can be beneficial to assign threads of the same process to the same physical core or the same die for cache affinity. There are all kinds of performance situations where 'threads' and 'processes' do not mean the same thing. Being specific is rad.
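
A quick Python sketch of the distinction (assumes Python >= 3.8 for `threading.get_native_id`): all threads of one process share a PID, but each is a separate schedulable LWP with its own kernel thread ID.

```python
import os
import threading

results = []

def record():
    # Same PID for every thread; a distinct native (kernel) thread ID each.
    results.append((os.getpid(), threading.get_native_id()))

threads = [threading.Thread(target=record) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

pids = {pid for pid, _ in results}
tids = {tid for _, tid in results}
assert len(pids) == 1  # one process
assert len(tids) == 3  # three LWPs -- what `ps -eLf` counts in NLWP
```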


In Linux, they are both just entries in the process table. They are created by the `clone` syscall, and the "normal" ways of creating them simply share different amounts of resources by _default_.

You're right to say that treating them differently can be beneficial in some situations, but it really depends.


Yep. I guess the most basic case is: "is this single process multithreaded, as I have a multicore machine and there's only one PID"


Aside: you really nailed the "I think they're saying something significantly incorrect but don't want to be a jerk about it" tone. Kudos.


> You can reload configuration multiple times per second (and many NGINX users do exactly that)

I thought this was an interesting remark. Can anyone clue me in to what these "many users" might be doing, that requires them to reload configuration so frequently?


Service discovery. Dynamically generating routing rules so microservices can have pretty URLs. For example, with us, when you deploy a new version of a microservice, it starts on a random port, registers with Consul on startup, and then dynamically regenerates and reloads Nginx.


I know of at least one VPS company that built its load balancer SaaS offering on NGINX; it automatically reloads whenever a node is added or removed.

So I would assume it's that function causing such rapid reloads. [e.g. if you had a 50-node pool, you might go +/-5 over a span of a second and change the config 5 times in 1s]


I front my docker fleet with Nginx and I need to reload config for new hosts


Dynamically generating urls, like tumblr or github pages? (just a guess)


"NGINX’s binary upgrade process achieves the holy grail of high-availability; you can upgrade the software on the fly, without any dropped connections, downtime or interruption in service."

Is this really true? I remember seeing an article[1] recently on using an iptables hack to prevent dropping connections when reloading haproxy. Does nginx actually provide zero-downtime configuration reloads?

[1] https://medium.com/@Drew_Stokes/actual-zero-downtime-with-ha...


I didn't want to be a negative nancy in the comments for that haproxy article... but that is a ridiculously ugly hack. It's really a lot easier to achieve real and robust zero-downtime upgrades for a simple unix process.

Remember, fork()ed and exec()ed processes inherit file descriptors (except those marked CLOEXEC), including the listen() fd. Pending connections will queue in the kernel until userspace calls accept() on the listening fd.

So one simple model is to stop calling accept(), cleanly/quickly finish up current connections, set an environment variable to tell the future instance that the listening fd X is already open, and exec your own binary again.

A more complicated one is to fork, have the (identical) child just finish the current connections, while the parent execs itself similar to above. (The client connection fds should be marked CLOEXEC in this case.)

With a more complicated service with more moving parts, libraries, threads, getting the above to work out is more complicated. But that's basically how you want to do it.
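
The simple model above can be sketched in Python (the env var name LISTEN_FD is made up for illustration): the old instance clears CLOEXEC on the listening fd and records its number in the environment; after exec()ing itself, the new instance rebuilds a socket object around that fd and keeps accepting. For brevity the exec is skipped here and the "new instance" steps run in the same process.

```python
import os
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(16)
addr = listener.getsockname()

fd = listener.detach()             # give up ownership; the fd stays open
os.set_inheritable(fd, True)       # clear CLOEXEC so it survives exec()
os.environ["LISTEN_FD"] = str(fd)  # tell the future instance which fd it is

# --- what the re-exec()ed binary would do on startup ---
inherited_fd = int(os.environ["LISTEN_FD"])
new_listener = socket.socket(fileno=inherited_fd)

# Connections pending in the kernel's queue are served by the new code.
client = socket.create_connection(addr)
conn, peer = new_listener.accept()
conn.close()
client.close()
new_listener.close()
```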


I program all my server programs to have a --listen-fd=N command line option. That way, the process that starts the server is responsible for creating the listening socket, and it can create multiple server instances using the same socket. The servers can then be programmed to handle SIGUSR1 by no longer calling accept, and terminating when the last client exits. This way, the amount of code required in each server program is tiny; less than 15 lines, with no calls to fork or execve.

systemd already supports creating listening sockets, but its mode of operation is more similar to inetd. I think supporting something like what my proprietary launcher program does would require very few changes.

Another advantage of passing the listening socket like this is that the server process can be in a private network namespace, without requiring any type of NAT setup or port mapping.
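
A hedged sketch of the launcher idea in Python (the --listen-fd=N flag is this commenter's own convention, not a standard): the launcher owns the listening socket and starts server instances that inherit it. The child below is a stand-in "server" that accepts one connection on the fd it was handed.

```python
import socket
import subprocess
import sys

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(16)
fd = listener.fileno()

child_code = """
import socket, sys
fd = int(sys.argv[1].split("=")[1])  # parse --listen-fd=N
srv = socket.socket(fileno=fd)       # adopt the inherited listening socket
conn, _ = srv.accept()
conn.sendall(b"hello")
conn.close()
"""
server = subprocess.Popen(
    [sys.executable, "-c", child_code, "--listen-fd=%d" % fd],
    pass_fds=(fd,),  # keep this fd open (same number) across the exec
)

client = socket.create_connection(listener.getsockname())
data = client.recv(5)
client.close()
server.wait(timeout=5)
listener.close()
```

Since the launcher keeps the socket, it can spawn several instances off the same fd, or restart a crashed one without dropping the listen queue.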


Yes, nginx allows you to upgrade the binary and reload the configuration without any downtime: http://wiki.nginx.org/CommandLine#Upgrading_To_a_New_Binary_...


Does this work when you update with package manager?


Yes.


The holy grail of high-availability isn't upgrading stateless software. It's upgrading stateful ones.

Like upgrading when a data structure changes between versions. The HTTP protocol nginx serves is stateless and by comparison far simpler. The same goes for Erlang: it offers nothing more than simple function replacement, and that's not enough to handle data structure changes either.


In Erlang, if you need a new data structure for your state, you can check if your state is old, and upgrade it and continue on. In a gen_server, you might have something like

  handle_call(Request, From, State) when is_record(State, state) -> handle_call(Request, From, upgrade_state(State));
  handle_call(Request, From, State) when is_record(State, state2) -> ...
(you'll want to do something similar on handle_cast and handle_info if you use those). You have to do a little work, but I don't see how you avoid that?


That's not guaranteed to be safe.

Having new code check if your state is old and upgrading isn't enough. You also need to check old code doesn't process new state. That becomes harder under concurrency.

http://en.wikipedia.org/wiki/Dynamic_software_updating#Updat...

Erlang doesn't offer this check.

"Old code may still be evaluated because of processes lingering in the old code."

http://www.erlang.org/doc/reference_manual/code_loading.html...


If you write the code within the `gen_server` guidelines, state migration is supported by `code_change`:

http://www.erlang.org/doc/man/gen_server.html#Module:code_ch...

For example:

http://stackoverflow.com/questions/1840717/achieving-code-sw...

BTW, You can even support downgrade. :)


> Having new code check if your state is old and upgrading isn't enough. You also need to check old code doesn't process new state. That becomes harder under concurrency.

If we're talking about a gen_server, the state is per process, and once the process has switched to the new code, it won't go back, so there's no problem with old code and new state. In non gen_server code, you do need to be careful about when you hit a boundary that gets you into new code; you'd typically want it to be your process's main loop, since that usually tail recurses and doesn't leave a stack in the old code. It is difficult to reason about a situation where you call into new code, and that returns to old code; it's much better to avoid it.

The concurrent case is OK too, each process manages its own state, and upgrades it when it switches to new code. Are you thinking about changes to messages that are being passed and/or global state? In that case, like with any distributed system, you need to load in stages: first load code that can handle old and new messages, then trigger sending new messages (code load or config setting), then load code that only handles new messages.
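
The staged rollout at the end can be sketched language-agnostically in Python (the message shapes here are made up): stage 1 deploys a handler that accepts both formats, stage 2 flips senders to the new format, and only then can stage 3 drop old-format support.

```python
def handle_v1(msg):
    # Original handler: understands only the old, flat format.
    return msg["user"] + ":" + msg["action"]

def handle_transitional(msg):
    # Stage-1 handler: understands the new nested format...
    if "payload" in msg:
        return msg["payload"]["user"] + ":" + msg["payload"]["action"]
    # ...while still accepting the old one during the rollout.
    return msg["user"] + ":" + msg["action"]

old_msg = {"user": "alice", "action": "login"}
new_msg = {"payload": {"user": "alice", "action": "login"}}

# While old and new senders coexist, both formats are handled identically.
assert handle_v1(old_msg) == "alice:login"
assert handle_transitional(old_msg) == handle_transitional(new_msg)
```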


You touched on most points, primarily avoiding the situation of reasoning about concurrent old and new code. I didn't know a gen_server manages code loading like this, thank you for that.

An upgrade doesn't involve only the in-memory state per process, though. It also involves state outside the process, like state on disk. Even if each process upgrades its own state (I'm assuming the gen_server isn't limited to in-memory state; I don't know), an old process reading from disk a data structure that differs from the one used by the new process isn't safe. You can't just upgrade old processes in stages.

An upgrade can also involve multiple processes. It's hard to upgrade all of them at once. As you mentioned, in the hardest case of all, a distributed system, loading in stages may be the only option, provided the system was explicitly designed such that old and new processes can coexist without safety issues.

http://pmg.csail.mit.edu/upgrades/


Synchronize to a checkpoint. Serialize all state. Run data structure upgrading code. Deserialize upgraded state.

It's of course possible to do in place, but I'd imagine serialization makes testing it easier. Not to mention sending error reports if something goes wrong: having all that state in a bug report could help a bit!
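
A toy sketch of that checkpoint approach (json stands in for whatever serializer you'd use; the v1/v2 shapes are invented for illustration): quiesce, serialize all state, run the migration, deserialize in the new version.

```python
import json

v1_state = {"version": 1, "counters": [3, 5]}

def migrate_v1_to_v2(state):
    # Hypothetical v2 shape: counters keyed by name instead of position.
    return {
        "version": 2,
        "counters": {"requests": state["counters"][0],
                     "errors": state["counters"][1]},
    }

checkpoint = json.dumps(v1_state)   # serialize at the checkpoint
loaded = json.loads(checkpoint)
if loaded["version"] == 1:          # upgrade old structures
    loaded = migrate_v1_to_v2(loaded)

assert loaded == {"version": 2,
                  "counters": {"requests": 3, "errors": 5}}
```

The serialized checkpoint is exactly the artifact you'd attach to a bug report if the migration fails.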


What about running websockets?


Yes, it's true. NGINX does zero-downtime configuration reloads and binary upgrades since 2004 without any dirty hacks.


The irc client irssi has something like this in /upgrade, where it spawns a new binary but passes along active socket connections and their associated irc room states, I believe.


WeeChat definitely does this. I don't know whether irssi supports it as well. Too bad WeeChat can't keep TLS connections open on /upgrade, though.


yeah, WeeChat is one of the rare C apps to actually preserve state on upgrade (passes fds, saves state, loads in the new process, kills the old process)

it's not that hard to do if the program is made for this, even in C; it's pretty hard to add when it isn't, though, since you need a compatibility layer.

Sometimes the new program simply has import functions for the old memory layout.


If it's the holy grail, then erlang has had it for a couple decades... and whether it actually works in nginx I don't know, but it does work in erlang.

The way it's handled is that the process where the previous socket was connected remains in place until it terminates but all new sockets connect to the new code.


"The way it's handled is that the process where the previous socket was connected remains in place until it terminates but all new sockets connect to the new code."

That would imply that two separate processes are bound to the same port right? I thought that was not possible.


No, usually there's an accept process spawning worker processes and passing the connection off to them.

Erlang can have two versions of a module present in memory at a time. At the moment of a code upgrade, the processes working with existing connections continue to use the old code while newly accepted connections are passed to the version of the module that was loaded in between connection accepts.


Keep in mind that Erlang processes != System processes. Erlang processes are preemptively scheduled microthreads that are scheduled by the Erlang VM.


Things have changed :)

https://lwn.net/Articles/542629/


If only the JVM could take advantage of this.


I'd just about bet you could do that via some grotesque JNI work, but I wouldn't wish JNI on anyone.


While there are certain socket options to share a port, two processes can also share one in another elegant way: just call fork().

One way of doing such upgrades is then for the existing code to fork() and exec() the new version. The old code closes the original server socket, which leaves accepting new requests to the new code. When all in-flight requests served by the old code are finished, that process can exit.
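
The socket-option route mentioned in passing can be sketched with SO_REUSEPORT (assumes Linux >= 3.9, where independent sockets, even in unrelated processes, can bind the same port and have the kernel spread accepts among them; both listeners live in one process here for brevity):

```python
import select
import socket

def make_listener(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT lets multiple sockets bind the same address.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(16)
    return s

a = make_listener(0)
port = a.getsockname()[1]
b = make_listener(port)  # second bind to the very same port succeeds

client = socket.create_connection(("127.0.0.1", port))
# The kernel assigns the connection to exactly one listener's queue.
ready, _, _ = select.select([a, b], [], [], 1.0)
conn, peer = ready[0].accept()
conn.close()
client.close()
a.close()
b.close()
```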


Presumably the original server process uses exec() to start the new process, which inherits the open socket handles, and the descriptor is passed as an argument to the new process.

(e.g. http://stackoverflow.com/questions/14351147/perl-passing-an-...)


This is known to me as a "lame duck" mode, and I thought it was a standard thing to do!


One thread per CPU and non-blocking I/O: that sounds like the usual way to approach the problem. I'm surprised it uses state machines to handle the non-blocking I/O, because modern software engineering provides much more pleasant approaches, such as coroutines.


Coroutines have the distinct disadvantage of needing a stack, much like threads. So-called 'stackless' coroutines aren't really so different from computed gotos in a state machine.


Stackless coroutines are regular coroutines except that there's one stack in the scheduler; when a coroutine yields, the current stack state is saved into the coroutine's structure to be restored later. The disadvantage is that you need to copy out/in every time a coroutine yields.

It's unrelated to gotos in a state machine.

Having a stack per coroutine is not a big deal, especially if the coroutine library regularly advises the kernel on the memory areas it isn't using (madvise).

What's stored on the stack usually needs to be stored somewhere else anyway, and ends up using a similar amount of memory.

When you allocate a 1MB stack per coroutine, the kernel will not wire it all to ram, but only the pages that have been touched. When the stack shrinks back and the coroutine yields, the scheduler can call madvise and inform the kernel that the no longer used pages from the 1MB can be reclaimed.


For parsing HTML, you wouldn't need that large a stack. Further, you could make the stack grow dynamically at very small cost. Also, I bet that even if stackless coroutines are close to FSMs, they are much easier to program.


You need green/user-space threads if you want to use co-routines effectively, because if a coroutine blocks, it would block off the entire OS thread from doing anything else.


What are stackless coroutines? What do they look like?


Read Adam Dunkels' paper on Protothreads.


Thanks for your comment.

I don't know how to feel about Protothreads. It's just syntactic sugar for using a state machine to provide continuations. You're also not allowed to carry state over between continuations (although I can see how you could extend it to carry over some struct of data). This greatly diminishes the usefulness of Protothreads, so it feels more like a fun proof of concept.

Are there any popular/real-world use cases of Protothreads?


Yeah, that's how you end up doing this: store coroutine state in a struct, with an integer that lets you decide where to jump to when you resume the coroutine. With compiler support that can be pretty efficient (gcc has computed goto, for instance). Things get quite awkward in C but similar techniques are manageable in C++.

I've never used Protothreads before, but Contiki (the tiny operating system written by the same person) uses them for its process implementation.

A similar technique is used by PuTTY as well: http://www.chiark.greenend.org.uk/~sgtatham/coroutines.html
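
The struct-plus-integer technique described above can be transliterated into Python as a sketch: the "coroutine" is a plain record (here a dict) holding its locals plus an integer resume point; each call dispatches on that integer, does one step, records where to resume, and returns. This is essentially what protothreads / Duff's-device coroutines compile down to.

```python
def make_counter_task(limit):
    # All "local variables" live in the struct, not on the stack.
    return {"resume": 0, "i": 0, "limit": limit, "out": []}

def step(task):
    # One scheduler step; returns True while there is more work to do.
    if task["resume"] == 0:          # entry point
        task["i"] = 0
        task["resume"] = 1
        return True
    if task["resume"] == 1:          # loop body, one iteration per step
        if task["i"] < task["limit"]:
            task["out"].append(task["i"])
            task["i"] += 1
            return True
        task["resume"] = 2           # finished
    return False

task = make_counter_task(3)
while step(task):
    pass  # a real scheduler would run other tasks between steps
assert task["out"] == [0, 1, 2]
```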


nginx is written in C though - I know it's possible to do coroutine-type stuff with setjmp/longjmp, but isn't that considered risky?


setjmp()/longjmp() will work, but they're sort of inefficient as, at least under POSIX, they will save and restore the signal mask, which makes for two round-trips to the kernel just for a coroutine context switch.

My web server uses coroutines, and for x86 and x86-64 it uses open-coded assembly routines to yield/resume, with fallbacks to setjmp()/longjmp() on other architectures.

It works fairly well, performance-wise. In fact, it's one of the top-performing servers/frameworks in the TechEmpower's Web Framework benchmarks[1].

I wrote a similar article explaining how everything is put together here[2].

[1] https://www.techempower.com/benchmarks/ [2] http://tia.mat.br/blog/html/2014/10/06/life_of_a_http_reques...


Given the various efforts like libtask, lthread, boost coroutines, etc, it seems like the low-level assembly trickery ought to be isolated and standardized. Maybe some new methods (similar to the setcontext family) should be proposed to the glibc project. Something not at risk of deprecation.


The Boost guys factored the context-switching and stack-allocation stuff into a separate library, Boost Context[0], which supports a bunch of architectures[1]. I'm sure their fcontext code could be lifted. They claim a switch takes about 8ns on modern x86-64[2].

The Boost Coroutine library is built on top, adds type safety, ensures the stack is unwound when contexts are destroyed, and enables propagation of exceptions across switches.

    $ ls -lh /usr/lib/libboost_context.so.1.58.0 
    -rwxr-xr-x 1 root root 55K May 30 09:58 /usr/lib/libboost_context.so.1.58.0
[0] http://www.boost.org/doc/libs/1_58_0/libs/context/doc/html/i...

[1] http://www.boost.org/doc/libs/1_58_0/libs/context/doc/html/c...

[2] http://www.boost.org/doc/libs/1_58_0/libs/context/doc/html/c...


Not sure if something like this should be part of glibc. Just recently there was an ABI break due to the fact that jmp_buf is exposed in the headers to allow embedding the struct[1].

[1] https://lwn.net/Articles/605607/

(Also, I ended up mixing up ucontext.h with setjmp.h in my comment above; Lwan uses ucontext.h as a fallback. There are coroutine implementations that will use setjmp/longjmp, or at the very least reuse the jmp_buf struct and roll their own asm, though.)


Glibc seems like a good place to put it to me. If everyone is preferring hand-coding assembly primitives that workaround suboptimal standard methods (ucontext), then my first instinct would be to look into making those existing methods more optimal.

(Note: in case it wasn't clear, I'm not suggesting we put all of coroutines into glibc, just the stack dancing stuff)


I did as they said at the bottom and gave them my e-mail and other personal details so I could download the eBook they were giving free preview copies of, "Building Microservices". Unfortunately, they sent a link to a PDF only, so it's not usable for me. Just a heads up so others can save themselves the time of discovering that. (I'll just wait until the book is finished and then buy it so I get the ePub. I like O'Reilly and have bought many books there before.)


If you have a kindle you can send a pdf to your kindle email with the subject 'convert' -

http://www.amazon.com/gp/sendtokindle/email


The book is out now from O'Reilly; it's a good book.


I had a feeling I've read about nginx before: http://aosabook.org/en/nginx.html

The whole book is worth a read, although I found some sections painfully boring (perhaps my limited attention span is to blame).


There's a 3rd volume "The Performance of Open Source Applications" now, and it has a chapter on another high performance HTTP server, Warp:

http://www.aosabook.org/en/posa/warp.html

Interesting what kind of performance one can get out of GHC nowadays. Article says the authors of Warp had to implement a new parallel IO manager for GHC to get there, but that was merged into GHC 7.8.


Interesting overview. I wish they had some data comparison which could explain the significance and efficiency of this approach vs other/old approaches.


They forgot to mention the pool-allocated buffers, zero-copy strings, and very clean, layered codebase; every syscall is counted.

The original nginx is a rare example of the best in software engineering: deep understanding of principles and obsessive attention to detail. Its success is justified.



