You had me until you said Asynchronous I/O is faster than Synchronous I/O in Jav...

alpeb · on May 15, 2014

The Rob von Behren slide is priceless: dude started building an async server and at the end realised he had written the foundation for a threading package. I once asked myself the same question and did a little micro benchmark https://github.com/alpeb/io_benchmarks concluding IO was faster. Even so, I ended up doing all my stuff in Scala+Play because it's an immense pleasure and it's so easy to scale.

lbarrow · on May 16, 2014

I worked with rvb closely last year and we always used to make fun of him whenever this debate came up. Then one day we realized he had co-authored a paper with Eric Brewer on the topic: https://www.usenix.org/legacy/events/hotos03/tech/full_paper...

xxs · on May 15, 2014

It's a highly cited "paper" and just as wrong. Blocking I/O is implemented via poll(), NIO is epoll. Blocking IO has to copy the array at least once (on the stack for <64kb, else malloc/free). Most NIO implementations use heap ByteBuffers and a multitude of copies, which is their downfall.

Blocking IO cannot have predictable latency under load, at the very least: you are left at the mercy of the OS thread scheduler. For various reasons (e.g. mutator threads should not block the GC and compiler ones) thread priorities are not honored.

I'd argue that a well-written NIO impl (and there are virtually no good open source ones) will flat out beat any blocking one. NIO is both faster and offers better, more predictable latency under load.

MrBuddyCasino · on May 16, 2014

Ok, now I'm curious: what is a good implementation? Also, what's wrong with ByteBuffers? I was under the impression that they are usually memory-mapped and should be 0-copy.

xxs · on May 16, 2014

There are 2+1 major types of ByteBuffers:

Heap: backed by byte[] (or char[], int[], etc).

Direct: backed by C memory allocated via mmap (on Linux). mmap can map to RAM or to a file.

Memory-mapped files are not an interesting case for an NIO impl that works with sockets. (On a side note: FileChannel.transferTo(SocketChannel) doesn't involve memory mapping when the kernel supports it. Windows never supports it, though.)
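To make the heap/direct distinction concrete, here is a minimal sketch that uses only standard JDK calls (ByteBuffer.allocate, ByteBuffer.allocateDirect, isDirect, hasArray). The class name BufferKinds and the describe helper are made up for illustration, and the sketch says nothing about how any particular NIO framework picks its buffers.

```java
import java.nio.ByteBuffer;

class BufferKinds {
    // Classifies a buffer using only the queries ByteBuffer itself exposes.
    static String describe(ByteBuffer buf) {
        if (buf.isDirect()) {
            return "direct";                      // native memory outside the Java heap
        }
        return buf.hasArray() ? "heap" : "other"; // "other" covers e.g. read-only heap views
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(4096);         // backed by a Java byte[]
        ByteBuffer direct = ByteBuffer.allocateDirect(4096); // allocated by the JVM in native memory
        System.out.println(describe(heap) + ", capacity " + heap.capacity());
        System.out.println(describe(direct) + ", capacity " + direct.capacity());
    }
}
```

When a heap buffer is handed to a socket channel, the JDK typically has to copy its contents into native memory first, which is the extra copy being discussed here; a direct buffer can be passed to the kernel as it is.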

Most impls use heap ByteBuffers; parsing then requires state machines, and those are often simplified by copying the buffers. Blocking IO doesn't really need a state machine, as the stack serves that purpose. Then there is usually some reactive-like pattern (submitting tasks to an ExecutorService) that costs some more latency. Certainly, it's easier to work with and reason about, yet the more hand-offs there are, the worse the performance/latency is. There are also minor issues like the choice of a good queue. It is an important one, as Java lacks MultiProducer/SingleConsumer queues out of the box, or even single producer/single consumer ones. Java does have MP/MC queues (CLQ is an outstanding one) but one has to pay some extra price (incl. false sharing sometimes) to use them.

Ultimately, blocking IO cannot be "faster" than NIO per se, since under the hood it uses poll(2)[0] with one socket. Before that it copies the Java byte[] to a new location; for smaller byte[] that is the stack. Technically, one can blow up the JVM if the stack is very small when entering socket.getOutputStream().write(byte[]).

Lastly, Selector.wakeup() has a stupid issue: it enters a synchronized block each time, even if there is already an outstanding wake-up request. Wakeup requests are implemented via pipes on Linux (and a socket pair on Windows), which requires a kernel mode switch. During the wakeup, all the threads attempting to hand over a task block on that very selector for no real reason. It can be worked around with a CAS, so that only one thread actually enters the monitor.
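The CAS workaround could look roughly like the sketch below. It is only an illustration under my own assumptions: the GuardedWakeup class and its method names are hypothetical, and the only JDK calls used are Selector.open, select, wakeup, close and AtomicBoolean.compareAndSet. Producers are assumed to enqueue their task first and call requestWakeup() afterwards, and the selector thread is assumed to drain the task queue after select() returns.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;
import java.util.concurrent.atomic.AtomicBoolean;

class GuardedWakeup {
    private final Selector selector;
    // true while a wakeup has been requested but not yet consumed by the selector thread
    private final AtomicBoolean wakeupPending = new AtomicBoolean(false);

    GuardedWakeup(Selector selector) { this.selector = selector; }

    static GuardedWakeup open() {
        try {
            return new GuardedWakeup(Selector.open());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Producer side: only the thread that wins the CAS pays for Selector.wakeup().
    // Returns true if this call actually invoked wakeup().
    boolean requestWakeup() {
        if (wakeupPending.compareAndSet(false, true)) {
            selector.wakeup();
            return true;
        }
        return false; // a wakeup is already outstanding, nothing to do
    }

    // Selector thread: block in select(), then re-arm the flag before draining tasks.
    int select() {
        try {
            int ready = selector.select();
            wakeupPending.set(false);
            return ready;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void close() {
        try {
            selector.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        GuardedWakeup g = GuardedWakeup.open();
        System.out.println("first request woke selector: " + g.requestWakeup());
        System.out.println("second request woke selector: " + g.requestWakeup());
        g.select(); // returns at once because a wakeup is pending
        g.close();
    }
}
```

Clearing the flag right after select() returns and before the queue is drained matters: a producer whose CAS fails has already enqueued its task, so the selector thread will still see it on the drain that follows.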

I will repeat myself: blocking IO doesn't have predictable latency, and it cannot be enforced. In the end it's all about latency: bandwidth can be bought and more machines deployed, but you can't buy latency.

[0] http://linux.die.net/man/2/poll

Some internal stuff on heap vs direct buffer: http://stackoverflow.com/a/11004231

MrBuddyCasino · on May 16, 2014

Ok, I was aware of the difference between Direct vs. Heap ByteBuffers, and I guess I understand the argument about poll/epoll. Now what I don't quite get is why most open source projects chose to use the slower implementation. Don't they know any better? Is it a portability issue? Bugs in certain JVM/OS combinations? Netty.io claims to be 0-copy capable, so I guess that this must be one of the good ones that are available?

xxs · on May 16, 2014

After seeing the message I decided to check netty.io's code and I am pleasantly surprised. It has been ages since I last checked the project. They use almost all the tricks in the book: a CAS around selector.wakeup(), handling of zero returned keys, a ref-counting buffer allocator, even an MP/SC queue.

Only a couple of downsides: 1) there appears to be a lack of bounded queues, and that is a non-trivial one. Bounded queues are important to ensure proper back-pressure on 'producers' and/or to kill slow peers. 2) the encoding pipeline may require serializing the same message multiple times when sending to multiple clients, even if the serialization results in the same byte stream. However, this is really a minor issue.
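For what "bounded queue with back-pressure" means in practice, here is a small sketch built on the JDK's ArrayBlockingQueue. The BoundedOutbox class and its methods are hypothetical names for illustration and are not part of Netty; what to do when offer() fails (drop the message, throttle the producer, or disconnect the slow peer) is left to the caller.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedOutbox {
    // Fixed-capacity queue of messages waiting to be written to one peer.
    private final BlockingQueue<byte[]> pending;

    BoundedOutbox(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: never blocks. A false return is the back-pressure signal,
    // meaning the peer is not draining its queue fast enough.
    boolean offer(byte[] message) {
        return pending.offer(message);
    }

    // I/O thread side: next message to write, or null if nothing is queued.
    byte[] poll() {
        return pending.poll();
    }

    int size() {
        return pending.size();
    }

    public static void main(String[] args) {
        BoundedOutbox outbox = new BoundedOutbox(2);
        for (int i = 0; i < 3; i++) {
            boolean accepted = outbox.offer(new byte[] {(byte) i});
            System.out.println("message " + i + " accepted: " + accepted);
        }
    }
}
```

With an unbounded queue the third offer would simply succeed and memory would keep growing for as long as the peer stays slow.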

Like I've said I'm pleasantly surprised.

kasey_junk · on May 15, 2014

This article comes up every time someone mentions I/O "speed". You have to be very careful with making blanket claims in either direction.

These tests were a particular kind of network traffic, and were testing for max throughput across many connections. Minimizing latency, smaller message sizes, and/or fewer connections can make the decision to use NIO vs standard IO libraries come down in different places. (Not to mention that you cannot program the standard IO libraries in a no alloc form).

pron · on May 15, 2014

That's not what I said, or at least not what I meant. If you have blocking operations (that take a long time), then thread-blocking (as opposed to fiber-blocking or async) IO will require too many threads.

cpprototypes · on May 15, 2014

I've heard of Quasar before and had a general idea of what it is, but didn't look at the documentation carefully until now. My understanding is that I can run arbitrary synchronous code in Fibers? For example, consider the MongoDB client library:

  DBObject r = collection.find(query);

It blocks while getting the results of the query. If I do something like this:

  for (int x=0; x < 100000; x++) {
    new Thread(() -> { DBObject r = collection.find(query); }).start();
  }

it's going to start 100,000 threads and freeze my computer. However, with Fibers I can do:

  for (int x=0; x < 100000; x++) {
    new Fiber(() -> { DBObject r = collection.find(query); }).start();
  }

and it will work fine since these are lightweight threads (like Go goroutines). I guess my main question is: can I run arbitrary unmodified synchronous code like this in Fibers, or would the library have to be modified to support it? In this case, would someone have to update the MongoDB library to add support for Fibers?

eitany · on May 15, 2014

Hi. You don't have to change the library in order to make it work with fibers. You have to wrap it. If it has an efficient implementation of an async API, you implement the fiber-synchronous API on top of the asynchronous one. If it doesn't, you have to wrap it with a thread pool. You can take a look at the implementation of the JDBC wrapper here: https://github.com/puniverse/comsat/tree/master/comsat-jdbc
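As a rough idea of the "wrap it with a thread pool" half, here is a JDK-only sketch. It does not use Quasar or Comsat at all: BlockingCallWrapper is a hypothetical class, and the returned CompletableFuture merely stands in for whatever a fiber would park on. The point is only that the blocking call runs on a small fixed pool instead of on one thread per request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

class BlockingCallWrapper {
    // A small fixed pool absorbs the blocking calls; callers get a future back.
    private final ExecutorService pool;

    BlockingCallWrapper(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Runs the blocking call on the pool and exposes the result asynchronously.
    <T> CompletableFuture<T> submit(Supplier<T> blockingCall) {
        return CompletableFuture.supplyAsync(blockingCall, pool);
    }

    void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) {
        BlockingCallWrapper wrapper = new BlockingCallWrapper(8);
        List<CompletableFuture<Integer>> results = new ArrayList<>();
        for (int x = 0; x < 1000; x++) {
            final int id = x;
            // Stand-in for a blocking library call such as a database query.
            results.add(wrapper.submit(() -> id * 2));
        }
        int sum = 0;
        for (CompletableFuture<Integer> f : results) {
            sum += f.join();
        }
        System.out.println("sum of results: " + sum);
        wrapper.shutdown();
    }
}
```

A thousand requests here occupy at most eight OS threads; the rest wait in the pool's queue.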

pron · on May 16, 2014

The short answer is that you can't just use any blocking code. The Comsat project (https://github.com/puniverse/comsat) contains integrations of popular libraries with Quasar fibers, without change to their API. Under the hood, this is done using transformations made available in the FiberAsync class. Like eitany said, you can use FiberAsync to integrate the library yourself, or wait for an integration module in Comsat.

If the library provides asynchronous APIs, it will yield great performance when integrated with Quasar fibers. If not (like JDBC), it will work as well as it does on regular threads, but won't interfere with all the other great stuff fibers can do.

zinxq · on May 15, 2014

Agreed - for example a (canonical example) chat application (long lived connections, small amounts of data). But that's still not performance, that's then scalability? As in "how many active chat sessions could one server handle?".

The opposite canonical example being a static, small-file web server. Short connections. How many files can you serve per second?