All the points are valid but they are peculiar to Java, not to all managed high-...

kg · on Jan 12, 2012

It's also worth pointing out that the C# port of SQlite omits using certain C mechanisms (like pointers) in favor of passing copies of byte arrays around. You'd expect this to make it slower, but in many cases, it doesn't! (C# supports pointers, but the port doesn't use them so that it'll work in limited environments like Silverlight)

alexchamberlain · on Jan 12, 2012

This is slightly off topic, but a good read... http://cpp-next.com/archive/2009/08/want-speed-pass-by-value...

gaius · on Jan 12, 2012

The .NET runtime is well optimized for that sort of thing, as it needed to be for F#.

mdeicaza · on Jan 12, 2012

Additionally, C# collections do not need to box.

List<int> uses an actual array of integers as its backing store instead of an array of objects that contain boxed copies of the integers.

There is a port of JGit to .NET called NGit, which is what we use for MonoDevelop.

The port is maintained with an automatic tool that converts Java code to C# code, you can find it here:

https://github.com/slluis/ngit

ot · on Jan 13, 2012

> List<int> uses an actual array of integers as its backing store instead of an array of objects that contain boxed copies of the integers.

Yes, that's what I meant with specialized collections :)

BTW, I didn't know about the awesome NGit, automatic conversion from Java is really impressive!

erichocean · on Jan 12, 2012

Although Sqlite is highly optimized, it's got what amounts to its own VM internally, and doesn't really use C to the fullest.

It's not surprising to me that the C# version of that VM does as well (or better) for this particular codebase.

oconnor0 · on Jan 12, 2012

Can you explain more what you mean about SQLite having roughly its own VM?

erichocean · on Jan 12, 2012

See: http://www.sqlite.org/different.html

Money quote about half-way down the page: "SQL statements compile into virtual machine code"

Every SQL database engine compiles each SQL statement into some kind of internal data structure which is then used to carry out the work of the statement. But in most SQL engines that internal data structure is a complex web of interlinked structures and objects. In SQLite, the compiled form of statements is a short program in a machine-language like representation. Users of the database can view this virtual machine language by prepending the EXPLAIN keyword to a query.

The use of a virtual machine in SQLite has been a great benefit to the library's development. The virtual machine provides a crisp, well-defined junction between the front-end of SQLite (the part that parses SQL statements and generates virtual machine code) and the back-end (the part that executes the virtual machine code and computes a result.) The virtual machine allows the developers to see clearly and in an easily readable form what SQLite is trying to do with each statement it compiles, which is a tremendous help in debugging. Depending on how it is compiled, SQLite also has the capability of tracing the execution of the virtual machine - printing each virtual machine instruction and its result as it executes.

itmag · on Jan 13, 2012

Off-topic but relevant: I've been looking for info on just how database software does what it does. Ie from parsing the SQL query to hitting the disk, and everything in between. Anyone got any links or books?

kragen · on Jan 13, 2012

Gray and Reuter's Transaction Processing is absolutely superb, and it covers a good part of this stack, everything lower-level than query planning. However, it covers every approach to doing this. If you want to know how Postgres, say, does it, that's a lot less information, and may be more digestible.

itmag · on Jan 13, 2012

Thanks. I am interested in the general theory of how every database does it; it bothers me that this area of software engineering is terra incognita to me.

andrewf · on Jan 13, 2012

The sqlite source is well put together and pretty small. The (now pretty much obsolete) sqlite 2.x source was a lot smaller and might be a better starting point for pure learning purposes.

Bonus: in the sqlite command line tool, putting "EXPLAIN " before a query will dump out the VM commands that the query was converted to, without executing them.

csl · on Jan 13, 2012

Any decent textbook on databases would cover most of this, with the possible exception of parsing SQL. It will likely have an in-depth discussion of data structures to reduce disk access.

I strongly recommend you read one such book more or less end-to-end.

HN: As with other areas of computer science, there is often a "canonical" book, like TAOCP or CLRS on algorithms. Is there any such book for databases?

sqlwiz · on Jan 13, 2012

Nothing speaks like the source :) I recommend: http://www.postgresql.org/ftp/source/v9.1.2/

rogerbinns · on Jan 13, 2012

There are also details and examples at http://www.sqlite.org/vdbe.html

Things have changed a bit since then, but not much. SQLite's API is very different than regular databases because it is a library operating in the same process. In particular it does not calculate all result rows for a query up front (that wouldn't be very 'Lite') but instead calculates the next matching row as you ask for it.

Consequently the internals have to be able to record their state, return a row, and then resume from that state to get the next matching row. There is a also a fair amount of query optimisation that goes on, which again means the need for expressing queries in a variety of different building blocks. Combine the state machine with building blocks and you have a special purpose VM.

itmag · on Jan 13, 2012

In particular it does not calculate all result rows for a query up front (that wouldn't be very 'Lite') but instead calculates the next matching row as you ask for it.

That sounds like SQL cursors?

rogerbinns · on Jan 14, 2012

If you squint really hard you could make that case, but in SQLite they are really not the same. You cannot use SQL syntax and you can't do other operations on cursors other than read the columns for the matching row.

Virtually all other database engines calculate the query results up front. It is more effective in their implementations to do it that way. For example they will also calls to ask how many rows remain in the results. SQLite has no such API and the only way to find out is to actually retrieve each result row.

stcredzero · on Jan 12, 2012

All the points are valid but they are peculiar to Java, not to all managed high-level languages. C#/.NET, for example, have unsigned types, value-type arrays and structs, memory mapped files and specialized collections.

I heard about related stuff happening while I was working at a Smalltalk vendor. The programmer who was implementing the network cryptography library would just call up the VM engineer and ask for goodies like support for large bit arrays, and he'd get them in for the next VM release. The programmer was able to beat some of RSA data security's (poorly implemented) reference DLL's written in C by 3% with a Smalltalk program.

In the Smalltalk environments, it's easy to implement "primitives" coded in C, just in case you weren't internal to the vendor. With open VMs like Squeak, you can add your own bytecode to support optimizations if you want to.

zokier · on Jan 12, 2012

Slightly offtopic, but I wonder how much overhead in those benchmarks comes from calling native code from .NET runtime? An interesting data point could be benchmarking equivalent implementation in C or C++, avoiding the overhead of native-managed transition.

ot · on Jan 13, 2012

In my experience, interfacing native code via C++/CLI has negligible overhead, unless arguments are big and complex data types, which have to be converted to .NET types.

Otherwise, unsafe regions have practically zero overhead, and they are close enough to the metal.