All the points are valid but they are peculiar to Java, not to all managed high-level languages. C#/.NET, for example, have unsigned types, value-type arrays and structs, memory mapped files and specialized collections.
As an example, the C# port of Sqlite is sometimes faster than the C version on queries, although updates are slower, despite Sqlite is a highly optimized C library.
It's also worth pointing out that the C# port of SQlite omits using certain C mechanisms (like pointers) in favor of passing copies of byte arrays around. You'd expect this to make it slower, but in many cases, it doesn't! (C# supports pointers, but the port doesn't use them so that it'll work in limited environments like Silverlight)
Money quote about half-way down the page: "SQL statements compile into virtual machine code"
Every SQL database engine compiles each SQL statement into some kind of internal data structure which is then used to carry out the work of the statement. But in most SQL engines that internal data structure is a complex web of interlinked structures and objects. In SQLite, the compiled form of statements is a short program in a machine-language like representation. Users of the database can view this virtual machine language by prepending the EXPLAIN keyword to a query.
The use of a virtual machine in SQLite has been a great benefit to the library's development. The virtual machine provides a crisp, well-defined junction between the front-end of SQLite (the part that parses SQL statements and generates virtual machine code) and the back-end (the part that executes the virtual machine code and computes a result.) The virtual machine allows the developers to see clearly and in an easily readable form what SQLite is trying to do with each statement it compiles, which is a tremendous help in debugging. Depending on how it is compiled, SQLite also has the capability of tracing the execution of the virtual machine - printing each virtual machine instruction and its result as it executes.
Off-topic but relevant: I've been looking for info on just how database software does what it does. Ie from parsing the SQL query to hitting the disk, and everything in between. Anyone got any links or books?
Gray and Reuter's Transaction Processing is absolutely superb, and it covers a good part of this stack, everything lower-level than query planning. However, it covers every approach to doing this. If you want to know how Postgres, say, does it, that's a lot less information, and may be more digestible.
Thanks. I am interested in the general theory of how every database does it; it bothers me that this area of software engineering is terra incognita to me.
The sqlite source is well put together and pretty small. The (now pretty much obsolete) sqlite 2.x source was a lot smaller and might be a better starting point for pure learning purposes.
Bonus: in the sqlite command line tool, putting "EXPLAIN " before a query will dump out the VM commands that the query was converted to, without executing them.
Any decent textbook on databases would cover most of this, with the possible exception of parsing SQL. It will likely have an in-depth discussion of data structures to reduce disk access.
I strongly recommend you read one such book more or less end-to-end.
HN: As with other areas of computer science, there is often a "canonical" book, like TAOCP or CLRS on algorithms. Is there any such book for databases?
Things have changed a bit since then, but not much. SQLite's API is very different than regular databases because it is a library operating in the same process. In particular it does not calculate all result rows for a query up front (that wouldn't be very 'Lite') but instead calculates the next matching row as you ask for it.
Consequently the internals have to be able to record their state, return a row, and then resume from that state to get the next matching row. There is a also a fair amount of query optimisation that goes on, which again means the need for expressing queries in a variety of different building blocks. Combine the state machine with building blocks and you have a special purpose VM.
In particular it does not calculate all result rows for a query up front (that wouldn't be very 'Lite') but instead calculates the next matching row as you ask for it.
If you squint really hard you could make that case, but in SQLite they are really not the same. You cannot use SQL syntax and you can't do other operations on cursors other than read the columns for the matching row.
Virtually all other database engines calculate the query results up front. It is more effective in their implementations to do it that way. For example they will also calls to ask how many rows remain in the results. SQLite has no such API and the only way to find out is to actually retrieve each result row.
All the points are valid but they are peculiar to Java, not to all managed high-level languages. C#/.NET, for example, have unsigned types, value-type arrays and structs, memory mapped files and specialized collections.
I heard about related stuff happening while I was working at a Smalltalk vendor. The programmer who was implementing the network cryptography library would just call up the VM engineer and ask for goodies like support for large bit arrays, and he'd get them in for the next VM release. The programmer was able to beat some of RSA data security's (poorly implemented) reference DLL's written in C by 3% with a Smalltalk program.
In the Smalltalk environments, it's easy to implement "primitives" coded in C, just in case you weren't internal to the vendor. With open VMs like Squeak, you can add your own bytecode to support optimizations if you want to.
Slightly offtopic, but I wonder how much overhead in those benchmarks comes from calling native code from .NET runtime? An interesting data point could be benchmarking equivalent implementation in C or C++, avoiding the overhead of native-managed transition.
In my experience, interfacing native code via C++/CLI has negligible overhead, unless arguments are big and complex data types, which have to be converted to .NET types.
Otherwise, unsafe regions have practically zero overhead, and they are close enough to the metal.
As an example, the C# port of Sqlite is sometimes faster than the C version on queries, although updates are slower, despite Sqlite is a highly optimized C library.
EDIT: link http://code.google.com/p/csharp-sqlite/wiki/Benchmarks