Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I can't wait for direct I/O (now behind debug_io_direct setting).


Curious what your use case is for wanting direct_io? Every DBA I've ever worked with when setting up a new database, the first thing they want to do is enable direct io. My worst experience was with IBM DB2 mounted over NFS talking to netapp. Performance complaints would come from customers and land on the CEO's desk. He'd go to the software team and tell them to fix it. They'd say the DB is slow. Then he'd go to the DBAs and tell them to tune the DB. They'd say there's nothing more to do, we need faster disks. So he'd end up in front of me on the sysop team asking if we had any faster disks laying around (we didn't and buying more wasn't in the budget).

Since it was NFS, you could just use tcpdump and watch what DB2 was doing on the wire. It was happily poking away sending and receiving packets all 1K in size (the current configured DB block size) with peak read and and write speeds of about 11MB/s. Since the DBAs didn't want to change settings on a production DB, I set up a testing environment, begged them to play with the direct io and block size settings on this new instance and figure out the best performance. When I checked back days later, it was set up exactly the same, "we follow best practices, use 1K block size and force direct io".

I ended up creating a VM under the guise of "we need a data warehouse" with 1/4 the cpus and ram as the DB2 machines and installed postgresql 9.2. Did a minimum amount of tuning, mostly just turning off fsync for WAL writes, then spent a week filling it up with 5TB of data and 15 billion rows from the production DB. Ran one of our analytic queries that had grown to taking 30 hours on DB2, it ran in 6 hours. The packet sizes over NFS were 32-64MB in size and getting peak speeds of 180-220MB/s on the wire.


The 1k packets you saw probably correspond to the default block size being used for the DB, that is a vestige of using spinning disks. That you were using NFS or any kind of networked filesystem is what I'd say is a performance hostile environment. Did no one think of just not using NFS?


This was before 2012, AWS did not exist. The company had to find rackspace in a data center. Which we couldn't. One of the funding customers "loaned" us a couple of slots in their on premise data center which fit only a bladecenter and single netapp. NFS had advantages, you could dynamically resize live mount points, etc. Plus as I mentioned, using postgresql did away with our performance issues.


> Did a minimum amount of tuning, mostly just turning off fsync for WAL writes

That is not something I would suggest to people on production systems, as that would give you a good chance of data loss when the system halts. So, out of interest, were there any circumstances why turning off WAL fsync was considered a good choice in your situation?


I probably meant to say "synchronous_commit", which is how data is written to disk from the WAL. If you want full data guarantees, with regular hard drives, you'd be looking at less than 200 transactions per second. You set synchronous_commit to off, and suddenly you can do 10k transactions per second. You can tune when the WAL gets flushed to disk based on time and/or size. So you can set the amount of recent data loss you are comfortable with.

From the docs: "setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. So, turning synchronous_commit off can be a useful alternative when performance is more important than exact certainty"


> If you want full data guarantees, with regular hard drives, you'd be looking at less than 200 transactions per second.

That sounds like an anecdote from the time before SSDs.

> "setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. "

If your application is told that the transactions were committed, but your DB actually hasn't, you got a problem, methinks.


Synchronous replication exists to avoid more of the transactions committed problem.

Thing is DB work is all about tradeoffs, so you'll never be free of them.

As the parent explained, PostgreSQL actually lets you make a very good trade: you get full cached level performance but lose only a couple of transactions at most, instead of data corruption. If that's not ok for you, the leash can be tightened to lose nothing, but then your hardware had better keep up with your platform.


> That sounds like an anecdote from the time before SSDs.

Yes, I stated "regular hard drives" :)


Re-read their post. It was a secondary system setup to just run these analytics which were loaded from the production system. No issues if the whole machine had to be rebuilt.


I'm thinking lower resource usage (no double caching of data), shorter path to data so I would expect fewer bad things might happen during commit, and better performance in terms of transactions per second. Otherwise, I can't explain it any better than one of the lead developers himself: https://www.postgresql.org/message-id/20210223100344.llw5an2...


The new debug_io_direct flag only triggers direct IO in very limited cases, and is only tangentially related to the AIO patchset discussed in that thread.

Note that the documentation on the config flag explicitly warns about not using it in production:

> Currently this feature reduces performance, and is intended for developer testing only.

Also note that very few things will actually do IO during commit - the only IO that I can think of are 1.) the WAL-logging of the commit (often small, a few 100 bytes at most), and 2.) replying to the COMMIT command (10s of bytes at most). It is quite unlikely that this will see much performance benefit from IO_DIRECT without further infrastructure inside PostgreSQL around io_uring and other async kernel IO apis.


I know, but it's still nice to see progress in this area. Even PG17 would probably be too early to expect this work to be finished.


Yes when direct io is brought up, it is usually followed with the "double buffering" argument. That is valid, but only if your disk speeds are well above 1,000MB/s. Outside of hardware like that, you are always going to be waiting on disk.

From Linus himself[0]

"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances"

[0] https://lkml.org/lkml/2002/5/11/58


Have you read the entire thread you linked? People explained to Linus why direct IO is important to them. Besides, it's 20 years old and there were even no SSDs back then. The lack of direct IO in PG was one of (one of) the reasons why Uber moved to MySQL (https://www.uber.com/en-PL/blog/postgres-to-mysql-migration/, "The Buffer Pool" Section). With buffered IO you will likely store a lot of the same data in memory twice- once in DB's memory and then in page cache. Now you can just give the memory used by page cache directly to DB, because it knows better what and when it needs.


The thread is large, maybe I read the whole thing at one time, but the point of it is literally someone asking why using direct io is slower. That is the point I make, you can't just enable direct io (which by passes all the logic to speed up reads and writes) and expect increased performance without a lot of extra up front work.

But back to my first question, I am still curious what your workload is that you feel will benefit from direct io :)


As others have written, this work entails implementing asynchronous IO. One system I worked with that was more heavily loaded than the usual intranet CRUD was an SMS lottery run by a radio station. They had "bonus rounds" that lasted a few minutes. When the host said "now send!" we were receiving a huge amount of data at the same time. However, it worked for years on Postgres 7.4 and regular hard drives :)


I suspect Linus' position has radically changed with io-uring and proliferation of NVMe storage devices.

I use io-uring with O_DIRECT at work and the performance graphs of TLB pressure are beautiful.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: