
msync lets you force a flush so you can control the latest possible moment for a writeout. But the OS can flush before that, and you have no way to detect or control that. So you can only control the late side of the timing, not the early side. And in databases, you usually need writes to be persisted in a specific order; early writes are just as harmful as late writes.
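To make the "late side only" point concrete, here is a minimal sketch in Python, where `mmap.flush()` is the msync wrapper on Unix. The helper name `write_and_flush`, the 4096-byte mapping size, and the temp file are assumptions for illustration, not anything from a real database.

```python
import mmap
import os
import tempfile

PAGE_SIZE = 4096  # assumed mapping size for the demo


def write_and_flush(path, payload):
    """Write payload through a shared mapping, then force it to disk."""
    with open(path, "r+b") as f:
        f.truncate(PAGE_SIZE)  # make sure the file backs the whole mapping
        with mmap.mmap(f.fileno(), PAGE_SIZE) as m:
            m[:len(payload)] = payload
            # From here on the kernel is free to write the dirty page back
            # at any moment; nothing tells us whether it already has.
            m.flush()  # upper bound only: the data is written out by now


# Demo on a throwaway file.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
write_and_flush(demo_path, b"commit-record")
with open(demo_path, "rb") as f:
    readback = f.read(13)
os.remove(demo_path)
```

The flush call gives a deadline, not a barrier: there is no hook between the assignment and the flush that would let you hold the page back until some other write has landed first.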


I'd even take a memory ordering guarantee, something like, within each page, data is read out sequentially as atomic aligned 64-bit reads with acquire ordering. (Though this probably is what you get on AMD64.) As-is, there's not even a guarantee against an atomic aligned write being torn when written out.


That is absolutely not what you actually get from the hardware.

For fun: there is no guarantee about the order in which the parts of a page get written. SQLite documents that they assume (but cannot verify) that _sector_ writes are linear, but not atomic. https://www.sqlite.org/atomiccommit.html

> If a power failure occurs in the middle of a sector write it might be that part of the sector was modified and another part was left unchanged. The key assumption by SQLite is that if any part of the sector gets changed, then either the first or the last bytes will be changed. So the hardware will never start writing a sector in the middle and work towards the ends. We do not know if this assumption is always true but it seems reasonable.

You are talking several levels higher than that, at the page level (composed of multiple sectors).

Assume that they reside in _different_ physical locations, and are written at different times. That's fun.


Every HDD since the 1980s has guaranteed atomic sector writes:

> Currently all hard drive/SSD manufacturers guarantee that 512 byte sector writes are atomic. As such, failure to write the 106 byte header is not something we account for in current LMDB releases. Also, failures of this type should result in ECC errors in the disk sector - it should be impossible to successfully read a sector that was written incorrectly in the ways you describe.

Even in extreme cases, the probability of failure to write the leading 128 out of 512 bytes of a sector is nearly nil - even on very old hard drives, before 512-byte sector write guarantees. We would have to go back nearly 30 years to find such a device, e.g.

https://archive.org/details/bitsavers_quantumQuaroductManual...

Page 23, Section 2.1 "No damage or loss of data will occur if power is applied or removed during drive operation, except that data may be lost in the sector being written at the time of power loss."

From the specs on page 15, the data transfer rate to/from the platters is 1.25MB/sec, so the time to write one full sector is 0.4096ms; the time to write the leading 128 bytes of the sector is thus 1/4 of that: 0.10ms. You would have to be very very unlucky to have a power failure hit the drive within this .1ms window of time. Fast-forward to present day and it's simply not an issue.
^ above quoted from https://lists.openldap.org/hyperkitty/list/openldap-devel@op...
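The timing figures in the quote can be checked with a few lines of arithmetic. This is only a sketch, assuming 1.25MB/sec means 1.25 × 10^6 bytes per second and a 512-byte sector; the variable names are mine.

```python
SECTOR_BYTES = 512
HEADER_BYTES = 128           # the leading part of the sector discussed in the quote
TRANSFER_RATE = 1.25e6       # bytes/sec, assuming decimal megabytes

# Time to stream one full sector to the platter, in milliseconds.
sector_ms = SECTOR_BYTES / TRANSFER_RATE * 1000

# The leading 128 bytes are 1/4 of the sector, so 1/4 of the time.
header_ms = sector_ms * HEADER_BYTES / SECTOR_BYTES

print(f"{sector_ms:.4f} ms per sector, {header_ms:.4f} ms for the first 128 bytes")
```

That gives 0.4096 ms for the sector and about 0.1024 ms for the leading 128 bytes, which matches the quote's rounded 0.10ms.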


Doesn't help when you work with pages :-)

Assume 512-byte sectors (I know those are rare), but I don't think there are any guarantees that a 4KB page would be:

* Written atomically
* Written in a particular order
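To show why atomic sectors don't save you at the page level, here is a toy model, assuming each 512-byte sector is atomic but the eight sectors of a 4KB page can reach disk independently and in any order. The function name `possible_page_states` is made up for the example; no real device is being simulated.

```python
import itertools

SECTOR = 512
PAGE = 4096
SECTORS_PER_PAGE = PAGE // SECTOR  # 8


def possible_page_states(old, new):
    """Yield every on-disk page image reachable after a crash mid-write,
    given atomic sectors that persist independently of each other."""
    for mask in itertools.product((False, True), repeat=SECTORS_PER_PAGE):
        yield b"".join(
            (new if written else old)[i * SECTOR:(i + 1) * SECTOR]
            for i, written in enumerate(mask)
        )


old = bytes([0]) * PAGE  # page contents before the write
new = bytes([1]) * PAGE  # page contents we intended to write

states = set(possible_page_states(old, new))
torn = [s for s in states if s not in (old, new)]
print(len(states), "reachable images,", len(torn), "of them torn")
```

Under these assumptions only 2 of the 256 reachable images are a clean old or new page; the other 254 are a mix of both, even though no individual sector was ever torn.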


Even memory ordering guarantees within sector boundaries are sufficient, and something the kernel could provide on its own.


Also doesn't help when you are running on virtual / networked hardware. Nothing ensures that what you think is a sector write would actually align properly with the hardware.


The design and guarantees of the virtualized hardware provide that guarantee. I've worked on several such products. They all guarantee atomic sector writes (typically via copy-on-write).



