
Note that the SQL Server team at Microsoft had the Windows folks add write-through and cache control to NTFS so that the database could have stability guarantees.

It's mind boggling that this is still an issue, given the number of decades we've had applications (databases, and other things) that need to efficiently guarantee bits are on stable storage.



It shouldn't be mind boggling, given that this guarantee is impossible to achieve. A little bit better or worse is not that big of a deal, still very far from 100%.


Er, no. I'm using a storage system right now that makes this guarantee; when the store responds to a block write, it's going to be stable. The implementation in question uses battery-backed RAM for performance, but that's just an implementation detail.

This is definitely not impossible. It's not even that difficult (and yes, I've written a number of transactional storage systems over the past 30 years).


How have you tested the failure modes?

I once worked with a battery-backup disk controller that had redundant failover to a duplicate unit. It worked great for many months with flawless failover. Except when one of the internal batteries died, the whole system deadlocked and the 24/7 nonstop cluster was down for days.

The more layers, the more unpredictable the failure modes.


But did it break its storage guarantees? Sounds to me like freezing the storage if the redundancy can no longer be guaranteed due to a bad battery may be the only way to keep the promise.


Failover doesn't mean that the system fails when you transfer load over.


> when the store responds to a block write, it's going to be stable.

That's the thing: you are communicating with your storage controller, and these are just promises from your controller, not guarantees. When you try to read the data back sometime in the future, there is no guarantee that you will succeed. A lot can happen between your controller reporting a successful write and you retrieving the data: software bugs and false promises, hardware problems, operational problems, disasters, etc. There is a limit to how high a probability of data retention a typical single server in a typical server room can achieve.


There are always implied assumptions, like assuming that there are no bugs in the kernel drivers themselves, and assuming that memory is not damaged.

Saying "we cannot achieve anything because there might be firmware bugs" is technically true, but completely counter-productive. Adding "A little bit better or worse is not that big of a deal" is just bad engineering -- can you imagine a doctor saying, "you might get hit by a car at any time, so I decided it is not worth it to heal you"?


We can achieve a lot on top of unreliable components, buggy kernels, and buggy firmware. We can even make unbelievably reliable systems. But we can't if we assume we can rely on unreliable components. That is what bad engineering is.


In this case, that stability is a promise from a SAN, a commercial product that has a very good reliability track record. We're pretty confident that the data we write is going to be stable, unless there's a physical disaster like a fire . . . which is why we write to multiples of these, which are physically distributed, have staggered software update schedules, etc. etc.

You can still lose everything; you can't control all failure modes. But you can plan for and protect against common disasters.

The original discussion was about the lack of OS support for proper flushing and making data stable. I think the lack of decent support for this is a decades-long travesty; claims that the lack of this functionality doesn't matter because "there might be firmware bugs in the storage system, so why bother?" are specious and unhelpful.


The support for flushing and making data stable has been around for a long time.

The problem is that using that support kills performance, so apps often don't.
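For anyone who hasn't seen what "using that support" looks like, here is a minimal sketch on a POSIX-style system, assuming Python's standard os module. The function name durable_write and the ".tmp" suffix are made up for illustration. Note that fsync is only a request passed down the stack; as discussed elsewhere in this thread, the drive or controller can still acknowledge it without the data being on stable media.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path, returning only after asking the OS to flush it.

    Hypothetical helper for illustration; not taken from any project.
    """
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # assumed naming convention for the temp file

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file contents to the device.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Atomically swap the new file into place.
    os.replace(tmp_path, path)

    # Flush the directory entry as well, where the platform allows
    # opening a directory for this purpose.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "record.bin")
    durable_write(target, b"committed")
```

Every call pays for at least two flushes (file and directory), which is the performance cost being complained about: the app waits for the device instead of for the page cache.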


It's definitely not an implementation detail. If the RAM fails somehow, it has lied about that data being durably stored.

But yes, it's absolutely possible for all of this to be vertically integrated, and agreed that it's ridiculous that it hasn't happened. Most systems shouldn't care, but the ones that do care really really care.


If the disk fails somehow it has lied as well. There is no such thing as guaranteed anything, you can use parity and striping to better your chances but that's not storage media dependent.


Disks, both spinning and SSD, lie about data durability more often than you would think. This is particularly a problem with consumer disks and write-back caches. Postgres has a good state of play writeup on reliably writing data to storage in their "Reliability and the Write-Ahead Log" documentation [1].

My rule of thumb is that if I want serious reliability I make sure my data winds up on three different pieces of hardware and includes CRC codes. So, basically ZFS or some cloud/cluster equivalent like S3.

On the other hand, for day to day work on your dev machine, the drives are so crazy reliable that you just don't worry about it. Which, of course, can bite you if you translate your day to day experience to production at scale. Everything breaks at scale.

[1] https://www.postgresql.org/docs/devel/wal-reliability.html
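A rough sketch of the "three pieces of hardware plus CRC codes" rule of thumb, assuming three directories stand in for three independent devices. The record layout (a 4-byte CRC32 prefix) and the function names are invented for illustration; this is not how ZFS or S3 do it internally.

```python
import os
import struct
import tempfile
import zlib

# Assumed record layout: 4-byte little-endian CRC32, then the payload.
HEADER = struct.Struct("<I")


def write_replicas(dirs, name, payload: bytes) -> None:
    """Store payload with a CRC prefix in every directory in dirs."""
    record = HEADER.pack(zlib.crc32(payload)) + payload
    for d in dirs:
        with open(os.path.join(d, name), "wb") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # ask each "device" to make it stable


def read_verified(dirs, name) -> bytes:
    """Return the payload from the first replica whose CRC checks out."""
    for d in dirs:
        try:
            with open(os.path.join(d, name), "rb") as f:
                record = f.read()
        except OSError:
            continue  # replica missing or unreadable, try the next one
        if len(record) < HEADER.size:
            continue  # truncated record
        (expected,) = HEADER.unpack_from(record)
        payload = record[HEADER.size:]
        if zlib.crc32(payload) == expected:
            return payload
    raise IOError("no replica passed its CRC check")


if __name__ == "__main__":
    replicas = [tempfile.mkdtemp() for _ in range(3)]
    write_replicas(replicas, "blob", b"important bytes")
    print(read_verified(replicas, "blob"))
```

The checksum is what turns a silently lying disk into a detectable failure; the extra copies are what let you recover from it.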


We need a "falsehoods programmers believe about disk reliability".


> three different pieces of hardware

This. Data is real when it has been acked by a quorum with independent power. Three is because two doesn't give you enough slack to maintain it.
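The arithmetic behind "two doesn't give you enough slack", assuming a simple majority quorum (my assumption; other quorum rules exist). The function names are illustrative only.

```python
def majority(n: int) -> int:
    """Smallest number of acks that is more than half of n replicas."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many replicas can be down while a majority is still reachable."""
    return n - majority(n)


def is_durable(acks: int, n: int) -> bool:
    """Treat a write as real once a majority of the n replicas acked it."""
    return acks >= majority(n)


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, "replicas: quorum", majority(n),
              "tolerates", tolerated_failures(n), "failures")
```

With two replicas the quorum is both of them, so a single failure blocks writes. With three, the quorum is two and you can lose one node and keep going.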


> The implementation in question uses battery-backed RAM for performance, but that's just an implementation detail.

Can you give more details on this? (What is the solution, and what does it cost? What kind of computer do you use it on? How large is the memory?) Very curious.


It's a common technique in SAN controllers, where writes go to a battery-backed cache, and are then written to the actual storage devices (using variations of RAID) at the controller's leisure. The limited space in the immortal DRAM provides enough buffer to give the host near immediate responses (sub-sub-millisecond over Fibre Channel), while the data is safe if the controller crashes or loses power (the data is written as part of the recovery process). This organization also lets the controller gather up related writes and be more efficient about them, rather than writing each block in isolation.

The battery-backed DRAM really is an implementation detail. Things will execute correctly if you don't have it, but it's usually a huge performance win.

In this case we're using Nimble SANs (Nimble is now owned by HPE and suffering customer service rot, oh well). The immortal DRAM is fairly small (8GB per controller head?). The storage arrays are petabyte scale, all flash, with many, many 16 and 32 Gbit Fibre Channel connections. Cost is a few million per SAN instance, of which you need several for real durability (and a replication scheme for remote storage, which I'm not going to discuss here).
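To make the write-back idea concrete, here is a toy model in Python. It is a sketch under heavy assumptions, not how Nimble or any real controller is implemented: a dict stands in for the slow RAID set, another dict stands in for the battery-backed DRAM, and the class and method names are made up.

```python
class WriteBackCache:
    """Toy model of a controller with a small non-volatile write cache."""

    def __init__(self, backing: dict, capacity: int = 4):
        self.backing = backing    # stands in for the slow storage devices
        self.nv_cache = {}        # stands in for battery-backed DRAM
        self.capacity = capacity  # number of blocks the cache can hold

    def write(self, block: int, data: bytes) -> str:
        # Make room first if a new block would overflow the cache.
        if block not in self.nv_cache and len(self.nv_cache) >= self.capacity:
            self.flush()
        # Repeated writes to one block overwrite the cached copy, so only
        # the newest version is ever destaged to the backing store.
        self.nv_cache[block] = data
        return "ack"  # the host sees completion before the slow write

    def read(self, block: int):
        # Cached data is newer than whatever the backing store holds.
        if block in self.nv_cache:
            return self.nv_cache[block]
        return self.backing.get(block)

    def flush(self) -> None:
        # Destage in block order so neighbouring blocks go out together.
        for block in sorted(self.nv_cache):
            self.backing[block] = self.nv_cache[block]
        self.nv_cache.clear()

    def recover(self) -> None:
        # After a crash the cache contents are assumed to have survived,
        # so recovery is simply replaying them to the backing store.
        self.flush()


if __name__ == "__main__":
    disk = {}
    ctrl = WriteBackCache(disk, capacity=2)
    ctrl.write(7, b"old")
    ctrl.write(7, b"new")  # absorbed in cache, nothing on disk yet
    ctrl.recover()         # simulated restart: cache is replayed
    print(disk)
```

The whole scheme rests on the assumption inside recover(): that the cache really does survive the crash. That is the part the battery is there to guarantee.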


> The battery-backed DRAM really is an implementation detail. Things will execute correctly if you don't have it, but it's usually a huge performance win.

For anyone interested, especially the -N variant:

* https://en.wikipedia.org/wiki/NVDIMM

Back when ZFS was still new-ish, and SSDs were still expensive-ish (~2008), people were experimenting with using ZILs (ZFS Intent Log) on these types of devices:

* https://techreport.com/review/16255/acard-ans-9010-serial-at...

SSDs have come down in price since then, so people don't bother with RAM disks as much now.


Thank you so much for this very detailed reply. Very interesting.



