Note that the SQL Server team at Microsoft had the Windows folks add write-throu...

zzzcpan · on May 29, 2019

It shouldn't be mind boggling, given that this guarantee is impossible to achieve. A little bit better or worse is not that big of a deal, still very far from 100%.

kabdib · on May 29, 2019

Er, no. I'm using a storage system right now that makes this guarantee; when the store responds to a block write, it's going to be stable. The implementation in question uses battery-backed RAM for performance, but that's just an implementation detail.

This is definitely not impossible. It's not even that difficult (and yes, I've written a number of transactional storage systems over the past 30 years).

paulsutter · on May 29, 2019

How have you tested the failure modes?

I once worked with a battery-backed disk controller that had redundant failover to a duplicate unit. It worked great for many months, with flawless failover. Then one of the internal batteries died, the whole system deadlocked, and the 24/7 nonstop cluster was down for days.

The more layers, the more unpredictable the failure modes.

unilynx · on May 29, 2019

But did it break its storage guarantees? Sounds to me like freezing the storage if the redundancy can no longer be guaranteed due to a bad battery may be the only way to keep the promise.

ori_b · on May 29, 2019

Failover doesn't mean that the system fails when you transfer load over.

zzzcpan · on May 29, 2019

> when the store responds to a block write, it's going to be stable.

That's the thing: you are communicating with your storage controller, and these are just promises from your controller, not guarantees. When you try to read the data back sometime in the future, there is no guarantee that you will succeed. A lot can happen between your controller reporting a successful write and you retrieving the data: software bugs and false promises, hardware problems, operational problems, disasters, etc. There is a limit to how high a probability of data retention a typical single server in a typical server room can achieve.

theamk · on May 29, 2019

There are always implied assumptions, like assuming that there are no bugs in the kernel drivers themselves, and assuming that memory is not damaged.

Saying "we cannot achieve anything because there might be firmware bugs" is technically true, but completely counter-productive. Adding "A little bit better or worse is not that big of a deal" is just bad engineering -- can you imagine a doctor saying, "you might get hit by a car at any time, so I decided it is not worth it to heal you"?

zzzcpan · on May 29, 2019

We can achieve a lot on top of unreliable components, buggy kernels, and buggy firmware. We can make unbelievably reliable systems. But we can't if we assume we can rely on unreliable components. That is what bad engineering is.

kabdib · on May 29, 2019

In this case, that stability is a promise from a SAN, a commercial product that has a very good reliability track record. We're pretty confident that the data we write is going to be stable, unless there's a physical disaster like a fire . . . which is why we write to multiples of these, which are physically distributed, have staggered software update schedules, etc. etc.

You can still lose everything, you can't control all failure modes. But you can plan for and protect against common disasters.

The original discussion was about the lack of OS support for proper flushing and making data stable. I think the lack of decent support for this is a decades-long travesty; claims that the lack of this functionality doesn't matter because "there might be firmware bugs in the storage system, so why bother?" are specious and unhelpful.

caf · on May 30, 2019

The support for flushing and making data stable has been around for a long time.

The problem is that using that support kills performance, so apps often don't.
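For readers who have not used that support directly, here is a minimal sketch of what it tends to look like from an application on a POSIX system. The helper name `durable_write` is made up for illustration, and the whole thing assumes the kernel and the drive actually honor flush requests, which is exactly what this thread is arguing about.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`, flushing before returning.

    Hypothetical helper: durability still depends on the OS and the
    device honoring fsync.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # Python's userspace buffer -> kernel
            os.fsync(f.fileno())  # kernel page cache -> device (requested)
        os.replace(tmp_path, path)  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Flush the directory entry too, so the rename itself is on disk.
    # O_DIRECTORY is not available on every platform, hence the guard.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Each `os.fsync` call blocks until the device reports completion, which is where the performance cost comes from when it is done on every write.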

Groxx · on May 29, 2019

It's definitely not an implementation detail. If the RAM fails somehow, it has lied about that data being durably stored.

But yes, it's absolutely possible for all of this to be vertically integrated, and agreed that it's ridiculous that it hasn't happened. Most systems shouldn't care, but the ones that do care really really care.

zamadatix · on May 29, 2019

If the disk fails somehow it has lied as well. There is no such thing as guaranteed anything, you can use parity and striping to better your chances but that's not storage media dependent.

programd · on May 29, 2019

Disks, both spinning and SSD, lie about data durability more often than you would think. This is particularly a problem with consumer disks and write-back caches. Postgres has a good state-of-play writeup on reliably writing data to storage in their "Reliability and the Write-Ahead Log" documentation [1].

My rule of thumb is that if I want serious reliability I make sure my data winds up on three different pieces of hardware and includes CRC codes. So, basically ZFS or some cloud/cluster equivalent like S3.
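As a rough illustration of the "includes CRC codes" part (not how ZFS or S3 do it internally), a record can carry its own checksum so that a reader can tell a silently corrupted copy from a good one. The record layout and function names below are assumptions made up for this sketch.

```python
import struct
import zlib

# Hypothetical layout: 4-byte payload length, 4-byte CRC-32, then payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and CRC-32."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def decode_record(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if the copy is damaged."""
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    length, expected = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    if zlib.crc32(payload) != expected:
        raise ValueError("checksum mismatch")
    return payload


def read_first_good(replicas):
    """Return the payload from the first replica whose checksum verifies."""
    for blob in replicas:
        try:
            return decode_record(blob)
        except ValueError:
            continue  # damaged copy, try the next piece of hardware
    raise ValueError("no intact replica")
```

With three copies, `read_first_good` still succeeds when one or two of them are damaged, as long as one verifies.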

On the other hand, for day to day work on your dev machine, the drives are so crazy reliable that you just don't worry about it. Which, of course, can bite you if you translate your day to day experience to production at scale. Everything breaks at scale.

[1] https://www.postgresql.org/docs/devel/wal-reliability.html

ecnahc515 · on May 30, 2019

We need a "falsehoods programmers believe about disk reliability".

erik_seaberg · on May 30, 2019

> three different pieces of hardware

This. Data is real when it has been acked by a quorum with independent power. Three is because two doesn't give you enough slack to maintain it.
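A toy sketch of that rule, assuming each replica is represented by a callable that returns True once it has stored the data (the `quorum_write` name and the callable interface are invented here, not taken from any real system):

```python
from typing import Callable, Sequence


def quorum_write(replicas: Sequence[Callable[[bytes], bool]], data: bytes) -> bool:
    """Return True only if a strict majority of replicas acknowledged."""
    needed = len(replicas) // 2 + 1
    acks = 0
    for send in replicas:
        try:
            if send(data):
                acks += 1
        except OSError:
            pass  # an unreachable replica simply does not ack
    return acks >= needed
```

With three replicas the majority is two, so one replica can be down and writes still succeed. With two replicas the majority is also two, so a single failure blocks every write.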

logicallee · on May 29, 2019

>The implementation in question uses battery-backed RAM for performance, but that's just an implementation detail.

Can you give more details on this? (What is the solution, and what does it cost? What kind of computer do you use it on? How large is the memory?) Very curious.

kabdib · on May 30, 2019

It's a common technique in SAN controllers, where writes go to a battery-backed cache, and are then written to the actual storage devices (using variations of RAID) at the controller's leisure. The limited space in the immortal DRAM provides enough buffer to give the host near immediate responses (sub-sub-millisecond over Fibre Channel), while the data is safe if the controller crashes or loses power (the data is written as part of the recovery process). This organization also lets the controller gather up related writes and be more efficient about them, rather than writing each block in isolation.

The battery-backed DRAM really is an implementation detail. Things will execute correctly if you don't have it, but it's usually a huge performance win.
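To make the flow concrete, here is a toy model of the scheme described above. It is not how any real controller is implemented: the class and method names are invented, and the `nvram` dict is simply assumed to survive power loss.

```python
class WriteBackController:
    """Toy model: `nvram` stands in for the battery-backed cache,
    `disk` for the slow array behind it."""

    def __init__(self):
        self.nvram = {}  # block number -> data; assumed to survive power loss
        self.disk = {}   # persistent backing store

    def write(self, block: int, data: bytes) -> str:
        # Later writes to the same block overwrite the cached copy,
        # so only the newest version is ever destaged.
        self.nvram[block] = data
        return "ack"  # the host sees completion before the disk is touched

    def read(self, block: int):
        # The cache holds the newest data, so check it first.
        if block in self.nvram:
            return self.nvram[block]
        return self.disk.get(block)

    def destage(self) -> int:
        """Flush cached blocks to disk in block order; return how many."""
        count = len(self.nvram)
        for block in sorted(self.nvram):
            self.disk[block] = self.nvram[block]
        self.nvram.clear()
        return count

    def recover(self) -> int:
        """After a crash, write out whatever the battery preserved."""
        return self.destage()
```

Dropping the cache and writing straight to `disk` inside `write` would give the same results, only slower, which is the sense in which the cache is an implementation detail.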

In this case we're using Nimble SANs (Nimble is now owned by HP and suffering customer service rot, oh well). The immortal DRAM is fairly small (8GB per controller head?). The storage arrays are petabyte scale, all flash, with many, many 16 and 32 Gbit Fibre Channel connections. Cost is a few million per SAN instance, of which you need several for real durability (and a replication scheme for remote storage, which I'm not going to discuss here).

throw0101a · on May 30, 2019

> The battery-backed DRAM really is an implementation detail. Things will execute correctly if you don't have it, but it's usually a huge performance win.

For anyone interested, especially the -N variant:

* https://en.wikipedia.org/wiki/NVDIMM

Back when ZFS was still new-ish, and SSDs were still expensive-ish (~2008), people were experimenting with using ZILs (ZFS Intent Log) on these types of devices:

* https://techreport.com/review/16255/acard-ans-9010-serial-at...

SSDs have come down in price since then, so people don't bother with RAM disks as much now.

logicallee · on May 30, 2019

Thank you so much for this very detailed reply. Very interesting.