> when the store responds to a block write, it's going to be stable. That's the ...

theamk · on May 29, 2019

There are always implied assumptions, like assuming that there are no bugs in the kernel drivers themselves, and assuming that memory is not damaged.

Saying "we cannot achieve anything because there might be firmware bugs" is technically true, but completely counter-productive. Adding "A little bit better or worse is not that big of a deal" is just bad engineering -- can you imagine a doctor saying, "you might get hit by a car at any time, so I decided it is not worth it to heal you"?

zzzcpan · on May 29, 2019

We can achieve a lot on top of unreliable components, buggy kernels, and buggy firmware. We can even make unbelievably reliable systems. But we can't if we assume we can rely on unreliable components. That is what bad engineering is.

kabdib · on May 29, 2019

In this case, that stability is a promise from a SAN, a commercial product that has a very good reliability track record. We're pretty confident that the data we write is going to be stable, unless there's a physical disaster like a fire . . . which is why we write to multiples of these, which are physically distributed, have staggered software update schedules, etc. etc.

You can still lose everything; you can't control all failure modes. But you can plan for and protect against common disasters.

The original discussion was about the lack of OS support for proper flushing and making data stable. I think the lack of decent support for this is a decades-long travesty; claims that the lack of this functionality doesn't matter because "there might be firmware bugs in the storage system, so why bother?" are specious and unhelpful.

caf · on May 30, 2019

The support for flushing and making data stable has been around for a long time.

The problem is that using that support kills performance, so apps often don't use it.
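For illustration, here is a minimal Python sketch of what using that flushing support can look like on a POSIX system. The helper name `durable_write` is hypothetical, not something from this thread. It assumes the OS, filesystem, and device actually honor the flush request, which is exactly the assumption being argued about above.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path, then ask the OS to push it to stable storage.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Block until the OS reports the file contents as flushed.
        os.fsync(fd)
    finally:
        os.close(fd)

    # On POSIX, also flush the containing directory so the new
    # directory entry is requested to be durable as well.
    if os.name == "posix":
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Each call waits for the storage stack to acknowledge the flush, so calling something like this after every small write is where the performance cost comes from. Batching several records per `os.fsync` is one common way to trade latency for throughput.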