The airline industry's stellar safety record exists because investigators have repeatedly and consistently pushed past human error as the explanation for a crash.
Understanding that if a human made this mistake, another human will likely make the exact same mistake in the future, they push to understand _why_ an airline captain who should have known better made this error, and then try to correct for that.
If every crash was written off as "the captain should have known better", aviation would not be nearly as safe as it is.
Perhaps not all tech companies care to push past human error in each postmortem (or even have a proper formal process at all), but some are known to do just that. Etsy and Google are among the well-documented cases.
> This idea of digging deeper into the circumstance and environment that an engineer found themselves in is called looking for the “Second Story”. In Post-Mortem meetings, we want to find Second Stories to help understand what went wrong.
> Blameless postmortems are a tenet of SRE culture. … Blameless culture originated in the healthcare and avionics industries where mistakes can be fatal. These industries nurture an environment where every "mistake" is seen as an opportunity to strengthen the system. When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.
— Site Reliability Engineering — Postmortem Culture: Learning from Failure
Amazon is pretty strong on this front too; the most recent public example is the S3 outage and postmortem[1].
> Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. … We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.
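The safeguard described in that quote (refusing to remove capacity when doing so would take a subsystem below its minimum required level) can be sketched roughly like this. Everything here — the subsystem names, the minimums, the function — is a hypothetical illustration, not actual AWS tooling:

```python
# Minimal sketch of a "minimum required capacity" safeguard.
# Subsystem names and floor values are made up for illustration.
MIN_REQUIRED = {"index": 3, "placement": 2}

def remove_capacity(subsystem: str, current: int, to_remove: int) -> int:
    """Return the new capacity, or raise if the request would be unsafe."""
    floor = MIN_REQUIRED[subsystem]
    if current - to_remove < floor:
        # Fail fast instead of acting on a mistyped input.
        raise ValueError(
            f"refusing to drop {subsystem} below its minimum of {floor}"
        )
    return current - to_remove

print(remove_capacity("index", 10, 3))  # → 7
# A typo'd input ("13" instead of "3") now raises instead of
# taking the subsystem down:
# remove_capacity("index", 10, 13)      # raises ValueError
```

The point is the one the postmortem makes: you can't prevent every incorrect input, but the tool can be made to reject the ones that would violate a known invariant.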
While I agree generally with that sentiment, there's a key difference.
Airplane pilots are licensed, certified, trained, and regulated. There's a clear floor to who is allowed in the cockpit (barring extreme emergencies, e.g., incapacitation of a pilot).
By contrast, software is made available to pretty much the entire world. And it turns out that roughly two-thirds of all adults have "poor", "below poor", or no computer skills at all. In other words, the qualifications floor is nonexistent.
If you're designing a one-size-fits-all system, you've got to design for this. The results, I'd argue, are ... not particularly satisfactory.
I'm not saying "don't design with the user in mind", or that everything can be dismissed as user error. Rather, when your floor is zero, you're going to have a remarkably difficult challenge.
They do, or at least the big-scale places that know what they're doing do. When Gmail or AWS us-east-1 goes down for hours, nobody is fired: they figure out why it was allowed to happen and change procedures so it can't happen again.
They do. That's what UX is all about. Not just UI, not just making things possible or visually presentable, but actually making the experience as smooth and frictionless as possible.
UX is the bane of usability (at least as practiced). E.g., we've had professional "UX" designers seriously contend that "cancel" buttons should be made to not look like buttons. The joke at a certain large company was that if the UX designers had their way, every default button would be big and green and every other button would be invisible.
True usability is about helping people make the correct decision, not just being smooth and frictionless. (See Don Norman's classic example of fire escapes that send people into basements to be trapped.)
I'm a daily user of some of the most complicated engineering software on the market. Every single release promises "Easier to use!" but all they do is make the most basic beginner-level actions more prominent and hide all medium to advanced level important functionality behind 3 extra clicks.
The end result is a beginner can do a tutorial exercise in 20 minutes instead of 30, while any true day to day work in the software takes 4 hours instead of 2.
Do you think this tradeoff between easy-to-start-using and power-use-friendly is intrinsic, a scale that designers need to choose where they want their software in it, or is it possible to hit both ends? Does anyone have examples of things they think address both beginners and power users adequately?
True. But as with any other discipline, it's about thinking about the intent and final goal, not just following a rule book containing arbitrary rules.