Our sites have been down for more than 3 hours now. EDIT2: Now the databases are...

joshuak · on Nov 19, 2014

Major outages should absolutely weigh into your decisions about which platform to use. That said, you can mitigate the effect of instability by engineering your app to fail over to other availability zones, or even to another cloud platform (depending on your app), if the entire platform goes down.

Obviously there is a significant cost associated with engineering this level of cross-platform redundancy, which is why reliability is an important factor in making your platform choices. If you can tolerate some downtime, you can be more flexible; otherwise it will cost you one way or the other.

In any case, you should consider having a user notification site set up on a completely different service (or two), so that when things go wrong you can redirect everyone to that site to keep your customers informed. This is especially important when you have partial outages that could create inconsistencies in your database or application state if you were to continue to allow users to interact with it in a degraded state.

inglor · on Nov 19, 2014

Thanks! This is very helpful.

Our big site, which is hosted in Europe, is actually working, but our blogs and a news website are both down. We offer a paid service at $600 a year, and if the main site were down it would be very bad for our reputation.

Our DNS points to Azure on all these domains and things are hosted as "Azure Web Site" - how would notifications work if Azure itself is failing? Would I need to proxy the traffic through elsewhere?

Are there any services that solve this problem for me? I really don't mind paying a few dollars every month to not have to worry about this.

joshuak · on Nov 19, 2014

There are any number of uptime and ping services that you can google for. These can raise the alarm in a timely fashion when your site, or parts of your site, go down, and then you decide how to handle those issues.
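
To make the idea concrete, here is a minimal sketch of what such a ping check does, using only the Python standard library. The function names (`is_up`, `check_all`) and the 5-second timeout are my own assumptions for illustration, not any particular service's API, and a real service would probe from several locations and retry before alerting.

```python
import http.client
import urllib.error
import urllib.request


def is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx/3xx status within the timeout.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP 4xx/5xx
        # (urlopen raises HTTPError, a URLError subclass, for those).
        return False


def check_all(urls, probe=is_up):
    """Probe each URL once and return the ones that look down."""
    return [u for u in urls if not probe(u)]
```

You would run something like `check_all([...])` on a schedule from a machine that is not hosted on the platform being monitored, and send yourself an alert whenever the returned list is not empty.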

You may also want to google for DNS failover services, to help you automatically redirect traffic in more catastrophic failure cases. There are offerings from Google[1], AWS[2], and others.

[1]: https://cloud.google.com/dns/docs

[2]: http://aws.amazon.com/route53/

jfroma · on Nov 19, 2014

Our main cluster is on Azure West US, but we have another cluster on Amazon East with Route 53 on top of that. When the main cluster fails, Route 53 switches to the secondary, so we were not affected at all this time.

The only manual step was to delay the switch back until our VMs were working fine and had all their resources. We do this by changing the Route 53 health check to one that is always failing.
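
Roughly, the decision logic looks like the sketch below. This is plain Python that only models the behaviour; it makes no Route 53 calls, and the names (`FailoverState`, `hold_failback`, `next_active`) are hypothetical. The manual hold stands in for swapping in the always-failing health check.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    active: str = "primary"      # which cluster currently receives traffic
    hold_failback: bool = False  # manual override: keep traffic on secondary


def next_active(state, primary_healthy):
    """Decide which cluster should serve traffic after one health check."""
    # While the hold is set, the primary is treated as unhealthy no matter
    # what the probe reports, so traffic stays on the secondary.
    effective_healthy = primary_healthy and not state.hold_failback
    state.active = "primary" if effective_healthy else "secondary"
    return state.active
```

The point of the hold is that a primary that merely answers its health check again is not necessarily ready for full traffic, so the switch back stays a human decision.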

We also had to purge our crashed Mongo nodes because the journal was broken.

https://auth0.com/availability-trust/img/auth0-infrastructur...

barkingllama · on Nov 19, 2014

On site disaster recovery? Off site disaster recovery? Split your hosting between multiple providers?

It really depends on how much risk you're willing to accept, and how much that is worth to you. It can be quantified via revenue lost, but reputation is much harder to put a number on.

inglor · on Nov 19, 2014

If I had to quantify this - 3 hours * 3 people who can't work and publish posts + about a week of marketing costs for damaged rep (apologies, PR, ads for exposure). I'd say this cost us at least $1000 and probably north of $3000.

This is not the first time this has happened in the last two months (after a relatively reliable year). The problem is I'm not sure any other hosting provider would do any better.

freehunter · on Nov 19, 2014

So the question becomes, would putting a DR site on AWS or Google cost more than the $3000 this outage cost you? If the answer is no, wouldn't it be worth architecting to not put all of your eggs in one basket?

Be mad at the service provider if they don't live up to the number of nines they promised. Be mad at yourself if you expected more nines than they can deliver.
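
For a feel of what a given number of nines actually buys you, here is a quick back-of-the-envelope sketch. The function names are mine and the availability levels are only illustrative; check your provider's actual SLA for the promised figure and the period it is measured over.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted in the period at the given availability."""
    return period_hours * (1 - availability_percent / 100)


def achieved_availability(downtime_hours, period_hours):
    """Availability percentage actually delivered over the period."""
    return 100 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    for nines in (99.0, 99.9, 99.95, 99.99):
        print(f"{nines}% -> {allowed_downtime_hours(nines):.2f} h/year")
    # A single 3-hour outage in a 30-day month (720 hours):
    print(f"{achieved_availability(3, 30 * 24):.2f}%")
```

Three nines allows about 8.76 hours of downtime a year, and a single 3-hour outage in a 30-day month works out to roughly 99.58% for that month.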

duncans · on Nov 19, 2014

Look into Cloudflare. They can act as a kind of reverse proxy to keep static stuff online. Obviously doesn't help if the transactional part of the site/database goes down, but end users will see a friendly message rather than it timing out.

toomuchtodo · on Nov 19, 2014

As others have mentioned, multiple cloud providers, service checks, and withdrawing bad providers at DNS.

photorized · on Nov 19, 2014

Azure + AWS