
Our sites have been down for more than 3 hours now.

EDIT2: Now the databases are down, and this is costing us a lot of money. EDIT: Just went up again.

It would be great if anyone knows how to mitigate outages like this - what can I do to protect myself in the future? (Other than leaving Azure.)



Major outages should absolutely weigh into your decisions as to what platform to use. That being said, you can mitigate the effect of instability by engineering your app to fail over to other availability zones, or even to another cloud platform (depending on your app) if the entire platform goes down.

Obviously there is a significant cost associated with engineering this level of cross-platform redundancy, which is why reliability is an important factor in making your platform choices. If you can tolerate some downtime, you can be more flexible; otherwise it will cost you one way or the other.

In any case, you should consider having a user notification site set up on a completely different service (or two), so that when things go wrong you can redirect everyone to that site to keep your customers informed. This is especially important when you have partial outages that could create inconsistencies in your database or application state if you were to continue to allow users to interact with it in a degraded state.


Thanks! This is very helpful.

Our big site, which is hosted in Europe, is actually working, but our blogs and a news website are both down. We offer a paid service at $600 a year, and if the main site were down it would be very bad for our reputation.

Our DNS points to Azure on all these domains and things are hosted as "Azure Web Site" - how would notifications work if Azure itself is failing? Would I need to proxy the traffic through elsewhere?

Are there any services that solve this problem for me? I really don't mind paying a few dollars every month and not worry about this.


There are any number of uptime and ping services that you can google for. These can raise the alarm in a timely fashion when your site, or parts of your site, go down, and then you decide how to handle those issues.
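To make the idea concrete, here is a minimal sketch of what such a check does under the hood. It is not any particular vendor's product: the function names, the 5-second timeout, and the "three failures in a row" rule are all assumptions for illustration, and it would need to run from somewhere outside the platform being monitored.

```python
import urllib.error
import urllib.request


def is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx/3xx status within `timeout`.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers DNS failures, refused connections, timeouts,
        # HTTP 4xx/5xx (raised as HTTPError) and malformed URLs.
        return False


def should_alert(results, threshold=3):
    """Alert only after `threshold` consecutive failed checks.

    `results` is a list of booleans, oldest first. Requiring several
    failures in a row (an assumed policy) avoids paging on a single blip.
    """
    if len(results) < threshold:
        return False
    return not any(results[-threshold:])


# Hypothetical usage, e.g. from a cron job on a different provider:
#   history.append(is_up("https://example.com/health"))
#   if should_alert(history):
#       notify_someone()   # hypothetical helper: email, SMS, chat, ...
```

A hosted service adds the parts that are tedious to build yourself: checks from several regions, alert routing, and history.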

You may also want to google for DNS failover services, to help you automatically redirect traffic in more catastrophic failure cases. There are offerings from Google[1], AWS[2], and others.

[1]: https://cloud.google.com/dns/docs

[2]: http://aws.amazon.com/route53/


Our main cluster is on Azure West US, but we have another cluster on Amazon East and Route 53 on top of that. When the main cluster fails, Route 53 switches to the secondary, so we were not affected at all this time.

The only manual step was to delay the switch back until our VMs were working fine and had all their resources. We do this by changing the Route 53 health check to one that is always failing.
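The logic of that manual step can be sketched as follows. This is plain Python modelling the decision only, not the Route 53 API; the class, field, and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    # What a real health check against the primary cluster would report.
    primary_check_passing: bool
    # Operator override (assumed name): behave as if the primary's health
    # check were swapped for one that always fails.
    force_fail_primary: bool = False


def effective_primary_health(state):
    """The primary counts as healthy only if it passes AND is not held back."""
    return state.primary_check_passing and not state.force_fail_primary


def active_target(state, primary="primary", secondary="secondary"):
    """Where DNS failover would send traffic for the given state."""
    return primary if effective_primary_health(state) else secondary


# During the outage: the primary fails, so traffic goes to the secondary.
outage = FailoverState(primary_check_passing=False)

# The primary answers again, but the operator holds failback until the
# VMs are verified, so traffic stays on the secondary.
recovering = FailoverState(primary_check_passing=True, force_fail_primary=True)

# Override removed: traffic returns to the primary.
recovered = FailoverState(primary_check_passing=True)
```

The point of the override is that "the health check passes" and "we are ready to take traffic again" are not the same thing, so failback stays a deliberate decision.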

We also had to purge our crashed Mongo nodes because the journal was broken.

https://auth0.com/availability-trust/img/auth0-infrastructur...


On site disaster recovery? Off site disaster recovery? Split your hosting between multiple providers?

It really depends on how much risk you're willing to accept, and how much that is worth to you. It can be quantified via revenue lost, but reputation is much harder to put a number on.


If I had to quantify this - 3 hours * 3 people who can't work and publish posts + about a week of marketing costs for damaged rep (apologies, PR, ads for exposure). I'd say that at the very least this cost us $1000, and probably north of $3000.

This is not the first time this has happened in the last two months (after a relatively reliable year). The problem is I'm not sure any other hosting provider would do any better.


So the question becomes, would putting a DR site on AWS or Google cost more than the $3000 this outage cost you? If the answer is no, wouldn't it be worth architecting to not put all of your eggs in one basket?
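As a back-of-the-envelope version of that comparison: only the $3000 per-outage figure comes from this thread. The outage frequency, the DR price, and the function names below are assumptions you would replace with your own numbers.

```python
def expected_annual_outage_cost(outages_per_year, cost_per_outage):
    """Expected yearly loss if nothing changes."""
    return outages_per_year * cost_per_outage


def dr_pays_off(outages_per_year, cost_per_outage, dr_monthly_cost,
                fraction_avoided=1.0):
    """True if the loss a DR site avoids exceeds what the DR site costs.

    `fraction_avoided` is the assumed share of outage cost the DR site
    actually prevents (1.0 means failover is perfect, which is optimistic).
    """
    avoided = expected_annual_outage_cost(outages_per_year, cost_per_outage)
    avoided *= fraction_avoided
    return avoided > dr_monthly_cost * 12


# Assumed: 2 outages a year at $3000 each, DR site at $200/month.
print(dr_pays_off(2, 3000, 200))   # 6000 avoided vs 2400 spent
# Same outages, but a $600/month DR site.
print(dr_pays_off(2, 3000, 600))   # 6000 avoided vs 7200 spent
```

This ignores the engineering time to build and test failover, which is often the larger cost.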

Be mad at the service provider if they don't live up to the number of nines they promised. Be mad at yourself if you expected more nines than they can deliver.
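For a sense of scale, this is the arithmetic that turns "nines" into a downtime budget. The 99.9% and 99.95% figures are just example targets, not any provider's actual SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted over `period_hours` at a given availability."""
    return period_hours * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.95, 99.99):
    per_year = allowed_downtime_hours(target)
    per_month = allowed_downtime_hours(target, period_hours=30 * 24)
    print(f"{target}%: {per_year:.2f} h/year, {per_month * 60:.1f} min per 30 days")
```

At a 99.9% target, a 30-day month allows roughly 43 minutes of downtime, so a single 3-hour outage already blows through that month's budget several times over.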


Look into Cloudflare. They can act as a kind of reverse proxy to keep static stuff online. Obviously doesn't help if the transactional part of the site/database goes down, but end users will see a friendly message rather than it timing out.


As others have mentioned, multiple cloud providers, service checks, and withdrawing bad providers at DNS.


Azure + AWS



