"Cloud" infrastructure customers do still need to engineer availability

Today Amazon Web Services had another partial outage, impacting a number of their customers' sites.  This has happened a few times, and it is always met with declarations of the apocalypse by tech reporters.  I find that irresponsible and silly.  Overall, AWS has a good availability record, but any IT infrastructure can and will fail from time to time.  In the case of "cloud" providers like Amazon, these failures are amplified by the fact that they affect a significant number of independent companies relying on the same services.

But that doesn't mean that companies shouldn't use cloud service providers.  The value proposition is still fantastic: it enables small companies to compete with large incumbents without prohibitive capital requirements, and it lets larger companies handle surge traffic without significantly overbuilding their internal infrastructure.

Still, whether you are relying on your own infrastructure or someone else's, you have to engineer for availability if it's important to your business.  AllThingsD has an article on the outage with the incendiary title, "Amazon's Cloud Is Down Again, Taking Heroku With It".  But that's factually incorrect.  Amazon's "cloud" was not down.  There was a partial outage affecting EBS and several other services within an Availability Zone in Amazon's Northern Virginia (US East) Region.  Sites using the affected services within that Availability Zone were indeed impacted.  But an Amazon status update quoted in the AllThingsD article states:

2:20 PM PDT We’ve now restored performance for about half of the volumes that experienced issues. Instances that were attached to these recovered volumes are recovering. We’re continuing to work on restoring availability and performance for the volumes that are still degraded.

We also want to add some detail around what customers using ELB may have experienced. Customers with ELBs running in only the affected Availability Zone may be experiencing elevated error rates and customers may not be able to create new ELBs in the affected Availability Zone. For customers with multi-AZ ELBs, traffic was shifted away from the affected Availability Zone early in this event and they should not be seeing impact at this time.

If Heroku was impacted by this outage, it's a function of the availability choices they made.  I have no experience with or relationship to Heroku, so I don't know whether their engineering choices were driven by cost, engineering complexity, or other factors, but the bottom line is that they apparently chose to host their service within a single AWS Availability Zone.  That is not a high-availability architecture, and it's unfair to blame the service provider when that provider offers the tools to mitigate exactly this kind of single Availability Zone failure.
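To make that concrete, here is a minimal sketch of the kind of multi-AZ setup AWS describes above: web instances spread across two Availability Zones behind a classic Elastic Load Balancer, so traffic can shift away from a zone that degrades.  It uses the boto3 Python SDK; the zone names, AMI ID, instance type, and health-check path are placeholders, and this is an illustration of the general pattern, not Heroku's actual architecture or a complete high-availability design.

    # Sketch: spread web servers across two Availability Zones and front them
    # with a multi-AZ classic ELB so a single-AZ failure doesn't take the site down.
    # AZ names, AMI ID, instance type, and health-check path are placeholders.
    import boto3

    REGION = "us-east-1"                      # Northern Virginia (US East) Region
    ZONES = ["us-east-1a", "us-east-1b"]      # two independent Availability Zones

    ec2 = boto3.client("ec2", region_name=REGION)
    elb = boto3.client("elb", region_name=REGION)

    # Launch one web server per Availability Zone.
    instance_ids = []
    for zone in ZONES:
        resp = ec2.run_instances(
            ImageId="ami-12345678",           # placeholder AMI
            InstanceType="m1.small",          # placeholder instance type
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )
        instance_ids.append(resp["Instances"][0]["InstanceId"])

    # Create a classic load balancer spanning both zones; if one zone degrades,
    # the ELB keeps serving traffic from the healthy zone.
    elb.create_load_balancer(
        LoadBalancerName="web-multi-az",
        Listeners=[{
            "Protocol": "HTTP",
            "LoadBalancerPort": 80,
            "InstanceProtocol": "HTTP",
            "InstancePort": 80,
        }],
        AvailabilityZones=ZONES,
    )

    # Health checks let the ELB route around instances (and zones) that fail.
    elb.configure_health_check(
        LoadBalancerName="web-multi-az",
        HealthCheck={
            "Target": "HTTP:80/healthz",      # placeholder health-check path
            "Interval": 30,
            "Timeout": 5,
            "UnhealthyThreshold": 2,
            "HealthyThreshold": 2,
        },
    )
    elb.register_instances_with_load_balancer(
        LoadBalancerName="web-multi-az",
        Instances=[{"InstanceId": i} for i in instance_ids],
    )

The point is not that this particular snippet is the right design for any given service, but that the multi-AZ option exists and is the customer's to exercise.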

Just because you're using someone else's infrastructure does not mean you don't need to make engineering choices to achieve the availability you need, and it's irresponsible reporting to place all of the blame on the cloud provider.