Can we trust the public cloud vendors?

Amazon’s Simple Storage Service (S3) outage on Feb. 28 took down many well-known websites and web services. For the complete post-mortem from Amazon Web Services (AWS), read this lengthy explanation of what went wrong and what AWS is doing to address the issue.

If the full explanation is too long and too complicated, here is a short version:

  • An administrator was going to perform maintenance on a set of S3 servers.
  • He mistyped the command to take a set of servers offline, and more servers than intended were taken offline.
  • This left the S3 environment in the US-EAST-1 Region with less capacity than the system was designed to run on and caused widespread availability issues in the web services that relied on it.

More instructive and more worrisome are the steps Amazon took to prevent this issue from happening again:

  • The tool that took the servers offline was modified to prevent it from taking too many servers offline at the same time. 
  • The entire S3 system was refactored into smaller cells so that if any one cell went down, it would affect fewer systems.
  • Other tools were audited to make sure they were not subject to the same flaws as the tool that was used to cause this outage.
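
To make the first of these fixes concrete, here is a minimal sketch of what such a safeguard could look like. It is not Amazon's tool, and nothing in it comes from the AWS post-mortem: the `Fleet` class, the `take_offline` function, and the `min_required` and `max_batch` limits are all hypothetical names chosen for illustration. The idea is simply that the tool refuses a request that is too large or that would leave too few servers online, rather than trusting whatever number the operator typed.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would breach a safety limit."""


@dataclass
class Fleet:
    # All fields are hypothetical; a real tool would read them from live state.
    name: str
    online: int        # servers currently serving traffic
    min_required: int  # assumed floor the subsystem needs to stay healthy
    max_batch: int     # assumed cap on servers removed by a single command


def take_offline(fleet: Fleet, count: int) -> int:
    """Take `count` servers offline, or refuse if the request is unsafe.

    Returns the number of servers still online after the removal.
    """
    if count <= 0:
        raise ValueError("count must be a positive number of servers")

    # Guard 1: limit how many servers one command may remove at once.
    if count > fleet.max_batch:
        raise CapacityGuardError(
            f"{fleet.name}: refusing to remove {count} servers at once "
            f"(limit is {fleet.max_batch})"
        )

    # Guard 2: never drop below the minimum capacity the fleet needs.
    remaining = fleet.online - count
    if remaining < fleet.min_required:
        raise CapacityGuardError(
            f"{fleet.name}: removing {count} servers would leave {remaining}, "
            f"below the required minimum of {fleet.min_required}"
        )

    fleet.online = remaining
    return remaining
```

With a fleet of 100 servers, a floor of 80, and a batch limit of 10, a request to remove 5 servers succeeds, while a mistyped request to remove 50 is rejected and leaves the fleet untouched. The point of the design is that a typo produces an error message instead of an outage.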

Why this raises questions of trust with public cloud vendors

There is no doubt that AWS is a ground-breaking cloud service and that Amazon continues to lead the public cloud market, both in scale and in bringing innovative new services to market. However, operating a system at this scale while constantly changing it to deliver new services is not something anyone else in the computer industry has ever done.
