Amazon’s Simple Storage Service (S3) outage on Feb. 28 took down many well-known websites and web services. For the complete post-mortem from Amazon Web Services (AWS), read this lengthy explanation of what went wrong and what AWS is doing to address the issue.
If the full explanation is too long or too complicated, here is the short version:
- An administrator was going to perform maintenance on a set of S3 servers.
- He mistyped the command to take a set of servers offline, and more servers than intended were taken offline.
- This pushed the entire S3 environment in the U.S. East region closer to the edge of its capacity than the system was designed for, and it caused widespread availability issues in web services that relied on S3.
More instructive and more worrisome are the steps Amazon took to prevent this issue from happening again:
- The tool that took the servers offline was modified to prevent it from taking too many servers offline at the same time.
- The entire S3 system was refactored into smaller cells so that if any one cell went down, it would impact fewer systems.
- Other tools were audited to make sure they were not subject to the same flaw as the tool that caused this outage.
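The first of those fixes, a constraint built into the removal tool itself, can be made concrete with a small sketch. This is purely illustrative and assumes invented names (`remove_capacity`, the threshold constants); it is not AWS's actual tooling, only the kind of guardrail the post-mortem describes:

```python
# Hypothetical guardrail: refuse to take too much of a fleet offline at once.
# The thresholds below are invented for illustration.
MAX_REMOVAL_FRACTION = 0.10  # never remove more than 10% of a fleet in one command
MIN_REMAINING = 3            # always leave a minimum number of servers running

def remove_capacity(fleet, requested):
    """Return the servers in `requested` that may safely be taken offline.

    Raises ValueError instead of silently proceeding when the request
    would drop the fleet below its safety floor -- turning a mistyped
    command into an error rather than an outage.
    """
    allowed = min(
        int(len(fleet) * MAX_REMOVAL_FRACTION),
        len(fleet) - MIN_REMAINING,
    )
    if len(requested) > allowed:
        raise ValueError(
            f"refusing to remove {len(requested)} of {len(fleet)} servers; "
            f"at most {allowed} may be taken offline in one operation"
        )
    return requested
```

With a fleet of 100 servers, removing two succeeds, while a mistyped request for 50 is rejected before anything goes offline.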
Why this raises questions of trust with public cloud vendors
There is no doubt that AWS is a ground-breaking cloud service and that Amazon continues to lead the public cloud market, both in terms of scale and in bringing innovative new services to market. However, operating a system at this level of scale while constantly changing it to deliver new services is not something anyone else in the computer industry has ever done.
By way of comparison, the early promoters of public cloud services said computing should be a utility like electricity. There might be multiple providers of electricity, but it should be reliable and simple to consume. Electricity is, in fact, a commodity across the U.S., and it is extremely reliable (generally only impacted by weather events). And while there is innovation going on in terms of how electricity is being produced, that does not result in scores of new electric services being released every year.
So, anyone who operates a service like a public cloud that people are supposed to be able to count on, while changing it under the hood on a continuous basis and constantly releasing improvements to it, has taken on a management burden that no one else in the computing industry has ever shouldered. This gives rise to some very tough questions, which none of the public cloud vendors has been forthcoming in answering:
- What metrics are you using to measure performance (how long operations take), throughput (how much work is done per unit of time), capacity (how much capacity exists), utilization (how much of that capacity is in use), and contention (how long the queues are) at each level of each subsystem that constitutes your services?
- For each of those metrics and for each of the key services, what do you deem to be normal behavior and at what point do you deem there to be abnormal behavior?
- Are you willing to share and be transparent about the operating status of your environment with your customers? Please do not tell us the answer to this question is CloudWatch because, as you know, that tells us nothing about what is going on under the hood.
- What level of automation exists in the operation of the environment? In other words, what actions occur automatically, without the involvement of human beings?
- What tools do your administrators use to manage the environment and what controls are in place to prevent these tools from causing widespread havoc? This is a crucial question because it is obvious that performing operations on groups of resources is necessary to run an environment like a public cloud at scale.
- Are there any single points of failure (SPOFs) in your environment? If so, where are they and what are your plans to eliminate them? This is important because, in the detailed explanation of this outage, it appears the index subsystem for S3 is a SPOF.
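To make the metrics and thresholds questions above concrete, here is a minimal sketch of the kind of per-subsystem health check they imply. Every name and threshold here is invented for illustration; the point is only that "normal" and "abnormal" must be defined explicitly, per metric, before anyone can be transparent about them:

```python
# Toy health check for one subsystem: compare reported metrics against
# operator-defined limits. Thresholds are illustrative, not real AWS values.
THRESHOLDS = {
    "latency_ms": 200,    # performance: how long operations take
    "utilization": 0.80,  # fraction of available capacity in use
    "queue_depth": 50,    # contention: how long the queue is
}

def health_report(metrics):
    """Return {metric_name: 'normal' or 'abnormal'} for each known metric."""
    return {
        name: ("abnormal" if metrics.get(name, 0) > limit else "normal")
        for name, limit in THRESHOLDS.items()
    }
```

A vendor answering these questions would, in effect, be publishing its version of `THRESHOLDS` and the live output of something like `health_report` for each subsystem.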
The single greatest concern about public cloud vendors
AWS launched in 2006, more than a decade ago. If it has taken AWS this long to figure out that tools that can take groups of resources offline should have constraints built into them to prevent them from taking too many things offline at once, then what else remains to be figured out? In other words, what else about running an environment that is subject to rapid innovation at this kind of scale has Amazon not figured out yet? The lack of transparency as to what is going on under the hood amplifies this concern.
How can enterprises protect themselves in the cloud?
There is no doubt that the idea of outsourcing the management of their compute infrastructure to a public cloud vendor is an attractive business idea to many enterprises. However, if you have services and applications that need to work well and perform well all the time, your applications need to be resilient with respect to issues in that outsourced infrastructure.
This leads to the following potential recommendations:
- Build resiliency into your cloud-hosted applications. This starts with having them be distributed and resilient to SPOFs in the applications themselves.
- This also means that many of your existing applications that are not scaled out and not resilient may not be candidates for migration to public clouds.
- Build resiliency into your deployment. This means deploying across multiple Amazon Availability Zones, or potentially even across multiple cloud providers.
- Keep a DR site in a colocation facility, a hybrid cloud or a private cloud.
- Keep the entire application in-house in your own private cloud.
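The "deploy across multiple zones or providers" recommendation comes down to one pattern at the application level: try a replica, and fall back when it fails. Here is a minimal, provider-agnostic sketch; the fetcher functions are placeholders for real per-region or per-provider client calls (for S3, each would wrap a region-specific client), and the names are invented:

```python
# Illustrative failover across replicas of the same object. Each fetcher
# stands in for a real cloud client call (e.g., an S3 GET in one region).
def fetch_with_failover(key, fetchers):
    """Try each (name, fetcher) pair in order; return (name, data) from
    the first one that succeeds, or raise if every replica fails."""
    last_error = None
    for name, fetch in fetchers:
        try:
            return name, fetch(key)
        except Exception as err:  # a real client would catch specific error types
            last_error = err
    raise RuntimeError(f"all replicas failed for {key!r}") from last_error
```

The essential point is that the application, not the infrastructure, decides what happens when one region or provider is unavailable; an outage like the Feb. 28 one then degrades a single replica rather than the whole service.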
Amazon’s outage on Feb. 28 has created a “what else do we not know that we do not know” moment. Amazon cannot address these issues by just analyzing and modifying its internal tools and procedures. Amazon needs to become transparent in terms of both the operational state of its systems and its own processes for managing those systems.
This article is published as part of the IDG Contributor Network.