Decoding Downtime for Amazon’s Cloud Services

When Amazon Web Services (AWS) suffered a major outage, many popular websites running on Amazon’s cloud services experienced interruptions. Amazon has explained the cause in a recently released report.

According to initial reports, the issue was caused by a typo made by one of the server administrators. While entering a routine command to remove servers from an S3 subsystem, the employee entered a larger number than intended. The affected servers also supported two other S3 subsystems, which held metadata and other storage information for a large region. Losing them led to a major outage for AWS.

Amazon has detailed the steps it will take to prevent such issues from happening in the future. In recent years, the systems under Amazon’s AWS segment have grown rapidly as many online businesses shifted to the cloud. Because of this growth, restarting the systems took much longer than expected, which prolonged the downtime for many services and caused embarrassment for Amazon.

Amazon said, “S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.”

On preventing similar issues in the future, Amazon said, “We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.”
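
To make the two safeguards in that statement concrete, here is a minimal Python sketch of how such a check might look. It is purely illustrative and is not Amazon's tool: the `Subsystem` type, the `plan_removal` function, the `min_required` floor, and the `max_per_batch` limit are all hypothetical names and parameters chosen for this example.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would breach the capacity floor."""


@dataclass
class Subsystem:
    # All fields are hypothetical; real capacity models are far richer.
    name: str
    active_servers: int
    min_required: int  # assumed minimum capacity the subsystem needs


def plan_removal(subsystem, requested, max_per_batch):
    """Split a removal request into small batches, or refuse it outright.

    Safeguard 1: reject any request that would leave the subsystem
    below its minimum required capacity.
    Safeguard 2: remove capacity slowly, at most max_per_batch servers
    per step, so a mistyped number cannot take effect all at once.
    """
    if requested <= 0 or max_per_batch <= 0:
        raise ValueError("requested and max_per_batch must be positive")

    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"removing {requested} servers would leave {subsystem.name} "
            f"with {remaining}, below its minimum of {subsystem.min_required}"
        )

    batches = []
    left = requested
    while left > 0:
        step = min(left, max_per_batch)
        batches.append(step)
        left -= step
    return batches
```

Under these assumptions, a request to remove 15 of 100 servers from a subsystem that needs at least 80 would be split into small batches, while a mistyped request for 50 would be rejected before any server is touched.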
