Here’s Why Amazon’s Cloud Suffered a Meltdown This Week
Apparently all it takes to bring down the Internet isn’t a virus or malware or a well-organized, state-sponsored attack. A typo will do the trick.
On Thursday, Amazon revealed the cause of the meltdown at its data center in Northern Virginia two days earlier, which sent much of the Internet haywire. Numerous companies running their services on Amazon Web Services, Amazon’s cloud and data center infrastructure business, sustained outages and problems as a result, including those behind popular workplace productivity products like Slack and Trello.
Amazon (AMZN) gave a nearly play-by-play account of what happened to its Simple Storage Service over a span of a little more than four hours, from 9:37 a.m. PT/12:37 p.m. ET to 1:54 p.m. PT/4:54 p.m. ET. Also known as S3, Simple Storage Service is a popular option for businesses looking for minimal cloud setup and storage for cloud-based applications. It also promises “99.999999999% durability.”
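To put that durability figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the percentage is an annual, per-object figure (the reading used in Amazon's own S3 materials, though not stated in the quote above), and the function name is made up for illustration. Durability concerns whether stored data is lost, which is a separate question from whether the service is reachable during an outage.

```python
from fractions import Fraction

# Assumption: "99.999999999% durability" is read as the chance that a single
# stored object survives one year. Fraction avoids floating-point rounding.
DURABILITY = Fraction("0.99999999999")
ANNUAL_LOSS_PROBABILITY = 1 - DURABILITY  # one in 100 billion


def expected_annual_losses(num_objects):
    """Average number of objects expected to be lost per year (hypothetical helper)."""
    return num_objects * ANNUAL_LOSS_PROBABILITY


if __name__ == "__main__":
    losses = expected_annual_losses(10_000_000)
    print(f"Expected losses per year for 10 million objects: {losses}")
    print(f"Roughly one lost object every {1 / losses} years")
```

Under that reading, a customer storing 10 million objects would expect to lose about one object every 10,000 years on average.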
This is the incident that started it all, as described by Amazon:
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.
Translation: The Amazon employee made an error, a simple typo with massive repercussions. Amazon offers a highly technical, transparent breakdown of what happened, including how the initial error forced a number of other Amazon cloud storage and processing products out of service as well.
As Amazon put it, the system needed “a full restart.”
Such an event demonstrates why tech giants such as Amazon, Microsoft (MSFT), Google (GOOGL), Facebook (FB), and Apple (AAPL), among many others, are racing to build data center after data center worldwide, ever closer to their customers, to ensure redundancy and avoid an event like this.
Amazon said it learned from the incident and has put new safeguards in place: “We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future.”
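For readers curious what such a safeguard might look like in practice, here is a minimal Python sketch of the two ideas Amazon describes: refusing any removal that would drop a subsystem below a minimum capacity, and removing servers in small batches rather than all at once. This is not Amazon's tool. Every name here (`Subsystem`, `CapacityError`, `remove_capacity`, `max_per_step`) is hypothetical, and a real system would also pause and run health checks between batches.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    """Hypothetical model of a subsystem and the servers that back it."""
    name: str
    servers: set
    min_required: int  # assumed minimum number of servers needed to operate


class CapacityError(Exception):
    """Raised when a removal would breach a subsystem's minimum capacity."""


def remove_capacity(subsystems, to_remove, max_per_step=2):
    """Remove servers in small batches, rejecting unsafe requests up front.

    Returns the list of batches that were removed, in order.
    """
    to_remove = set(to_remove)

    # Safeguard 1: validate the whole request against every subsystem
    # before touching anything, so a mistyped input fails loudly.
    for sub in subsystems:
        remaining = len(sub.servers - to_remove)
        if remaining < sub.min_required:
            raise CapacityError(
                f"{sub.name} would drop to {remaining} servers "
                f"(minimum {sub.min_required})"
            )

    # Safeguard 2: remove slowly, a few servers at a time.
    batches = []
    pending = sorted(to_remove)
    for start in range(0, len(pending), max_per_step):
        batch = pending[start:start + max_per_step]
        for sub in subsystems:
            sub.servers.difference_update(batch)
        batches.append(batch)
    return batches
```

With a check like the first one in place, an input that names far more servers than intended is rejected before any server is taken offline, instead of cascading into the other subsystems that share those machines.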