Amazon says AWS outage was caused by software problem

Amazon.com said automated processes in its cloud computing business caused cascading outages across the internet this week, affecting everything from Disney amusement parks and Netflix videos to robot vacuums and Adele ticket sales.

In a statement Friday, Amazon said the problem began Dec. 7 when an automated computer program, designed to make its network more reliable, caused a “large number” of its systems to behave unexpectedly. That, in turn, created a surge of activity on Amazon’s networks, ultimately preventing users from accessing some of its cloud services.

“Basically, a bad piece of code was executed automatically and it caused a snowball effect,” Forrester analyst Brent Ellis said. The outage persisted “because their internal controls and monitoring systems were taken offline by the storm of traffic caused by the original problem.”

Amazon explained the failure in a highly technical statement posted online. The problems began about 10:30 a.m. New York time on Dec. 7 and lasted several hours before Amazon managed to resolve them. In the meantime, social media lit up with complaints from consumers angered that their smart home gadgetry and other internet-connected services had suddenly ceased to work.

Some experts said the explanation doesn’t help users fully understand what went wrong.

“They don’t explain what this unexpected behavior was and they didn’t know what it was. So they were guessing when trying to fix it, which is why it took so long,” said Corey Quinn, cloud economist at Duckbill Group.

AWS is generally a reliable service. Amazon’s cloud division last suffered a major incident in 2017, when an employee accidentally turned off more servers than intended during repairs of a billing system. Still, the latest outage reminded the world how many products and services are centralized in common data centers run by just a handful of big tech companies like Amazon, Microsoft Corp. and Alphabet Inc.’s Google.

There is no easy fix to the problem. Some analysts believe companies should duplicate their services across multiple cloud computing providers so that no single crash puts them out of commission. Others say a “multi-cloud” strategy would be impractical and could make companies even more vulnerable because they would be exposed to everyone’s outages, not just AWS’s.

“We know this event impacted many customers in significant ways,” the company said in the jargon-filled statement. “We will do everything we can to learn from this event and use it to improve our availability even further.”
