How did Amazon screw up the internet this week? A typo
An Amazon cloud service outage disrupted chunks of the internet Tuesday, rendering some apps and websites useless and causing chaos behind the scenes for some companies.
Amazon Web Services is now explaining why its S3 service went out that day: A human typed something wrong.
That morning, workers were trying to debug an issue that was slowing down the S3 billing system. Just after 11:30 a.m. CST, one of those employees punched in a command that was supposed to take a "small number" of S3 servers offline to help figure out what was causing the slowdown.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the explanation says.
So a typo ended up taking down servers that are needed to manage metadata and handle certain requests.
Amazon had to restart those servers, and it "took longer than expected." During that time, a bunch of services weren't available.
And that's why things got all crazy.
Amazon Web Services says it's built to handle certain failures without any customer impact, writing that it operates "with the assumption that things will occasionally fail." But the company hadn't done a full restart like the one required this time in years, and didn't know it would be such a lengthy process.
They've made some changes to how things work, including tweaking that command tool that the typo was punched into, so in future cases it'll remove systems slowly and prevent large numbers of servers from being taken offline.
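For the curious, here is a rough idea of what that kind of safeguard could look like. This is only a sketch, not Amazon's actual tool: the function name `plan_removal`, the 5 percent cap and the batch size of two are all made up for illustration. It refuses any request that would pull too many servers at once and splits the rest into small batches.

```python
from typing import List


class RemovalRefused(Exception):
    """Raised when a request would take too many servers offline."""


def plan_removal(
    active: List[str],
    requested: List[str],
    max_fraction: float = 0.05,  # hypothetical cap: at most 5% of the fleet
    batch_size: int = 2,         # hypothetical pace: two servers per step
) -> List[List[str]]:
    """Return small batches of servers to remove, or refuse the request."""
    active_set = set(active)
    unknown = [s for s in requested if s not in active_set]
    if unknown:
        raise RemovalRefused(f"unknown servers: {unknown}")

    # Drop duplicate names while keeping the order they were typed in.
    targets = list(dict.fromkeys(requested))

    # A mistyped input that matches too much of the fleet is rejected outright.
    limit = int(len(active) * max_fraction)
    if len(targets) > limit:
        raise RemovalRefused(
            f"asked to remove {len(targets)} servers, limit is {limit}"
        )

    # Slow removal: hand back the targets a few at a time.
    return [targets[i:i + batch_size] for i in range(0, len(targets), batch_size)]
```

With a fleet of 100 servers, asking for three comes back as two small batches, while asking for 50 raises `RemovalRefused` before anything is touched.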
"Finally, we want to apologize for the impact this event caused for our customers," the message read. "We will do everything we can to learn from this event and use it to improve our availability even further."
So there you have it. All it took was a typo to bring down apps and websites for an afternoon.