The entire issue, Amazon says, traces back to the S3 (Amazon Simple Storage Service) billing system, which was running too slowly on the morning of February 28.
While investigating the slowdown and following internal procedures, an employee decided to remove some servers from the S3 billing system. To do so, he needed to run a console command.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," Amazon said.
"The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region."
In simple terms, the employee accidentally took offline servers that kept track of where S3 objects were stored in the region, and that S3 needed in order to serve requests.
The fallout was immediate, as some websites went down completely, while others just lost access to multimedia files stored on S3 servers located in that particular region [Northern Virginia (US-EAST-1) Region].
Furthermore, one of the impacted services was Amazon's own status page, which for most of the outage showed that everything was running smoothly, even though around 20% of all Internet sites were affected, according to an estimate by Shawn Moore, CTO at Solodev.
To make matters worse, restarting those crucial servers took longer than usual, as they were restarted quite rarely, and the team wasn't accustomed to going through all the safety checks at full speed.
Amazon says it is currently implementing some changes to prevent a similar situation. For example, when removing servers (capacity), the tool Amazon uses will no longer let capacity drop below a minimum level, since going under that limit would endanger the normal functioning of the entire S3 network.
Additionally, Amazon plans to break down its network into smaller cells, so an outage like this will affect only a smaller number of customers. This operation was planned for the end of the year, but Amazon has now moved it to the top of the priority queue.
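To illustrate the kind of safeguard described here, the following is a minimal sketch in Python of a removal command that refuses to take capacity below a floor. It is not Amazon's tooling: the function name `remove_capacity`, the `CapacityFloorError` exception, the server names, and the `min_required` parameter are all hypothetical and chosen only for illustration.

```python
class CapacityFloorError(Exception):
    """Raised when a removal would leave too few servers running."""


def remove_capacity(active_servers, to_remove, min_required):
    """Return the servers left after removal, or refuse the request.

    Hypothetical helper: names and behavior are assumptions, not AWS code.
    """
    targets = set(to_remove)
    remaining = [s for s in active_servers if s not in targets]
    # Check the floor before anything is actually taken offline.
    if len(remaining) < min_required:
        raise CapacityFloorError(
            f"refusing removal: {len(remaining)} servers would remain, "
            f"minimum is {min_required}"
        )
    return remaining


# Hypothetical fleet of ten servers with a floor of six.
fleet = [f"idx-{i}" for i in range(10)]
after_small_removal = remove_capacity(fleet, ["idx-0", "idx-1"], min_required=6)
```

With a check like this, a mistyped input that selects too many servers is rejected outright instead of being carried out.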
The month of February has been riddled with typos. A typo in the source code of the Zerocoin cryptocurrency allowed an unknown attacker to steal around $592,000. Similarly, a typo in a Cloudflare component caused a massive leak of its clients' data, known as Cloudbleed.