Amazon outages - Plausible causes
Tuesday, June 10th, 2008 by Supranamaya Ranjan
The intermittent outages at Amazon on Friday (June 6) and Monday (June 9), together with the absence of any official word from Amazon about the cause, are turning into an exciting Internet potboiler of sorts. While we still don’t know what caused these outages, we can rule out the following:
- Using NarusInsight Secure Suite (NSS), we haven’t so far found any instance of a large-scale network-initiated attack that could have led to these outages.
We did detect a Denial-of-Service (DoS) attack against the Internet Movie Database (IMDB), which is owned by Amazon. This attack lasted for about 2 hours starting at 9:52 am PST, which coincided with the downtime of Amazon and its affiliate sites. The attack volume averaged 3 Mbits/sec, and it was a sophisticated layer-7 DoS attack in which the attacker opened 500+ HTTP sessions in an attempt to exhaust the CPU resources or saturate the bandwidth around IMDB. However, the attack volume as seen by NSS probes does not seem large enough to cause an outage this big on IMDB or Amazon. At this point, the attack looks coincidental with the outage rather than the cause of it.
- Similarly, NSS shows that Amazon’s prefixes weren’t hijacked by anyone, so we can eliminate prefix hijacking as the cause (see the prefix hijacking of YouTube that happened earlier this year).
- A traceroute to Amazon’s prefixes doesn’t reveal any malicious Autonomous Systems (ASes) trying to re-route the traffic in order to steal it, which is known as a path hijacking attack. Note that in such an attack, the attacker inserts itself into the BGP AS_PATH.
- Amazon’s DNS entries are also pointing to the right IP addresses for their web servers, so we can eliminate DNS cache poisoning and other DNS-related attacks.
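The DNS check in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used for this post: the function names are hypothetical, and the "known" set is simply the three addresses quoted further down in this post, which were valid in June 2008 and will not match what amazon.com resolves to today.

```python
import socket

# Assumption: the three web server addresses quoted in this post (June 2008).
ADDRESSES_FROM_POST = {"72.21.210.11", "72.21.206.5", "72.21.203.1"}


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 80, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return {info[4][0] for info in infos}


def unexpected_addresses(resolved, expected):
    """Return the resolved addresses that are not in the expected set."""
    return set(resolved) - set(expected)


# Example use (performs a live DNS lookup, so it is left as a comment):
#   suspicious = unexpected_addresses(resolve_ipv4("amazon.com"), ADDRESSES_FROM_POST)
#   An empty result means every returned address was one we already knew about.
```

An unexpected address is not proof of poisoning, since large sites add and rotate addresses all the time; it is only a hint that a closer look is warranted.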
The most plausible cause appears to be errors in their Content Distribution Network (CDN) or in their load balancing. In normal operation, a CDN is supposed to direct a user to the “best” web server, the one that can serve them the content the fastest. “Best” could be defined as either the server that is closest to the user (measured by the smallest number of hops or the lowest round-trip time) or the one that currently has the smallest workload.
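To make the notion of the “best” server concrete, here is a toy sketch of the selection step. It assumes the CDN already has a hop count, a round-trip time, and a load figure for each candidate; the `Server` type, the `pick_best` function, the addresses (taken from the documentation range 192.0.2.0/24), and all the numbers are made up for illustration and say nothing about how Amazon's CDN actually works.

```python
from dataclasses import dataclass


@dataclass
class Server:
    ip: str
    hops: int      # number of network hops from the user
    rtt_ms: float  # round-trip time to the user, in milliseconds
    load: float    # current workload, 0.0 (idle) to 1.0 (saturated)


def pick_best(servers, metric="rtt_ms"):
    """Return the server with the smallest value of the chosen metric."""
    if metric not in ("hops", "rtt_ms", "load"):
        raise ValueError(f"unknown metric: {metric}")
    return min(servers, key=lambda s: getattr(s, metric))


# Hypothetical candidates with invented measurements.
candidates = [
    Server("192.0.2.1", hops=12, rtt_ms=80.0, load=0.90),
    Server("192.0.2.2", hops=9, rtt_ms=95.0, load=0.20),
    Server("192.0.2.3", hops=15, rtt_ms=40.0, load=0.55),
]

closest_by_hops = pick_best(candidates, "hops")
closest_by_rtt = pick_best(candidates, "rtt_ms")
least_loaded = pick_best(candidates, "load")
```

Note that the three definitions can disagree: in this made-up data each metric picks a different server, which is one reason a misconfigured or stale metric can send users to a poor choice.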
However, yesterday (June 9th), every time I resolved amazon.com via DNS I was returned the IP address of what turned out to be the slowest web server. That address (http://72.21.210.11) either took forever to load in my browser or, when it did come back, the page had no images. On the other hand, the other two IP addresses, which most likely point to different web servers (http://72.21.206.5/ and http://72.21.203.1/), loaded normally and much faster.
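The comparison above was done by hand in a browser. A scripted version might look roughly like the sketch below, which times a plain HTTP fetch of each address and ranks them. The function names are hypothetical, the `opener` parameter exists only so the timing logic can be exercised without a network, and fetching the base HTML is a much cruder measure than a full browser page load with images.

```python
import time
import urllib.request


def time_fetch(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the seconds taken to fetch url, or None if the fetch fails."""
    start = time.perf_counter()
    try:
        with opener(url, timeout=timeout) as response:
            response.read()  # download the whole body, not just the headers
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return None
    return time.perf_counter() - start


def rank_by_latency(timings):
    """Return URLs sorted fastest first; failed fetches (None) go last."""
    return sorted(timings, key=lambda u: (timings[u] is None, timings[u] or 0.0))


# Example use (performs live HTTP requests, so it is left as a comment):
#   urls = ["http://72.21.210.11/", "http://72.21.206.5/", "http://72.21.203.1/"]
#   ranking = rank_by_latency({u: time_fetch(u) for u in urls})
```

A single measurement per address is noisy, so repeating the fetch a few times and comparing medians would give a fairer picture.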
So the most plausible cause for these outages seems to be one of the following: (i) a CDN-related problem where users are being returned sub-optimal web server IP addresses; (ii) an internal load-balancing issue in their data centers that is affecting the assembly of response pages from fragments such as images, text and dynamic queries; or (iii) Amazon re-architecting their data centers to accommodate new service offerings.
What the truth is, only time or Amazon can tell.