How we fixed the MySQL.com Power Outage
As many MySQL users noticed, MySQL.com and related sites disappeared from the Internet on Wednesday 22 July 2009 for about 10 hours. I’d like to give an update on what happened and what we’ve done to fix it, as such outages are unacceptable. For this blog entry, I’ve talked to Adam Donnison, senior MySQL.com web developer/admin. Last Wednesday certainly also highlighted the amazing power of Twitter, where we could communicate with our community in real time using our @MySQL and @MySQL_Community identities as well as our personal ones.
So what happened? For the last several years, the MySQL.com servers were hosted in Uppsala, Sweden, where MySQL AB was headquartered before becoming part of Sun. During all this time, despite Uppsala being the sole hosting location for MySQL.com, MySQL’s Web Team recorded a near “five-nines” uptime rate over 4 years. As we wanted to get rid of the single point of failure represented by MySQL.com being located in a single facility, we planned on moving all the servers to a new facility in Stockholm, Sweden, with highly available redundancy in Sun’s datacentres in the United States. Ironically, this was planned for last Saturday, 25 July 2009. As you can probably foresee, our plans changed.
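To put the “five-nines” figure in perspective, here is a rough back-of-the-envelope sketch in Python. It is illustrative arithmetic only, not a statement of our measured uptime: the function names are made up for this example, and the second calculation assumes (hypothetically) that a single 10-hour outage is the only downtime in a four-year window.

```python
# Illustrative availability arithmetic; names and assumptions are hypothetical.
HOURS_PER_YEAR = 365.25 * 24


def allowed_downtime_minutes(availability: float, years: float) -> float:
    """Downtime budget, in minutes, implied by an availability target."""
    return (1.0 - availability) * years * HOURS_PER_YEAR * 60


def availability_after_outage(outage_hours: float, years: float) -> float:
    """Availability if outage_hours were the only downtime in the period."""
    return 1.0 - outage_hours / (years * HOURS_PER_YEAR)


# Five nines (99.999%) over 4 years leaves a budget of roughly 21 minutes.
budget = allowed_downtime_minutes(0.99999, 4)

# Assumption: one 10-hour outage and no other downtime in 4 years.
after = availability_after_outage(10, 4)

print(f"five-nines budget over 4 years: {budget:.1f} minutes")
print(f"availability with one 10-hour outage: {after:.4%}")
```

In other words, a single outage of this length uses up many years’ worth of a five-nines budget, which is why removing the single point of failure mattered so much to us.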
Last week, the building in which the data centre resides in Uppsala, Sweden suffered a major power outage. It lasted long enough to overcome our extensive UPS and then damaged the main building power grid in a way that we could not be confident power would return to stability for days to come (and later, this was proven true). Global internal discussions were held, emails were sent and our Internal IT immediately started the move to the Stockholm Data Centre. Thanks, Ove Ewerlid, Jonathan Petersson, Thorild Selén and Danny Swälas of the MySQL IT team for excellent, focused 24×7 work at short notice!
Meanwhile in Adam’s own words: “As the servers were being moved over to Stockholm, the web team and I decided to finish activating the redundant data centre in the United States. This too was planned for the weekend so I had some work to do. We moved the most recent data sets and files and started our Master/Slave databases there and brought back a read-only version of the MySQL.com sites as quickly as we could.”
Adam continues: “The entire process of provisioning new servers was made easier by moving to the GlassFish Web Stack on our new Sun servers, which provided the key platform elements of the MySQL.com architecture: MySQL, Apache, PHP and Memcached. Instead of adding individual components and making sure they all played nice together, it was a single install and we were able to roll the site across from our existing LAMP stack with no changes to any software.”
Now, to be honest, we did learn a few lessons from our sites going down. Even though we knew about the disaster scenario that could happen in Uppsala and were prepared to move, we were caught unaware. We fully recognize that our external communication could have been better, and that our response could have been a bit faster. Our new disaster plans for the Web and Community sites will have to include such scenarios. Many in the community have given us suggestions on how we could improve our communication at these times and we appreciate and listen to all such feedback.
However, by the end of this last weekend, here’s what we have: Our servers are 100% back up-and-running, now redundant across two continents with data centres which are staffed and equipped with some of the best Sun Servers available. MySQL.com has survived over 10 years and tells the remarkable story of a database that grew to become the most popular open source database in the world. I’m very happy and proud that our web and IT teams have managed to move our web site and all its services so that going forward, we can have the highly available, highly redundant and high performance web site that you have come to expect from us.


July 29th, 2009 at 11:15
a) thanks for the openness, much appreciated and happy it’s all worked out.
b) Knowing Adam, I doubt he actually said that Glassfish sentence; sounds like marketing. remove?
July 29th, 2009 at 12:44
Looks like traceroutes from Asia, US west coast, US east coast, and places in Europe all going through mysql-gw-sec-c.bahnhof.net. Are you guys doing this as a warm failover? Also, your TTL is 3600 which means that you may have an hour of outage… intentional?
July 29th, 2009 at 12:48
@Arjen
a) thanks
b) sure, it’s Adam’s own words; do note he talks about the Glassfish *Web Stack*
July 29th, 2009 at 15:32
@Cory: As you say, the failover today isn’t seamless. The actions we’ve taken will mitigate major failures such as the one we had last week. In this case one must also take into consideration that 3600 seconds is the worst-case scenario, and there are tweaks we can make to lower this.
Our IT and Web teams will continue to work closely to shrink this window as much as possible. Given the facility we are in today, it’s highly unlikely that an event like this would occur again, but obviously any actions we can take to prevent further downtime will be evaluated and implemented as soon as we’re able to.
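As a rough illustration of why the TTL bounds the failover window, here is a minimal Python sketch. It assumes a simple model in which a resolver may keep serving the old address for up to one full TTL after the DNS record changes; the function name and the detection and switch-over delays are hypothetical and are not measurements of our setup.

```python
# Simplified model of a DNS-based failover window; all names are hypothetical.
def worst_case_outage_seconds(ttl_seconds: int,
                              detection_seconds: int = 0,
                              switch_seconds: int = 0) -> int:
    """Upper bound on client-visible outage under this model.

    A client that cached the old record just before the change keeps
    using it for the full TTL, on top of the time needed to notice the
    failure and to publish the new record.
    """
    return detection_seconds + switch_seconds + ttl_seconds


# Compare the TTL mentioned above (3600 s) with two shorter values.
for ttl in (3600, 300, 60):
    total = worst_case_outage_seconds(ttl)
    print(f"TTL {ttl:>4} s -> worst case about {total / 60:.0f} minutes")
```

A shorter TTL shrinks the window at the cost of more frequent lookups against the authoritative name servers, so the right value is a trade-off rather than simply “as low as possible”.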
July 30th, 2009 at 3:49
Kaj,
It might also be worth mentioning that the support / enterprise stack was operational the full time. It has lived in both the US and Stockholm datacenters for almost a year now.
That stack is no longer my job but I am still proud that my hard work on it paid off.
July 30th, 2009 at 5:58
Arjen: The Glassfish thing is a return of the “call everything Java” or “Call everything .NET” craze we had a decade ago. Sun offers a bundle known as “Glassfish Web Stack”, which doesn’t include the product actually known as Glassfish. (At least last time I looked.)
While still at Sun I learned that the Java thing is still around: Sun sells a “Java Enterprise Messaging System” or something like that, which really is Sendmail, bash and Perl. No Java anywhere.
August 1st, 2009 at 0:32
@Henrik - Actually, WebStack 1.5 includes the GF Server… although there is a clearly stronger grouping around the AMP pieces. For example, the IPS repository and the main installer only cover the AMP pieces.
Although I’m a strong GF proponent, I find the name a bit confusing and I tend to drop the “GF” part and just call it “WebStack”