Our first site outage
This past Thursday, at approximately 3:30 pm Pacific Time, we experienced our first site outage.
I had set up some basic monitoring through Pingdom, which checks every five minutes that our site is online and healthy. I got an email at about 3:30 alerting me that there was a problem. At the same time, some error emails started arriving in my inbox. Apparently, the master database was not happy and our application could not connect to that machine. I tried logging into the machine to no avail: my ssh attempt just sat there hanging. Grrrrr…
Now, this isn’t a major problem, just a major inconvenience. We have two databases: a “master”, which does all of the hard work, and a “slave”, which serves as an exact mirror of the master. I sat there chatting with Ryan on IM, thinking about the best options. Since we use Rackspace Cloud, getting a new server online takes all of 2 minutes. I figured that would be the easiest thing: bring up a new machine, recreate the DB from the slave, and reconfigure our web application to use the new one. Just as I decided to do this, another email showed up, this time from Rackspace:
This message is to inform you that the host server your Cloud Server ‘rl-db1’ is on became unresponsive at 5:25PM CDT today. As a result we have rebooted the server and will investigate the issue. If the problem arises again we will proceed with a hardware swap to maintain the integrity of your data.
The Rackspace Cloud
OK, that sounded promising. About 2-3 minutes later, I was able to log in to the master DB machine. Everything looked fine, and pulling up RoastLog showed the website working as usual.
All in all, roastlog.com was down for about 20 minutes. What was the consequence of this? Not much, really. The RoastLogger application saves all data locally when you hit the “save” button. In a case like this, where it can’t upload a roast to the website, it simply keeps that data and tries again later. Sure, folks couldn’t log onto the website and view their data, which sucks, but the good news is that no data was lost. Even if the master database had been gone for good, we would have been able to recreate it from the slave DB. In the extremely odd case that both the master and the slave were somehow erased and unrecoverable, we take snapshots every 3 hours and upload those to Amazon S3, which pretty much guarantees data safety.
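For the curious, the "save locally, upload later" idea can be sketched roughly like this. This is only a minimal Python illustration, not RoastLogger's actual code: the `RoastOutbox` class, the one-JSON-file-per-roast layout, and the `upload` callback are all assumptions made up for the example.

```python
import json
import os


class RoastOutbox:
    """Hypothetical store-and-forward queue: save locally first, upload later."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def save(self, roast_id, roast):
        """Persist a roast to disk before any upload is attempted."""
        path = os.path.join(self.directory, f"{roast_id}.json")
        tmp = path + ".tmp"
        # Write to a temp file and rename it, so a crash mid-write
        # never leaves a half-written roast in the queue.
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(roast, f)
        os.replace(tmp, path)
        return path

    def pending(self):
        """File names of roasts that have not been uploaded yet."""
        return sorted(n for n in os.listdir(self.directory) if n.endswith(".json"))

    def flush(self, upload):
        """Try to upload every pending roast; keep the ones that fail."""
        sent = 0
        for name in self.pending():
            path = os.path.join(self.directory, name)
            with open(path, encoding="utf-8") as f:
                roast = json.load(f)
            try:
                upload(roast)  # assumed to raise OSError when the site is unreachable
            except OSError:
                # Site is down: leave this roast (and the rest) on disk for next time.
                break
            os.remove(path)  # delete the local copy only after a successful upload
            sent += 1
        return sent
```

The point of this design is that the local copy is removed only after the upload succeeds, so calling `flush` again later (on a timer, or on the next save) picks up whatever was left behind during an outage.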
Hopefully this wasn’t a major bummer for any of our users. It was actually a fairly good exercise to remind us that things like an entire machine going away can, and do, happen. Our job is to write software which expects these types of failures and handles them gracefully. I’m confident and proud that roasters can already trust that RoastLogger handles these cases, and won’t lose any of their data when there’s a hiccup in the Internet.