A series of unfortunate events occurred, but EditGrid was back up and running 70 minutes later. It is understandable that our customers will complain to us about such a long outage, especially those who use our service for serious work. I see this as a lucky event that let us experience what disaster recovery is really like. Lucky? Yes, not many Web 2.0 start-ups get an opportunity to demonstrate their robust recovery plan (or the lack thereof).
Since we believe that EditGrid is known not only for its amazing number of features but also for the transparency we offer our users, I will try my best to explain all the work we did during the 70 minutes of outage. Besides earning your applause for a job well done, this article serves another purpose: we publicly owe our users some improvements to the system that will help prevent another such incident, or at least shorten the downtime if one unfortunately occurs. If you are impatient, you can skip to the last paragraph.
We begin with some background. Just yesterday, we got an alert on our production system, a preventive alert warning us that our growing user base and usage were exceeding the capacity of our system. As a result, we deployed an extra machine into the production server farm and arranged another stand-by machine.
Today at around 15:10 (all times are UTC hereinafter), we received an alert on our production system, the same alert that occurred yesterday. Naturally, we began by assuming that it was again the same preventive warning, which fires when our system responds too slowly to requests. For the first few minutes, we examined the servers one by one, trying to locate the cause. We examined every component but could not find a clue as to why they had suddenly become slower than normal.
Starting from around 15:20, we noticed that the number of requests reaching the servers was dropping sharply, and a couple of minutes later they stopped altogether. We believed that this was due to a failure of the automatic recovery mechanism. We then went through all the servers, attempting to restart each component manually. Interestingly, one specific component, after being stopped, took forever to start.
At around 15:40, we finally traced the problem to a failure of the database server. The database process was running, but when we sent it a request, we got no response at all, not even an error. Having confirmed this behavior from a few different machines, we concluded that the problem had to lie in the server itself. We decided to restart it. The result was what everyone would expect: upon start-up, the server immediately crashed.
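The failure mode described here, a process that is alive and accepts connections but never answers, is one that a simple "is the process running?" check will miss. As a minimal sketch of how such a state could be told apart from an outright crash, the Python function below opens a TCP connection, sends a request, and waits a bounded time for any reply. The name `probe`, the `ping` payload, and the result labels are all hypothetical; a real check would send whatever lightweight request the database's own protocol supports.

```python
import socket


def probe(host, port, request=b"ping\n", timeout=2.0):
    """Classify a TCP service as 'ok', 'hung', 'closed', 'error' or 'unreachable'.

    `request` is a placeholder payload (an assumption for illustration);
    substitute a cheap request the real service understands.
    """
    try:
        # The timeout given here also applies to later send/recv calls.
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Nothing is listening, or the host cannot be reached in time.
        return "unreachable"
    with conn:
        try:
            conn.sendall(request)
            reply = conn.recv(1)
        except socket.timeout:
            # Connected, but no answer within the timeout: the state
            # described above (running, yet silent).
            return "hung"
        except OSError:
            return "error"
    # An empty read means the peer closed the connection without replying.
    return "ok" if reply else "closed"
```

Running a probe like this against every component in turn would point at the silent one within a few seconds, rather than leaving it to be found by restarting components by hand.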
We tried a vast number of different configuration parameters, but the database server still refused to start. We then tried to fall back on promoting a backup database server to primary. It turned out that, due to a totally unrelated event, *all* the backup database servers had stopped replicating from the primary server a number of hours earlier. If we wanted to promote one of them to primary, we would have to wait until it had replicated all the changes up to the time we stopped the primary server. We timed it with a stopwatch, and this would have taken over an hour to finish.
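The stopwatch estimate is simple arithmetic: measure how many changes the backup applies during a short sample window, then divide the remaining backlog by that rate. A rough sketch in Python follows; the function name and all the numbers in the note after it are made up for illustration and are not our actual figures.

```python
def catch_up_seconds(backlog, applied, sample_seconds, incoming_rate=0.0):
    """Estimate how many seconds a backup needs to apply `backlog` changes.

    applied        -- changes the backup replayed during the sample window
    sample_seconds -- length of the sample window, in seconds
    incoming_rate  -- new changes per second still arriving from the primary
                      (0 once the primary has been stopped)
    """
    net_rate = applied / sample_seconds - incoming_rate
    if net_rate <= 0:
        # The backup is not gaining on the primary; it never catches up.
        return float("inf")
    return backlog / net_rate
```

For instance, with a hypothetical backlog of 400,000 changes and 6,000 changes applied in a 60-second sample, `catch_up_seconds(400_000, 6_000, 60)` gives 4000.0 seconds, or roughly 67 minutes.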
Meanwhile, we began searching the web for reasons why the server was crashing. We found a bug report for the version we were running that sounded related. Coincidentally, we had recently started testing a newer version of the database server in our QA environment. With a little hesitation, we copied the new version onto our production system (which took some time due to the overseas file transfer) and attempted to start it. It worked!
It only seemed to work. It worked when we created a spreadsheet, but not when we attempted to log in. Then we noticed a lot of errors from the server complaining about invalid indexes on some tables. We immediately executed some SQL queries to repair the corresponding table indexes, and then everything really worked. Time flew, and it was already 16:30.
Can we do better? We are not in a position to guarantee that our database has no bugs. But we should be alerted when a backup database server fails to stay synchronised with the primary server. We should also have a better mechanism than enumerating all servers and components to identify the faulty one. In the best case, the outage could have been much shorter. We will definitely learn from these events and keep our service stable. In any case, you can rest assured that your data is still here, though it may take some more time for it to be recovered.
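To make the first of these improvements concrete, here is a minimal sketch in Python of a replication check. It assumes a monitoring job can obtain, for the primary and for each backup, the timestamp of the most recent change committed or applied; how to read those values depends on the database and is not shown. The function name and the five-minute threshold are illustrative choices, not a description of our actual monitoring.

```python
def lagging_replicas(primary_last_commit, replica_last_applied,
                     max_lag_seconds=300.0):
    """Return {replica name: lag in seconds} for backups that are too far behind.

    primary_last_commit  -- Unix time of the newest change on the primary
    replica_last_applied -- dict mapping each backup's name to the Unix time
                            of the newest change it has applied
    """
    lagging = {}
    for name, applied_at in replica_last_applied.items():
        lag = primary_last_commit - applied_at
        if lag > max_lag_seconds:
            lagging[name] = lag
    return lagging
```

Comparing against the primary's latest commit, rather than against the wall clock, avoids false alarms during quiet periods when the primary itself has nothing new to replicate. Any non-empty result would be a reason to page someone long before the backup is needed.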