Saturday, August 12, 2006

Testing Disaster Recovery

This week I was able to test part of the White Light Computing disaster recovery plan for my primary development laptop. I experienced the situation I believe every computer user fears most: hard drive failure.

Wednesday evening I returned home from one of my clients. I pulled my laptop out of my backpack and put it on my desk. It started whining and screeeeeching, and making the horrific clicking sound you hear when a drive is beginning to fail. I was very fortunate that the drive did not die immediately, and I was able to get it to stop making the bad noises by placing it absolutely level on the desk (in case you are wondering, I normally place it in a port replicator on a stand on the desk, which is not level). If I picked it up it would click and screech.

I immediately connected it to my backup drive. I am using two Western Digital 250GB USB 2.0 drives to back up each of the data partitions using PowerQuest's DriveImage (now owned by Symantec, with the technology rolled into Ghost 10). I ran the backup jobs normally scheduled each night. I had just backed up the C: partition on Saturday and had not loaded any new programs since, so I figured I was okay on this.

Thursday morning I called HP support and reported the problem. The young lady was nice enough to listen to the clicking and screeching, but had to follow procedure to get my drive replaced. I told her I was willing to do so as long as none of the steps was FORMAT C-colon. The BIOS drive diagnostics showed nothing. I wasted more than an hour running these, which is time lost forever. I told her I was going to buy a replacement drive anyway and needed the drive specs. She offered to sell me an 80GB drive for US$628. I asked her if she was offering me a new laptop for that price.

I picked up a new 120GB drive for US$200 at the local CompUSA. It took me 15 minutes to install it. I had some work to do at one of my clients (who is kind enough to provide me a boat anchor to work on when I am in their office). I was hoping to restore the drive while working on their conversion code. For some reason I could not figure out how to restore the images from the DVDs I had burned, and I did not think to bring my USB drive.

I was able to handle all my email via the Web interfaces. It was then that I realized I did not have access to any of the email templates I use regularly. I have five email templates used to correspond with developers who buy my developer tools, and a couple more for reporting hours to clients. I also have several templates to answer common questions on the various FoxPro forums. This was failure point number one. To fix this I will be putting the templates on a thumb drive.

Thursday night was our monthly DAFUG meeting. Cathy Pountney visited and rehearsed her testing session. It was a great meeting, but it delayed the reload. I got home after 10:00 and spent some time with the family. They were watching Harry Potter 2 and I got sucked in.

The reload started at 11:30. I was anticipating each partition restore to take a little less time than it takes to back the partition up because I was verifying the image before loading. The 25GB C: partition took less than an hour (it takes close to 2 hours to back up). I did not watch it and was planning on it running overnight. I could not sleep, so I checked it and it was already done. I kicked off the 15GB D: partition, which took fifteen minutes, and then the 30GB J: partition, which took a little more than half an hour.

Rebooted and the machine started just like it did before the drive crash. SWEET!

I had to change the drive letter for the J: partition because it came back assigned to F:. I had to reactivate the IntelliSync software (used to sync my Treo phone). I was half expecting to have to reactivate Windows XP because the drive size changed, but it did not ask me to do so.

The one thing that surprised me is that Zone Alarm lost its marbles. I had to go through the cycle of approving Internet access for several applications. This is not a big deal, but it is aggravating. So action item number two is to export my Zone Alarm settings periodically.

I had the machine in working order by 3:00am including downloading emails, checking a couple of newsgroups, and letting FeedDemon refresh the RSS feeds.

The only other problem I have experienced is with files saved to the desktop. Under my partition backup scheme, these files are only backed up once a month when I image the C: partition. I happened to create a spreadsheet Monday morning and saved it to the desktop. This is really rare for me, but it pointed out a hole in the disaster recovery plan. It only took me a couple of minutes to recreate the spreadsheet, and in this case I still have a working drive I could have recovered it from. If the drive had been completely dead, I could have lost nearly a month of work. Action item three is to ensure the Documents and Settings folder gets backed up more frequently and to keep only shortcuts on the desktop.
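One way to close the gap between monthly C: images is a small script that copies recently changed files to the backup drive. What follows is only a minimal sketch in Python, not part of my actual setup: the function name `copy_recent_files` and its parameters are hypothetical, and it assumes the destination folder lives outside the source folder.

```python
import shutil
import time
from pathlib import Path
from typing import List


def copy_recent_files(source: Path, dest: Path, max_age_days: float) -> List[Path]:
    """Copy files under `source` modified in the last `max_age_days` days to `dest`.

    Hypothetical helper for illustration. Assumes `dest` is not inside `source`.
    Returns the copied paths, relative to `source`.
    """
    cutoff = time.time() - max_age_days * 24 * 60 * 60
    copied = []
    # sorted() collects the full file list before any copying starts
    for path in sorted(source.rglob("*")):
        if path.is_file() and path.stat().st_mtime >= cutoff:
            relative = path.relative_to(source)
            target = dest / relative
            # Recreate the folder structure on the backup side
            target.parent.mkdir(parents=True, exist_ok=True)
            # copy2 also preserves the file's modification time
            shutil.copy2(path, target)
            copied.append(relative)
    return copied
```

Calling it with the Documents and Settings folder as `source`, a folder on the USB drive as `dest`, and `max_age_days=31` would pick up anything created since the last monthly image. It does not replace the partition images; it only covers the files that change between them.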

I lucked out. The situation could have been catastrophic. It was not. I did lose much of a day, but was still able to deliver some work for one of my clients. I had to delay another client for something I was hoping to deliver Thursday, but he was very understanding.

I am going to buy a second 120GB drive later today and go through this process on a regular basis. It was an excellent way for me to test the recovery process. The things that slowed me down were the time to go get the drive and the time to restore the images. If I had already had the drive, I literally would have been up and running in a couple of hours.

I only tested out part of my disaster recovery. What if my machine died completely? I have a spare laptop (older but still working as a test machine) and the data images I can use to restore the latest files as opposed to restoring the partition. Now if a fire destroyed both laptops I would be in trouble, but this is a risk I am managing.

The questions of the day are:
  1. Do you have a disaster recovery plan?
  2. How good is your disaster recovery plan?
  3. How long would it take you to get back in business?
  4. Would you be able to recover all your data?

I am very happy with the way DriveImage saved my day. I have used it before to restore files, but never to restore entire partitions. Yes, I never really fully tested my disaster recovery! I fell into the 95 percent who don't. I am ashamed to admit this, but I can also say I am part of the lucky few (maybe the 5 percent) who survived despite not doing the appropriate testing.

So how well would you survive losing the drive on your primary development machine?

