I have the pleasure of working for the Exchange Critical Situation team.
With that in mind, I would like to share some basic things that everyone running Exchange needs to do in order to be prepared. While a lot of this is not "brand new" knowledge, it does address many of the things I see go wrong in the cases I work on. DR happens - will you be ready?
Service Level Agreement
The common scenario we run into here is that people have not thought about what is important if the Exchange server has a failure. I will be on the phone working with a customer and we determine they need to do a restore from backup. At that point (especially with Exchange 2003) there are a few ways we can go about doing that, and what is important to your company will help determine the best way to do the restore.
Not having this information decided in advance leads to long conversations with management to get the decisions made, which can drastically slow down the pace of recovery. In some rare cases I have ended up spending more time talking about what we could do than we spent actually doing it.
What you need to decide ahead of time are the answers to a few simple questions:
1) Which one is more important to my users: Restoration of Mail Flow or Recovery of Historical Data?
2) How long can we afford to be down without any Mail Flow?
3) How long can we afford to be down with no Historical Data Recovered?
4) If Historical Data is our top priority, at what point does Mail Flow become more important, and vice versa?
These four questions will help to define your options and what you can and cannot do in order to restore Exchange to the functional level you desire in the minimum amount of time. Having these answers decided, and written down where everyone can find them, is the first step to having a smooth disaster recovery.
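To make that concrete, here is a small sketch of how you might write those four answers down in a form nobody has to argue about at 2 AM. This is my own illustration in Python, not something out of any Exchange tool, and every field name and number in it is just an example:

    # Illustrative sketch only: a place to record the four SLA decisions
    # before a failure, so they are not debated during one.
    from dataclasses import dataclass

    @dataclass
    class ExchangeRecoverySla:
        top_priority: str                         # question 1: "mail flow" or "historical data"
        max_hours_without_mail_flow: float        # question 2
        max_hours_without_historical_data: float  # question 3
        hours_until_priority_flips: float         # question 4

    # Example answers (made up): mail flow matters most, but if historical
    # data is still missing after 8 hours the restore takes precedence.
    sla = ExchangeRecoverySla(
        top_priority="mail flow",
        max_hours_without_mail_flow=4,
        max_hours_without_historical_data=24,
        hours_until_priority_flips=8,
    )
    print(sla)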
Database Size
This is another common situation that I run into. We are doing a restore from tape of a 120 GB storage group and suddenly everyone realizes that the restore is going to take another 18 hours to finish. That means it is going to cut into the business day, and that cannot be allowed to happen. Now we end up in a panic situation where people are willing to try any crazy scheme they can think of to get the server back up before the morning.
This situation almost always comes about because people plan their database size based on their disk size and not the limitations of their Backup and Restore plan. Database size should be determined almost solely by your Service Level Agreement and how quickly your backup solution can restore the data.
So what you need to do with database size is work it backwards. Determine how long you can be without Historical Data. Then determine how fast you can restore from tape. Use those two numbers, with some padding for troubleshooting when the failure is discovered and some padding for log file replay after the restore is done, to determine how large your databases can be.
You also need to figure out if that number will hold when you have to restore a whole storage group of 5 databases, or a whole server of 20 databases. In most cases you will probably want an answer worked out for each of those scenarios as well.
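Here is a rough back-of-the-envelope sketch of that math, in Python. It is only an illustration: the restore rate, padding, and database counts below are numbers I made up, so plug in what your own environment actually measures.

    # Illustrative only: work the maximum database size backwards from the
    # restore window in your Service Level Agreement.
    def max_database_size_gb(restore_window_hours, troubleshoot_padding_hours,
                             log_replay_padding_hours, restore_rate_gb_per_hour,
                             databases_restored_together=1):
        """Largest database size that still fits inside the restore window."""
        usable_hours = (restore_window_hours
                        - troubleshoot_padding_hours
                        - log_replay_padding_hours)
        if usable_hours <= 0:
            raise ValueError("Padding already eats the whole restore window")
        # Total data you can pull from tape in that time, split across every
        # database that has to come back in the same window.
        return usable_hours * restore_rate_gb_per_hour / databases_restored_together

    # Example: 12 hour window, 2 hours to discover and troubleshoot, 2 hours
    # of log replay, tape restores at 20 GB per hour, storage group of 5.
    print(max_database_size_gb(12, 2, 2, 20, 5))   # 32.0 GB per database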
Practice
Now let us say that you have been diligent: you have your Service Level Agreement decided and your database sizes planned around it. That still does not mean you are ready.
What you need to do is practice as if your Exchange server had failed. We call this process running a Fire Drill. You should run an Exchange Fire Drill at least once a Quarter to keep everyone up to date on how the restore process works and how to perform it.
To run a Fire Drill you should set up a server (a beefy workstation will do) with sufficient drive space to accommodate the Exchange database from at least one of your servers. You would then set it up on its own network with its own Domain Controller (if you are not testing a full server restore, this can be a new domain). Install Exchange and your backup software on that server and make sure you can get access to the data on tape.
Now you are ready to go. Come in the next morning and declare, “The Exchange server/Storage Group/Database (whichever you want to practice) just went down. We need to get it back up and running, and we have X hours to do so.” That X hours should be the time from your Service Level Agreement.
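If you want to keep yourself honest during the drill, something as simple as the little Python stopwatch below will do. It is just a sketch I am suggesting, not part of any Exchange or backup tooling, and the 8 hour budget is only an example:

    # Illustrative sketch: time a Fire Drill against the budget from your SLA.
    import time

    def start_drill(budget_hours):
        """Start the clock; call the returned function when the restore is done."""
        start = time.monotonic()
        def finish():
            elapsed_hours = (time.monotonic() - start) / 3600
            outcome = "PASS" if elapsed_hours <= budget_hours else "FAIL"
            print("%s: restore took %.1f hours against a budget of %.1f hours"
                  % (outcome, elapsed_hours, budget_hours))
        return finish

    finish = start_drill(budget_hours=8)
    # ... perform the restore on the isolated test network ...
    finish()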
Write a Cheat Sheet
Now that you have gone through the process of doing a Fire Drill, you have learned what worked and what didn't. You have figured out all of the little check boxes, and the fact that you have to keep the intern away from the tape drive power button. Take all of that knowledge and make yourself a cheat sheet for next time.
This cheat sheet should contain an outline of the steps and processes that you need to go through in order to do your planned restore. It should include reminders of the little steps that you found are easy to miss. If possible you should also include screenshots of all of the settings you need in your backup software to do the restore. This cheat sheet will basically become your Restore Bible when it comes time for the real thing.
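The exact contents will depend on your environment and your backup software, but a skeleton along these lines is a reasonable starting point. This is only a suggestion of mine; every section and step below is a placeholder for your own notes:

    # Illustrative skeleton for a restore cheat sheet; replace every entry
    # with the steps that your own Fire Drills prove out.
    CHEAT_SHEET = {
        "Before you start": [
            "Confirm which database/storage group/server actually failed",
            "Pull the SLA numbers: mail flow vs. historical data priority",
        ],
        "Restore steps": [
            "Locate the right tapes and verify the backup software can see them",
            "Run the restore job (match the settings in the screenshots)",
            "Replay the log files and mount the database",
        ],
        "Easy-to-miss items": [
            "Keep the intern away from the tape drive power button",
        ],
    }

    for section, steps in CHEAT_SHEET.items():
        print("== %s ==" % section)
        for step in steps:
            print("  [ ] %s" % step)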
Practice some more
Last but not least, you need to bring that cheat sheet out on a regular basis and practice with it. Make sure your organization is doing an Exchange Fire Drill at least once a quarter. Make sure that not just the Exchange guy is there for it; he should have a backup person who can use the cheat sheet if he is on vacation. After each of these practice sessions, go back over the cheat sheet and make sure nothing needs to be updated.
If you do these basic, simple things you will be more prepared for when an Exchange disaster does happen, and your disaster recovery should go smoothly with the minimum amount of downtime. With Disaster Recovery, mistakes are measured in hours, so it pays to be prepared.
References:
Exchange Server 2003 Disaster Recovery Operations Guide
http://www.microsoft.com/technet/prodtechnol/exchange/2003/library/disrecopgde.mspx
Worksheet: Disaster Recovery Preparation for Exchange Server 2003
http://www.microsoft.com/technet/prodtechnol/exchange/2003/drchecklist.mspx
Preview: Exchange Server 2003 Disaster Recovery Planning Guide