Disaster Recovery

MichaelHildebrand · ‎Sep 19 2018

First published on TechNet on Feb 13, 2012

Have you ever tried to restore a server? What about a Production server? How about in the middle of the night? It never goes smoothly. Your cellphone never stops ringing. Often, the only thing that gets the server recovered is your own ingenuity and rock-star efforts. Let’s spend some cycles and try to get ahead of this.

As PFEs, one of our major roles and responsibilities is to help our customers realize “the gaps” and assist them in addressing them proactively. After an eye-opening conference call discussing recovery plans, or lack thereof, I felt even more compelled to create a post with some DR considerations. Hopefully, this will stir some thoughts and discussions (and ACTIONS!) around the matter of recovery.

Recovery can be defined as (among other things):

To return to health
To return to normal state
To gain back something which was lost

In our World of IT, we could be doing any or all of these actions during what we often refer to as “Disaster Recovery,” or “DR.” The cause could be a natural or man-made disaster or other large-scale event:

Fire/flood/storm
Terrorism or war
Facility malfunction

It could be a rogue admin or disgruntled employee. Often, it is an IT Pro making an innocent mistake – either small or large-scale.

Even with the confirmation prompts of most actions within Windows, people are still, well, ‘human.’
Anyone been on the recovery end of a script running with Admin-level credentials but not behaving as expected? Whoa daddy. That’s likely the time when you discover that backups have been failing. Since the spring. Of 2008.
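
One way to avoid finding out about failed backups in the middle of a crisis is a small, scheduled check on the age of the newest backup file. Here is a minimal sketch in Python, assuming your backup jobs drop files into a single directory. The function names, the directory you pass in, and the two-day threshold are all hypothetical; adjust them to how your backup product actually works.

```python
import os
import time


def newest_backup_age_days(backup_dir, now=None):
    """Return the age in days of the most recently modified file in backup_dir.

    Returns None when the directory contains no regular files at all.
    """
    now = time.time() if now is None else now
    mtimes = [
        entry.stat().st_mtime
        for entry in os.scandir(backup_dir)
        if entry.is_file()
    ]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 86400.0


def backup_is_stale(backup_dir, max_age_days=2.0, now=None):
    """True when there is no backup file, or the newest one is too old."""
    age = newest_backup_age_days(backup_dir, now=now)
    return age is None or age > max_age_days
```

Run something like this from a job that is itself monitored, and raise an alert whenever `backup_is_stale(...)` returns True. Keep in mind that a fresh file only tells you the job wrote something recently; only a test restore tells you the backup is usable.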

Consider the statement: We do full backups of the ‘whole’ server, so in order to recover after an outage, we would simply do a full recovery of the box and be done.

BUT many times, a ‘full’ server backup doesn’t get key files, such as files that are in use. DBs, transaction logs, application exe files, etc., are often not backed up by backup jobs running with default settings or without special agents. We usually don’t realize this until we’re in dire straits. Or, perhaps, there is a Scheduled Task that is supposed to pause/quiesce the app/DB so the backup can get a copy of the proper flat file(s)? However, the Task isn’t being monitored and it hasn’t run for 9 months (since the svc acct got locked out and we’re not monitoring it with SCOM). Also, since that last backup 9 months ago, the app owner has upgraded the app two versions.

Consider the statement: We test recovery of our systems at the annual/recurring DR exercise/effort/mtg (you do have one of those, don’t you?)

BUT as a “year in the life” passes for a system or server, it gets patched, service packed, drivers updated, settings changed (or drift), etc. Sometimes, the steps that enabled you to recover the system during the last DR exercise no longer work and the recovery suffers an epic failure.

BE PREPARED – as much as you can. Like many things, DR is always a work in progress and always changing as our systems evolve, get patched, updated or otherwise changed. Be vigilant! Be disciplined! Add Recovery to your normal work routine so it doesn't catch you off-guard. Consider recovery before a system is even deployed. Make sure it is part of the design. Test the recovery design prior to deployment and again at regular intervals. One tip is to add recovery testing to your own day-to-day work items.

Consider using Outlook and Recurring Appointments with Reminders
- Monthly – test recovery of a test OU and its test contents
- Quarterly – test recovery of a complete test server and its test applications/services
  - Isolated or other offline environment
- Bi-annually – test recovery of an entire Domain Controller (a test DC or another DC that doesn’t impact production)
- Annually – perform a more formal shared DR exercise
The Outlook Calendar method helps by blocking out Calendar time for this
You can also Invite others to these Outlook events
The Outlook Calendar method makes it all just a bit more official and formal
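
If you also keep a simple record of when each test last ran, the cadences above can be checked by a script as well as by a calendar. The following is a rough sketch, not a replacement for the calendar reminders. The test names, the day counts, and the `overdue_tests` function are made up for illustration, and "Bi-annually" is read here as twice a year, which is an assumption.

```python
from datetime import date, timedelta

# Cadences from the list above, expressed as the maximum number of days
# allowed between runs. "Bi-annually" is assumed to mean twice a year.
CADENCE_DAYS = {
    "test OU recovery": 31,
    "test server recovery": 92,
    "test DC recovery": 183,
    "formal DR exercise": 366,
}


def overdue_tests(last_run, today):
    """Return the names of recovery tests that are past due, sorted by name.

    last_run maps a test name to the date it last completed. A test that is
    missing from last_run has never been run and is always reported.
    """
    overdue = []
    for name, max_gap in CADENCE_DAYS.items():
        last = last_run.get(name)
        if last is None or (today - last) > timedelta(days=max_gap):
            overdue.append(name)
    return sorted(overdue)
```

For example, calling `overdue_tests({"test OU recovery": date(2012, 2, 1)}, date(2012, 2, 13))` reports every test except the monthly OU recovery, because none of the others has ever been recorded.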

Now for a few DR pointers. Much of this is obvious and self-evident. It is painful, though, how often we neglect or forget the obvious. Document. Document. DOCUMENT!

Have two or more locations for Documentation such that a disaster to the system(s) that store your Docs doesn’t render you completely scrambling.
Don’t underestimate the value of a hard-copy; even if it is a bit dated, it’s better than nothing
Make sure there are application-specific docs that get tested/reviewed
- Often, the app was installed 6 years ago and no one on the current team even knows where the install bits are stored. The woman who knew the app left the company and took to a life of wandering the forests; she hasn’t been heard from since the spring of 2004.
- Application pre-requisites/details
  - DotNet versions?
  - Service accounts? (local or Domain-based)
  - Specific or non-standard NTFS or registry permissions?
  - Non-standard User Rights or other local Policies or Group Policy settings
Track application service releases/updates/etc – so you’re able to get back to where you are via clean install + updates, if needed
Have a selection of these accessible:
- CD/DVD blanks
- USB thumb and bigger drives
- 3 ½” floppy disks – if you need one of these, they can be very hard to find these days
Some folks have mature “Configuration Management Database” systems (CMDB) to track server/application personality Information and settings

SCCM can help automate a great deal of this personality information via Inventory jobs
CMDBs are extremely helpful but many times, they are not running on a ‘highly available’ system and during a DR (exercise or real) might not be available. Examine your environment to see if you’ve painted yourself into a corner like this

Again, don’t be afraid of hard-copy – just be sure to secure it. There’s nothing better than a big ol’ DR binder when you need it.
Consider storing the following info as a good start

HDD sizes (especially C:)
C:\WINDOWS or C:\WINNT?
Service Pack levels
Standard/Enterprise/Datacenter/R2?
x86 vs x64?
Windows Firewall – custom ports/settings
Custom or non-standard Local Policies, reg entries, GPOs
Local Admin pwd (hopefully as part of a process that is managed/on-going)
TCP/IP info

Static routes
NIC settings and info
- Don’t forget NIC speed/duplex
Hardware config/info
- Driver versions
- BIOS versions and custom settings (e.g. virtualization, power mgmt, etc)
- Storage/array configs/logical drive layout
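
A few of the items above (drive size, OS version, architecture) are easy to capture with a script and drop into the DR binder. Below is a minimal sketch using only the Python standard library. The function names and output fields are hypothetical, and it covers only a small slice of the list; firewall rules, NIC settings, BIOS versions and array layouts need vendor or OS-specific tools.

```python
import json
import platform
import shutil
import socket
from datetime import datetime, timezone


def collect_inventory(system_drive_path):
    """Gather a few basic facts about the local machine as a plain dict."""
    usage = shutil.disk_usage(system_drive_path)
    return {
        "collected_utc": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "os_version": platform.version(),
        "architecture": platform.machine(),
        "system_drive": system_drive_path,
        "system_drive_total_gb": round(usage.total / 1024 ** 3, 1),
        "system_drive_free_gb": round(usage.free / 1024 ** 3, 1),
    }


def write_inventory(inventory, destinations):
    """Write the same JSON document to every path in destinations."""
    text = json.dumps(inventory, indent=2, sort_keys=True)
    for path in destinations:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
```

Writing to more than one destination follows the earlier advice about keeping documentation in two or more locations. On a Windows server you would pass something like `"C:\\"` as the drive path, and printing the JSON gives you a page for the hard-copy binder.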

For AD-specific recovery, consider the following as a start:

GPOs – are you backing up your GPOs?

Consider PowerShell and/or GPMC scripts

OU information along with GPO link information

Note, GPMC backups do not back up the GPO links (they’re an aspect of the OU, not the GPO itself) but the link information is recorded in the GPO report within the backup
OU permissions/delegations

Consider PowerShell and/or a DSACLs script

Directory Services Restore Mode (DSRM) Password

This is set on EACH DC independently and is very often poorly managed (if at all)
However, this can now be sync’d to a Domain account

http://support.microsoft.com/kb/961320

Current, accurate location of servers

In a large datacenter, simply finding the right physical server can be a maddening and high-calorie-burn endeavor
Virtual servers have their own set of ‘hide and seek’ issues

Tested recovery/boot CDs for pwd reset, dead server revival/data-harvesting/etc

Many times, the storage drivers on these need to be updated or they won’t ‘see’ the drives and can’t find the Windows installation
The Microsoft DaRT tool can help in this regard

http://technet.microsoft.com/library/ee532075.aspx

Hopefully, the information here reminds you of DR, gets you thinking about DR, brings up an idea or two about DR, or even stirs you to set up some Outlook appointments. Now, take action and be at least a little better prepared.

Cheers!
