Hello everyone, Mark from DS again. With more and more companies using virtualization, such as Microsoft Virtual Server, Server 2008 Hyper-V or VMWare, in their environments these days you may end up in the following situation I recently worked on:
1) Customer wanted to roll back one of his DC’s in his test environment to basically “back out” of some changes that had been made recently. This was a single domain forest that consisted on two Domain Controllers. Both of the DC’s were running Windows 2003 SP2.
2) Virtual Machine snapshots were being taken instead of normal system state backups.
3) They restored one of the DC’s from one of the snapshots.
4) Replication was broken.
Replication symptoms consisted of the following:
1) The Netlogon service is in a paused state.
2) In the Directory Service event log a replication error was logged, Source was NTDS Replication with the Event ID 2095.
3) Also in the Directory Service event logs were two warnings, Source was NTDS General with the event ID’s 1113 and 1115.
Here are samples of the Directory Service event log events with the description of the event.
Event Type: Error
Event Source: NTDS Replication
Event Category: Replication
Event ID: 2095
Description: During an Active Directory replication request, the local domain controller (DC) identified a remote DC which has received replication data from the local DC using already-acknowledged USN tracking numbers. Because the remote DC believes it is has a more up-to-date Active Directory database than the local DC, the remote DC will not apply future changes to its copy of the Active Directory database or replicate them to its direct and transitive replication partners that originate from this local DC. If not resolved immediately, this scenario will result in inconsistencies in the Active Directory databases of this source DC and one or more direct and transitive replication partners. Specifically the consistency of users, computers and trust relationships, their passwords, security groups, security group memberships and other Active Directory configuration data may vary, affecting the ability to log on, find objects of interest and perform other critical operations. To determine if this misconfiguration exists, query this event ID using http://support.microsoft.com or contact your Microsoft product support. The most probable cause of this situation is the improper restore of Active Directory on the local domain controller. User Actions: If this situation occurred because of an improper or unintended restore, forcibly demote the DC.
Event Type: Warning
Event Source: NTDS General
Event Category: Replication
Event ID: 1113
Description: Inbound replication has been disabled by the user.
Event Type: Warning
Event Source: NTDS General
Event Category: Replication
Event ID: 1115
Description: Outbound replication has been disabled by the user.
If you run the command repadmin /options <The DC Name> you can verify that inbound and outbound replication is disabled. You will see something similar to this:
Current DC Options: IS_GC DISABLE_INBOUND_REPL DISABLE_OUTBOUND_REPL
With more and more companies using Virtualization to replace actual physical hardware, especially in test environments, I believe we are going see more issues such as this one. This can also happen in situations where you are converting physical hardware to virtual machines which we refer to as “PtoV” (physical to virtual).
First we need to understand some basic background information regarding Active Directory (AD) replication. Domain Controllers (DC’s) use Update Sequence Numbers (USN’s) to track the updates that need to be replicated between replication partners. Every time a change in made to the data in the directory the USN is incremented to indicate a change was made. For each directory the DC stores, USN’s are used to track the latest updates that a DC has received from each source replication partner. Each DC also has a table where it knows about every other DC highest USN that stores a replica of that directory partition. Each DC also has a value on its NTDS Settings object called an invocation ID. This value is used to indentify its version of its local AD database.
There are two values that use USN’s during the replication process. One is the up-to-dateness vector , the other is the high water mark . The up-to-dateness vector is a value that the destination DC maintains for tracking the originating updates that are received from its source DC’s. When the destination DC requests its updates for a directory partition it supplies its up-to-dateness to the source DC who can use that value to reduce the set of attributes it needs to send to the destination DC. The source DC will send its up-to-dateness vector value to the destination DC once the replication cycle has completed. The high water mark is a value that the destination DC maintains to keep track of the latest change it has received from a specific source DC for an object in a specific directory partition. This value prevents the source DC from sending out changes to the destination DC that have already been applied by the destination DC.
The invocation ID is a GUID value that identifies the directory database running on a DC and is maintained separately from the identity of the server object. The server object identity never changes but the identity of the directory database (invocation ID) will change when a system state is restored by using the Microsoft API’s. All the domain controllers keep track of the directory database on its source replication partners. Both the up-to-dateness vector and the high water mark refer to the invocation ID so that other DC’s know which copy of the AD the replication is coming from.
I know this can be confusing so let’s add some graphics that may help to understand this better. Let’s say we have two DC’s, DC1 and DC2. Both of these DC’s are running as Virtual Machines on a host machine running your favorite Virtualization Software. For all intents and purposes we are assuming that replication is working fine and both of the DC’s are up to date on replication. Before we start, we take a “snapshot” of DC1. As we can see below we add a new user “Jeff Smith” on DC1. The USN is incremented from 4710 to 4711 on DC1.
Now we replicate the new user to DC2. DC1 will notify DC2 that it has changes that it needs to replicate. DC2 will then request the changes and send DC1 what it thinks is DC1’s high water mark is. In this case DC2 thinks that value is 4710 so that is what it sends. When they are done replicating DC1 will send DC2 its up-to-dateness vector so DC2 will have the new value.
Now let’s suppose that other changes in the environment are occurring and replicating as they should. “Jeff” logs on and changes his password. When he does this DC2 is the DC where the change takes place. This will increment the USN on DC2 as it was 2452 and we increment the USN for DC2 to 2453.
Next we replicate that password change over to DC1. DC2 tells DC1 that it has changes it needs to get. DC1 will send DC2 what it thinks DC2’s USN is, in this case DC1 thinks DC2 is at 2452.
Once they are done replicating DC1 USN will be 5040 and DC2 will know it DC1 is at 5040. DC2 will be at 2453 and DC1 will know that value as well.
Now you want to roll that one DC back. You apply the snapshot to the DC as a restore procedure. When this happens, the invocation ID remains the same, the USN’s are “rolled back” to the time the snapshot was taken. Now when the replication process starts the “snapshot” DC requests changes from its source DC it sends the old up-to-dateness vector to the source DC. The source DC sees this value and it knows what the value should be and they are different. The value sent has a lower value then the source DC has in its table for the destination DC. The response sent back to the destination DC by the source DC basically telling the destination DC its database is out of date. When this happens we have built-in protection so that the destination DC will take measures not replicate with other. This is referred to as a “USN rollback” situation.
The protection that the USN rollback system will take will be is:
1) Pause the Netlogon service.
2) Disable the inbound and outbound replication.
To correct this situation we need to do the following on the DC that has the roll back issue.
1) Forcefully demote the DC by running dcpromo /forceremoval . This will remove AD from the server without attempting to replicate any changes off. Once it is done and you reboot the server and it will be a standalone serve in a workgroup.
2) Run a metadata cleanup of the DC that was demoted per KB article 216498 on one of the replication partners.
3) If the demoted server held any of the FSMO (Flexible Single Master Operations) roles then use the KB article 255504 to seize the roles to another DC.
4) Once replication has occurred end to end in your environment you can rejoin the demoted server back to the domain then promote to a DC.
To prevent this from happening adhere to the following best practices:
1) Do not use imaging software to take an image of the DC.
2) Do not take or apply snapshots of the DC.
3) Do not shut the Virtual Machine down and simply copy the virtual disk as a backup.
4) If you have the ability to “discard changes” as you do if you are running “Virtual Server 2005 R2”, do not enable this type of setting on a DC Virtual Machine.
5) Use NTBACKUP.EXE, WBADMIN.EXE, or any third party software that is available as long as it is certified to be AD-compatible to take system state backups.
6) Only restore a system state to the DC or restore a full backup.
875495 How to detect and recover from a USN rollback in Windows Server 2003
Appendix A: Virtualized Domain Controllers and Replication Issues
Backup and Restore Considerations for Virtualized Domain Controllers
- Mark Ramey
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.