Corrupt AD PDC breaks trust - How to gracefully recover

Hi,

I have a small network of Windows Server 2016 VMs in Azure, with a single forest and AD provided by a PDC & BDC.

FSMO roles are handled by the PDC. Both DCs host DNS: the PDC points to itself, and the BDC points to the PDC with itself as secondary.

After several years of smooth sailing, I recently awoke to find a plethora of trust & authentication issues within the network after the PDC shut down unexpectedly. 

DCDiag on PDC was reporting:

The host <guid>_msdcs.cdsloud.net could not be resolved to an Ip address
Check DNS server, DHCP etc
Error while checking LDAP and RPC Connectivity, check your firewall settings

DCDiag on BDC was reporting:

Starting test DFSREvent :
There are warning or error events within the last 24 hours after the SYSVOL has been shared.
Failing SYSVOL replication problems may cause GP problems

Starting test KnowsOfRoleHolders:
The target principal name is incorrect
Warning W01 is the Schema Owner but is not responding to DS RPC Bind
LDAP bind failed with error 8341
A directory service error has occurred.

Starting test Replications:
A recent replication attempt failed
The replication generated an error (1256)
The remote system is not available

There are many errors in the Event Logs on the BDC (and no doubt on the PDC). For example:

The DNS server has encountered a critical error from the Active Directory. Check that the Active Directory is functioning properly. The extended error debug information (which may be empty) is "0000208D: NameErr: DSID-03100245, problem 2001 (NO_OBJECT), data 0, best match of:
'CN=washington-02,CN=Servers,CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=<domain>,DC=net'". The event data contains the error.

This is a production environment and avoiding downtime is the priority. While I'd love to understand precisely what happened here, and how and why, I've realised that the PDC needs to stay turned off, otherwise the trust / authentication issues quickly make our systems unusable.

As such, the PDC is currently off and I'm looking for a path forward to re-stabilise the systems.

Questions I have at this stage are:

Am I on borrowed time with FSMO being on a DC that is offline? 

Am I better off planning for an extended outage where I can turn the PDC on and try a "graceful" demotion via Server Manager etc., or can I safely do a manual demote / delete with the PDC off via AD Users & Computers (RSAT) on the BDC? Will a cleanup as described here:

https://learn.microsoft.com/en-us/windows-server/identity/ad-ds/deploy/ad-ds-metadata-cleanup

automatically transfer the roles to the BDC? Will the second approach (manual delete with the PDC off) avoid downtime for the systems, and will the member servers require reboots?

Advice from the community would be much appreciated.

1 Reply

@rvntech For anyone interested: I attempted to transfer the roles from DC1 to DC2 and then demote DC1 gracefully; however, various errors prevented either of those from working.

I needed to seize the roles from DC1 in PowerShell (ntdsutil errored), remove the DNS role from DC1 and then delete DC1 from Active Directory Users & Computers on DC2.
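
For reference, a seizure of this kind can be sketched with the ActiveDirectory PowerShell module. This is a minimal sketch, not the exact commands that were run: it assumes the surviving DC is named "DC2", that RSAT / the ActiveDirectory module is installed, and that the account has the rights needed for all five roles. A DC whose roles have been seized should generally not be brought back onto the network afterwards.

```powershell
# Assumption: "DC2" is the surviving domain controller that should hold the roles.
Import-Module ActiveDirectory

# Show who AD currently believes holds each FSMO role.
Get-ADDomain | Select-Object PDCEmulator, RIDMaster, InfrastructureMaster
Get-ADForest | Select-Object SchemaMaster, DomainNamingMaster

# Without -Force this attempts a graceful transfer (needs the old holder online).
# With -Force the roles are seized even though the old holder is unreachable.
Move-ADDirectoryServerOperationMasterRole -Identity "DC2" `
    -OperationMasterRole SchemaMaster, DomainNamingMaster, PDCEmulator, RIDMaster, InfrastructureMaster `
    -Force
```

Re-running the two `Get-AD*` lines afterwards should list DC2 against every role.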

I ended up having to delete a lot of stale references from DNS on DC1 & DC3 (the new DC that replaces DC1), and it took a while for dcdiag to stop reporting various warnings.
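
A few checks that may help confirm the cleanup has settled are sketched below. The domain name `example.net` is a placeholder (an assumption), and the commands are meant to be run on one of the remaining DCs.

```powershell
# Placeholder domain name; substitute the real AD DNS domain.
$domain = "example.net"

# List the domain controllers still registered in DNS; the removed DC should no longer appear.
Resolve-DnsName -Name "_ldap._tcp.dc._msdcs.$domain" -Type SRV

# Summarise replication health across the remaining DCs.
repadmin /replsummary

# Re-run the DNS-specific dcdiag tests with verbose output.
dcdiag /test:DNS /v
```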

I seem to have one issue remaining: the member servers all still receive DC1 as the primary DNS server via DHCP. However, because these are VMs in an Azure vNET, I believe this is currently out of our control (unless we manually set the DNS servers on the NICs), and I hope it will update at some point in the "DNS Lifecycle":
https://learn.microsoft.com/en-us/azure/virtual-network/virtual-networks-name-resolution-for-vms-and...
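
If the DNS servers handed out by DHCP come from custom DNS settings on the virtual network (an assumption; the thread does not confirm how they were configured), they can usually be changed by whoever administers the vNet. A minimal sketch with the Az PowerShell module follows. The resource group, vNet name and IP addresses are hypothetical placeholders.

```powershell
# Requires the Az.Network module and a signed-in session (Connect-AzAccount).
# "my-rg", "my-vnet" and the IP addresses are placeholders.
$vnet = Get-AzVirtualNetwork -ResourceGroupName "my-rg" -Name "my-vnet"

# Show the DNS servers the vNet currently hands out to its VMs.
$vnet.DhcpOptions.DnsServers

# Replace them with the addresses of the surviving / new DCs and save the change.
$vnet.DhcpOptions.DnsServers = @("10.0.0.5", "10.0.0.6")
$vnet | Set-AzVirtualNetwork
```

VMs typically pick up the new servers only after a DHCP lease renewal (`ipconfig /renew` on Windows) or a restart.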