The PDCe with too much to do
Published Apr 04 2019 06:47 PM
Microsoft
First published on TechNet on Sep 23, 2011

Hi. Mark again. As part of my role in Premier Field Engineering, I’m sometimes called upon to visit customers when a critical issue being worked by CTS needs another set of eyes. For today’s discussion, I’m going to talk you through one such visit.


It was a dark and stormy night …


Well, not really – it was mid-afternoon, but these sorts of things always have that sense of drama.


The Problem


Custom applications were hard-coded to use the PDC Emulator (PDCe) for authentication – a strategy the customer later abandoned to eliminate a single point of failure. The issue was hot because the PDCe was not processing authentication requests after a reboot.
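One way to avoid pinning applications to a single DC is to let them find domain controllers through the DNS SRV records that the DC locator relies on. The sketch below only builds the record names. The function name `srv_record_name` and the domain `corp.example.com` are made up for illustration, and the actual DNS lookup is left to whatever resolver the application already uses.

```python
def srv_record_name(dns_domain: str, pdc_only: bool = False) -> str:
    """Build the DNS SRV name that domain controllers register under.

    pdc_only=True targets the record registered only by the PDC emulator;
    the default targets the record registered by every DC in the domain.
    """
    domain = dns_domain.strip().strip(".").lower()
    if not domain:
        raise ValueError("dns_domain must not be empty")
    role = "pdc" if pdc_only else "dc"
    return f"_ldap._tcp.{role}._msdcs.{domain}"


if __name__ == "__main__":
    # Hypothetical domain name, for illustration only.
    print(srv_record_name("corp.example.com"))
    print(srv_record_name("corp.example.com", pdc_only=True))
```

Querying the `dc` record returns every available DC, so an application that uses it keeps working when any single DC, including the PDCe, is down.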


The customer had noticed lsass.exe consuming a lot of CPU and this is where CTS were focusing their efforts.


The Investigation


Starting with the Directory Service event logs, I noticed the following:


Event Type:          Information
Event Source:        NTDS Replication
Event Category:      Replication
Event ID:            1555
Date:                <Date>
Time:                <Time>
User:                NT AUTHORITY\ANONYMOUS LOGON
Computer:            <Name of PDCe>
Description:
The local domain controller will not be advertised by the domain controller locator service as an available domain controller until it has completed an initial synchronization of each writeable directory partition that it holds. At this time, these initial synchronizations have not been completed.

The synchronizations will continue.



also:


Event Type:          Warning
Event Source:        NTDS Replication
Event Category:      Replication
Event ID:            2094
Date:                <Date>
Time:                <Time>
User:                NT AUTHORITY\ANONYMOUS LOGON
Computer:            <Name of PDCe>
Description:
Performance warning: replication was delayed while applying changes to the following object. If this message occurs frequently, it indicates that the replication is occurring slowly and that the server may have difficulty keeping up with changes.

Object DN: CN=<ClientName>,OU=Workstations,OU=Machine Accounts,DC=<Domain Name>,DC=com
Object GUID: <GUID>
Partition DN: DC=<Domain Name>,DC=com
Server: <_msdcs DNS record of replication partner>
Elapsed Time (secs): 440

User Action
A common reason for seeing this delay is that this object is especially large, either in the size of its values, or in the number of values. You should first consider whether the application can be changed to reduce the amount of data stored on the object, or the number of values.  If this is a large group or distribution list, you might consider raising the forest version to Windows Server 2003, since this will enable replication to work more efficiently. You should evaluate whether the server platform provides sufficient performance in terms of memory and processing power. Finally, you may want to consider tuning the Active Directory database by moving the database and logs to separate disk partitions.

If you wish to change the warning limit, the registry key is included below. A value of zero will disable the check.

Additional Data
Warning Limit (secs): 10
Limit Registry Key: System\CurrentControlSet\Services\NTDS\Parameters\Replicator maximum wait for update object (secs)




and:


Event Type:          Warning
Event Source:        NTDS General
Event Category:      Replication
Event ID:            1079
Date:                <Date>
Time:                <Time>
User:                <SID>
Computer:            <Name of PDCe>
Description:
Internal event: Active Directory could not allocate enough memory to process replication tasks. Replication might be affected until more memory is available.

User Action
Increase the amount of physical memory or virtual memory and restart this domain controller.




In summary, the PDCe hasn’t completed initial synchronisation after a reboot and it’s having memory allocation problems while it works on sorting it out. Initial synchronisation is discussed in:


Initial synchronization requirements for Windows 2000 Server and Windows Server 2003 operations master role holders
http://support.microsoft.com/kb/305476
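To put the 2094 warning in perspective, it helps to compare the reported elapsed time against the warning limit. The following is a rough sketch that pulls those two fields out of the event description text. It assumes the description has already been exported as plain text with the field labels shown above; the helper names are my own.

```python
import re

# Matches lines such as "Elapsed Time (secs): 440" in the event description.
_FIELD = re.compile(
    r"^[ \t]*(Elapsed Time|Warning Limit) \(secs\):[ \t]*(\d+)[ \t]*$",
    re.MULTILINE,
)


def parse_delay_fields(description: str) -> dict:
    """Return the numeric fields found in an event 2094 description."""
    return {name: int(value) for name, value in _FIELD.findall(description)}


def delay_ratio(description: str):
    """Return elapsed time divided by the warning limit.

    Returns None when the limit is zero, because the event text says a
    value of zero disables the check.
    """
    fields = parse_delay_fields(description)
    limit = fields["Warning Limit"]
    if limit == 0:
        return None
    return fields["Elapsed Time"] / limit


SAMPLE = """Elapsed Time (secs): 440

Warning Limit (secs): 10
"""

if __name__ == "__main__":
    print(delay_ratio(SAMPLE))  # 44.0
```

With the values from this incident, applying changes to a single computer object took 44 times longer than the warning threshold.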


With this information in hand, I had a chat with the customer hoping we’d identify a relevant change in the environment leading up to the outage. It became apparent they’d configured a policy for deploying RDP session certificates. Furthermore, they’d noticed clients receiving many of these certificates instead of the expected one.


RDP session certificates are Secure Sockets Layer (SSL) certificates issued to Remote Desktop servers. It is also possible to deploy RDP session certificates to client operating systems such as Windows Vista and Windows 7. More on this later…


The customer and I examined a sample client and found 285 certificates! In addition to this unusual behaviour, the certificates were being published to Active Directory. There were 3700 affected clients – approx. 1 million certificates published to AD!
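The "approx. 1 million" figure is simple multiplication, under the assumption that every affected client held roughly as many certificates as the sample client we examined. A quick sketch of the arithmetic:

```python
def total_certificates(clients: int, certs_per_client: int) -> int:
    """Estimate certificates published to AD, assuming every client
    holds about as many as the sample client."""
    return clients * certs_per_client


if __name__ == "__main__":
    estimate = total_certificates(3700, 285)
    print(f"{estimate:,}")  # 1,054,500
```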


The Story So Far


We’ve injected huge amounts of certificate data into the userCertificate attribute of computer objects, we’ve got a replication backlog due to memory allocation issues, and the DC can’t advertise itself as a DC until it completes an initial sync.


What Happened Next Uncle Mark?!


The CTS engineer back at home base wanted to gather some debug logging of LSASS.exe. While attempting to gather such a log, the PDCe became completely unresponsive and we had to reboot.


While the PDCe rebooted, the customer disabled the policy responsible for deploying RDP session certificates.


After the reboot, the PDCe had stopped logging event 1079 (for memory allocation failures) but in addition to events 1555 and 2094, we were now seeing:


Event Type:          Warning
Event Source:        NTDS Replication
Event Category:      DS RPC Client
Event ID:            1188
Date:                <Date>
Time:                <Time>
User:                NT AUTHORITY\ANONYMOUS LOGON
Computer:            <Name of PDCe>
Description:
A thread in Active Directory is waiting for the completion of a RPC made to the following domain controller.

Domain controller:
<_msdcs DNS record of replication partner>
Operation:
get changes
Thread ID:
<Thread ID>
Timeout period (minutes):
5

Active Directory has attempted to cancel the call and recover this thread.

User Action
If this condition continues, restart the domain controller.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.


A bit more investigation with:



Repadmin.exe /showreps (or /showrepl for later versions of repadmin)



told us that all partitions were in sync except the domain partition – the partition with a million certificates attached to computer objects.
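When there are many partitions and partners, reading this output by eye gets tedious. Later versions of repadmin can emit it as CSV (`repadmin /showrepl * /csv`), which is easy to filter. The sketch below is a minimal example of that filtering. The column names are what I would expect in the CSV header and should be checked against real output, and the server names and rows in `SAMPLE` are invented.

```python
import csv
import io


def lagging_links(csv_text: str) -> list:
    """Return (naming context, source DSA, failure count) for every
    replication link that reports at least one failure."""
    result = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        failures = int(row.get("Number of Failures") or 0)
        if failures > 0:
            result.append((row["Naming Context"], row["Source DSA"], failures))
    return result


# Invented sample data in the assumed CSV layout.
SAMPLE = '''showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,Source DSA Site,Source DSA,Transport Type,Number of Failures,Last Failure Time,Last Success Time,Last Failure Status
showrepl_INFO,Site1,PDCE01,"DC=example,DC=com",Site1,DC02,RPC,12,2011-09-01 10:00:00,2011-08-31 09:00:00,1722
showrepl_INFO,Site1,PDCE01,"CN=Configuration,DC=example,DC=com",Site1,DC02,RPC,0,0,2011-09-01 10:05:00,0
'''

if __name__ == "__main__":
    for context, source, failures in lagging_links(SAMPLE):
        print(f"{context} from {source}: {failures} failures")
```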


We decided to execute:



Repadmin.exe /replicate <Name of PDCe> <Closest Replication Partner> <Domain Naming Context> /force
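If you need to issue the same command against several partners, a small wrapper keeps the argument order straight: destination first, then source, then naming context. This is only a sketch that builds the argument list; the function name and the example server names are hypothetical.

```python
def build_replicate_command(dest_dc: str, source_dc: str,
                            naming_context: str, force: bool = True) -> list:
    """Build the argument list for a repadmin /replicate call.

    dest_dc pulls changes for naming_context from source_dc.
    """
    for label, value in (("dest_dc", dest_dc),
                         ("source_dc", source_dc),
                         ("naming_context", naming_context)):
        if not value.strip():
            raise ValueError(f"{label} must not be empty")
    command = ["repadmin.exe", "/replicate", dest_dc, source_dc, naming_context]
    if force:
        command.append("/force")
    return command


if __name__ == "__main__":
    # Hypothetical server and domain names.
    print(build_replicate_command("PDCE01", "DC02", "DC=example,DC=com"))
```

On a Windows machine with repadmin installed, the returned list could be passed to `subprocess.run(command, check=True)`.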



Next, we waited … for several hours.


While waiting, we considered:



  • Disabling initial sync with:


[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters]


Repl Perform Initial Synchronizations = 0



  • Increasing the RPC timeout for NTDS with:



http://support.microsoft.com/default.aspx?scid=kb;EN-US;830746



Both of these changes require a reboot. The customer was hesitant to reboot again and while they thought it over, initial sync completed.
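For reference, the first of those changes could be staged as a .reg file so it is ready to import if needed. The sketch below generates the file content for a DWORD value; the function name is my own, and the output should be reviewed before importing it on a domain controller. We never applied this change ourselves.

```python
NTDS_PARAMETERS = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters"
)


def reg_file_for_dword(key_path: str, value_name: str, value: int) -> str:
    """Return .reg file text that sets a single DWORD value."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in a DWORD")
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        f"[{key_path}]\r\n"
        f'"{value_name}"=dword:{value:08x}\r\n'
    )


if __name__ == "__main__":
    # Value name and data as listed in the bullet above.
    print(reg_file_for_dword(
        NTDS_PARAMETERS, "Repl Perform Initial Synchronizations", 0))
```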


With the PDCe authenticating clients, I headed home to get some sleep. The customer had disabled the RDP session certificate deployment policy and was busy clearing the certificate data out of computer objects in Active Directory.
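I did not record exactly how the customer cleared the certificates, so treat the following as one possible approach rather than what they did. It generates LDIF that removes every value of userCertificate from a list of computer objects, which could then be imported with a tool such as ldifde. The function name and the DNs are hypothetical, and the sketch assumes plain ASCII DNs that need no base64 encoding.

```python
def ldif_clear_attribute(dns, attribute: str = "userCertificate") -> str:
    """Return LDIF modify records that delete all values of `attribute`
    from each distinguished name in `dns`."""
    records = []
    for dn in dns:
        records.append(
            f"dn: {dn}\n"
            "changetype: modify\n"
            f"delete: {attribute}\n"
            "-\n"
        )
    # LDIF records are separated by a blank line.
    return "\n".join(records)


if __name__ == "__main__":
    # Hypothetical computer objects.
    print(ldif_clear_attribute([
        "CN=PC001,OU=Workstations,OU=Machine Accounts,DC=example,DC=com",
        "CN=PC002,OU=Workstations,OU=Machine Accounts,DC=example,DC=com",
    ]))
```

Note that this removes all certificates published on the object, including any legitimate ones, so the list of DNs should be limited to the affected clients.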


Why?


The next day, I went looking for root cause. The customer had followed some guidance to deploy the RDP session certificates. Some of the guidance noted during the investigation is posted here:



http://blogs.msdn.com/b/rds/archive/2010/04/09/configuring-remote-desktop-certificates.aspx ...



I set up a test environment and walked through the guidance. After doing so, I did not experience the issue. I was getting a single certificate no matter how often I rebooted or applied Group Policy. In addition, RDP session certificates were not being published in Active Directory. Publishing in Active Directory is easily explained by the “Publish certificate in Active Directory” checkbox on the certificate template.





An examination of the certificate template confirmed they had this checked.


So why were clients in the customer environment receiving multiple certificates while clients in my test environment received just one?


The Win


I noticed the following point in the guidance being followed by the customer: the “Template display name” and “Template name” fields should be given the same name.





A bit of an odd recommendation. Sure enough, the customer’s template had different names for “Template display name” and “Template name”. I changed my test environment to make the same mistake and suddenly I had a repro – a new certificate on every reboot and policy refresh.


Some research revealed that this was a known issue. One of these fields checks whether an RDP session certificate exists while the other field obtains a new certificate. Giving both fields the same name works around the problem.
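A check like this is easy to automate once you have the two names for each template. The sketch below assumes you have already exported them by some means; the `TemplateNames` type and the sample values are invented for illustration, and the comparison is a simple exact match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateNames:
    display_name: str  # "Template display name" field
    name: str          # "Template name" field


def mismatched(templates) -> list:
    """Return the templates whose two name fields differ."""
    return [t for t in templates if t.display_name != t.name]


if __name__ == "__main__":
    # Invented example values.
    templates = [
        TemplateNames("RemoteDesktopComputer", "RemoteDesktopComputer"),
        TemplateNames("Remote Desktop Computer", "RemoteDesktopComputer"),
    ]
    for t in mismatched(templates):
        print(f"Mismatch: '{t.display_name}' vs '{t.name}'")
```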


Conclusion


So in the aftermath of this incident, there are some general recommendations that anyone can follow to help avoid this kind of situation.



  • Follow our guidance carefully – even the weird stuff

  • Test before you deploy

  • Deploy the same way as you test

  • Avoid making critical servers more critical than they need to be


- Mark “Falkor” Renoden
