Hi. Mark again. As part of my role in Premier Field Engineering, I’m sometimes called upon to visit customers when they have a critical issue being worked by CTS and need another set of eyes. For today’s discussion, I’m going to talk you through one such visit.
It was a dark and stormy night …
Well not really – it was mid-afternoon but these sorts of things always have that sense of drama.
The Problem
Custom applications were hard coded to use the PDC Emulator (PDCe) for authentication – a strategy the customer later abandoned to eliminate a single point of failure. The issue was hot because the PDCe was not processing authentication requests after a reboot.
The customer had noticed lsass.exe consuming a lot of CPU and this is where CTS were focusing their efforts.
The Investigation
Starting with the Directory Service event logs, I noticed the following:
Event Type: Information
Event Source: NTDS Replication
Event Category: Replication
Event ID: 1555
Date: <Date>
Time: <Time>
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: <Name of PDCe>
Description:
The local domain controller will not be advertised by the domain controller locator service as an available domain controller until it has completed an initial synchronization of each writeable directory partition that it holds. At this time, these initial synchronizations have not been completed.
The synchronizations will continue.
also:
Event Type: Warning
Event Source: NTDS Replication
Event Category: Replication
Event ID: 2094
Date: <Date>
Time: <Time>
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: <Name of PDCe>
Description:
Performance warning: replication was delayed while applying changes to the following object. If this message occurs frequently, it indicates that the replication is occurring slowly and that the server may have difficulty keeping up with changes.
Object DN: CN=<ClientName>,OU=Workstations,OU=Machine Accounts,DC=<Domain Name>,DC=com
Object GUID: <GUID>
Partition DN: DC=<Domain Name>,DC=com
Server: <_msdcs DNS record of replication partner>
Elapsed Time (secs): 440
User Action
A common reason for seeing this delay is that this object is especially large, either in the size of its values, or in the number of values. You should first consider whether the application can be changed to reduce the amount of data stored on the object, or the number of values. If this is a large group or distribution list, you might consider raising the forest version to Windows Server 2003, since this will enable replication to work more efficiently. You should evaluate whether the server platform provides sufficient performance in terms of memory and processing power. Finally, you may want to consider tuning the Active Directory database by moving the database and logs to separate disk partitions.
If you wish to change the warning limit, the registry key is included below. A value of zero will disable the check.
Additional Data
Warning Limit (secs): 10
Limit Registry Key: System\CurrentControlSet\Services\NTDS\Parameters\Replicator maximum wait for update object (secs)
and:
Event Type: Warning
Event Source: NTDS General
Event Category: Replication
Event ID: 1079
Date: <Date>
Time: <Time>
User: <SID>
Computer: <Name of PDCe>
Description:
Internal event: Active Directory could not allocate enough memory to process replication tasks. Replication might be affected until more memory is available.
User Action
Increase the amount of physical memory or virtual memory and restart this domain controller.
In summary, the PDCe hasn’t completed initial synchronisation after a reboot and it’s having memory allocation problems while it works on sorting it out. Initial synchronisation is discussed in:
Initial synchronization requirements for Windows 2000 Server and Windows Server 2003 operations master role holders
http://support.microsoft.com/kb/305476
With this information in hand, I had a chat with the customer hoping we’d identify a relevant change in the environment leading up to the outage. It became apparent they’d configured a policy for deploying RDP session certificates. Furthermore, they’d noticed clients receiving many of these certificates instead of the expected one.
RDP session certificates are Secure Sockets Layer (SSL) certificates issued to Remote Desktop servers. It is also possible to deploy RDP session certificates to client operating systems such as Windows Vista and Windows 7. More on this later…
The customer and I examined a sample client and found 285 certificates! In addition to this unusual behaviour, the certificates were being published to Active Directory. There were 3700 affected clients – approx. 1 million certificates published to AD!
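For the curious, the "approx. 1 million" figure is simple back-of-the-envelope arithmetic. The sketch below assumes every affected client held roughly as many certificates as the one sample we examined, which is an assumption rather than something we measured.

```python
# Figures from the investigation: one sampled client and the affected client count.
certs_per_client = 285     # certificates found on the sampled client
affected_clients = 3700    # clients reported as affected

# Assumption: the sampled client is representative of all affected clients.
estimated_total = certs_per_client * affected_clients

print(f"{estimated_total:,} certificates")  # 1,054,500 certificates
```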
The Story So Far
We’ve injected huge amounts of certificate data into the userCertificate attribute of computer objects, we’ve got a replication backlog due to memory allocation issues, and the DC won’t advertise itself as a DC until it completes an initial sync that it is struggling to finish.
What Happened Next Uncle Mark?!
The CTS engineer back at home base wanted to gather some debug logging of LSASS.exe. While attempting to gather such a log, the PDCe became completely unresponsive and we had to reboot.
While the PDCe rebooted, the customer disabled the policy responsible for deploying RDP session certificates.
After the reboot, the PDCe had stopped logging event 1079 (for memory allocation failures) but in addition to event 1555 and 2094, we were now seeing:
Event Type: Warning
Event Source: NTDS Replication
Event Category: DS RPC Client
Event ID: 1188
Date: <Date>
Time: <Time>
User: NT AUTHORITY\ANONYMOUS LOGON
Computer: <Name of PDCe>
Description:
A thread in Active Directory is waiting for the completion of a RPC made to the following domain controller.
Domain controller:
<_msdcs DNS record of replication partner>
Operation:
get changes
Thread ID:
<Thread ID>
Timeout period (minutes):
5
Active Directory has attempted to cancel the call and recover this thread.
User Action
If this condition continues, restart the domain controller.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
A bit more investigation with:
Repadmin.exe /showreps (or /showrepl for later versions of repadmin)
told us that all partitions were in sync except the domain partition – the partition with a million certificates attached to computer objects.
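If you have more than a handful of partitions and partners to eyeball, later versions of repadmin can emit CSV (repadmin /showrepl * /csv), and a few lines of script will pick out the partition that is behind. This is a rough sketch, not what we ran on the day. The column headings are what I would expect from that CSV output, so check them against your own. The sample rows, server names and domain name are entirely made up.

```python
import csv
import io


def lagging_partitions(showrepl_csv: str) -> dict[str, int]:
    """Return {naming context: total failure count} for links reporting failures."""
    failures: dict[str, int] = {}
    for row in csv.DictReader(io.StringIO(showrepl_csv)):
        count = int(row["Number of Failures"] or 0)
        if count > 0:
            nc = row["Naming Context"]
            failures[nc] = failures.get(nc, 0) + count
    return failures


# Hypothetical sample data for illustration only.
SAMPLE = """\
showrepl_COLUMNS,Destination DSA Site,Destination DSA,Naming Context,Source DSA Site,Source DSA,Transport Type,Number of Failures,Last Failure Time,Last Success Time,Last Failure Status
showrepl_INFO,Site1,PDCE01,"DC=example,DC=com",Site1,DC02,RPC,12,2011-01-01 10:00:00,2010-12-31 09:00:00,1722
showrepl_INFO,Site1,PDCE01,"CN=Configuration,DC=example,DC=com",Site1,DC02,RPC,0,0,2011-01-01 10:05:00,0
"""

print(lagging_partitions(SAMPLE))
```

In this made-up sample only the domain partition is reported, which mirrors what we saw on the PDCe.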
We decided to execute:
Repadmin.exe /replicate <Name of PDCe> <Closest Replication Partner> <Domain Naming Context> /force
Next, we waited … for several hours.
While waiting, we considered:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters]
Repl Perform Initial Synchronizations = 0
http://support.microsoft.com/default.aspx?scid=kb;EN-US;830746
Both of these changes require a reboot. The customer was hesitant to reboot again and while they thought it over, initial sync completed.
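For reference, the registry value mentioned above could be captured as a .reg file so it is ready to import if the decision is made. The sketch below only builds the file text and does not touch the registry. The REG_DWORD type is my understanding of the KB guidance, and the function name is just illustrative.

```python
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters"
VALUE_NAME = "Repl Perform Initial Synchronizations"


def build_reg_file(value: int = 0) -> str:
    """Return .reg file text that sets the value as a DWORD (assumed type)."""
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{KEY}]\n"
        f'"{VALUE_NAME}"=dword:{value:08x}\n'
    )


print(build_reg_file(0))
```

Remember that the change only takes effect after a reboot, and that skipping initial synchronisation is something to undo once the crisis has passed.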
With the PDCe authenticating clients, I headed home to get some sleep. The customer had disabled the RDP session certificate deployment policy and was busy clearing the certificate data out of computer objects in Active Directory.
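I didn't record exactly how the customer cleared the data, so the following is only a sketch of one possible approach: generate an LDIF file that deletes the userCertificate attribute from a list of computer objects, then import it with your LDAP tool of choice. The function name and the sample DN are hypothetical, and the sketch assumes plain ASCII DNs (anything else needs base64 encoding in LDIF).

```python
def clear_user_certificate_ldif(computer_dns: list[str]) -> str:
    """Build LDIF modify records that remove every userCertificate value."""
    records = []
    for dn in computer_dns:
        records.append(
            f"dn: {dn}\n"
            "changetype: modify\n"
            "delete: userCertificate\n"
            "-\n"
        )
    # A blank line separates one LDIF record from the next.
    return "\n".join(records)


# Hypothetical DN for illustration only.
print(clear_user_certificate_ldif(["CN=PC1,OU=Workstations,DC=example,DC=com"]))
```

Keep in mind that a bulk cleanup like this generates its own replication traffic, so test it on a small batch first.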
Why?
The next day, I went looking for root cause. The customer had followed some guidance to deploy the RDP session certificates. Some of the guidance noted during the investigation is posted here:
http://blogs.msdn.com/b/rds/archive/2010/04/09/configuring-remote-desktop-certificates.aspx ...
I set up a test environment and walked through the guidance. After doing so, I did not experience the issue. I was getting a single certificate no matter how often I rebooted or applied Group Policy. In addition, RDP session certificates were not being published in Active Directory. Publishing in Active Directory is easily explained by this checkbox:
An examination of the certificate template confirmed they had this checked.
So why were clients in the customer environment receiving multiple certificates while clients in my test environment received just one?
The Win
I noticed the following point in the guidance being followed by the customer:
A bit of an odd recommendation. Sure enough, the customer’s template had different names for “Template display name” and “Template name”. I changed my test environment to make the same mistake and suddenly I had a repro – a new certificate on every reboot and policy refresh.
Some research revealed that this was a known issue. One of these fields checks whether an RDP session certificate exists while the other field obtains a new certificate. Giving both fields the same name works around the problem.
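To see why the mismatch snowballs, here is a toy model of the behaviour. It is not the actual Windows logic. Which field does the checking and which does the requesting is an arbitrary assumption on my part, and the class and template names are invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    name: str           # "Template name"
    display_name: str   # "Template display name"


@dataclass
class ToyClient:
    # Label recorded against each certificate the client has been issued.
    store: list[str] = field(default_factory=list)

    def refresh(self, template: Template) -> None:
        # Assumption: the existence check looks for one field...
        if template.display_name in self.store:
            return
        # ...while the new certificate is requested and recorded under the other.
        self.store.append(template.name)


def simulate(template: Template, refreshes: int) -> int:
    """Return how many certificates a client holds after N reboots or policy refreshes."""
    client = ToyClient()
    for _ in range(refreshes):
        client.refresh(template)
    return len(client.store)


print(simulate(Template("RDPCert", "RDPCert"), 5))          # 1
print(simulate(Template("RDPCert", "RDP Certificate"), 5))  # 5
```

With matching names the check finds the existing certificate and stops. With different names it never does, so every refresh adds another one.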
Conclusion
So in the aftermath of this incident, there are some general recommendations that anyone can follow to help avoid this kind of situation.
- Mark “Falkor” Renoden