First published on TechNet on Aug 13, 2012
Hey y’all, Mark here again with an interesting “Real World” XPerf trace from the depths of a WDRAP (Windows Desktop Risk Assessment). If your infrastructure hasn’t had a WDRAP yet, or it’s been a while, work with your Technical Account Manager to see if you can schedule one. I highly recommend it as a great way to expose and diagnose issues that might be impacting 10s, 100s or even 10,000s of users.
It was during a WDRAP that we came across this interesting slow boot/slow logon situation and analyzed it with an XPerf trace. The target system’s overall boot time wasn’t terrible - about 2 minutes and 5 seconds isn’t too bad for a typical corporate environment, but we see a large chunk of the trace, almost a full minute - 60 long seconds, was being taken up during the Winlogon phase. I was hopeful we could improve or at least identify what was eating up that looooong 60 second time slice.
To highlight the time range through all the other frames in the trace, we lined everything up within XPerf so it would be easier on the eyes:
· Select an area on the graph
· Right-click and choose “Clone Selection”
We touched on the boot phases in our other XPerf posts - you check them out here. We know that Group Policy processing and loading of your profile occurs around this time, in addition to other processing.
Notice in the top graph above, we have 3 diamonds that say “GPClient” - what are those actually doing?
· The bottom “GPClient” - that’s the computer policy
· The middle “GPClient” - in our case here, the one lowest in the list, is the user policy
· Finally, the top “GPClient” - is if there are any Group Policy logon scripts
In the screen shot above, we see that these 3 diamonds are finishing quickly, as are most all the other diamonds in this phase.
So where is all this time going?
It is time to dig into the “Generic Events” section of the trace which is located directly under “Boot Phases” to give us a clue of what might be happening.
In Generic Events, right-click the area we have selected, and click “Summary table.”
· This will open up the “Generic Events Summary Table” as shown below
· The area we are interested in here is “Microsoft-Windows-Winlogon”, so expand that
I ordered the columns as follows:
· The first column is the “Provider Name” - we are interested in Winlogon
· Second column is “Time”
We know pretty close when the issue is occurring.
· Next, we have “Process” and then “Thread ID”
· Finally, “Task Name”
Let’s jump to where our problem seems to be occurring – you can see from the screen shot there is a delay right after 16-17 seconds – where the time jumps to 70 seconds.
If Barney Stinson from “How I Met Your Mother” was doing this WDRAP he’d say, “Log wait for it…..on!” At about 17 seconds, the system is waiting for credentials. At 70 seconds, the system finally receives them and starts the logon process. Case closed. Or is it?
The customer here was using a McAfee Endpoint Encryption product that would kick-in after the user enters his/her logon credentials. It would start decrypting the drive, then pass the credentials over to Windows and log the user in to the Desktop.
As a test, we temporarily disabled this and used the autologon registry keys in Windows to see if we could repo the issue. We did not experience a delay – we concluded that it had to be something with how these credentials were being handled. It was time to dig even deeper.
We zoomed into this time frame on all graphs by right clicking anywhere on the selection and clicking “Zoom to Selection.”
Something that jumped out right away was the DPC CPU Usage Graph (below):
Anything over 15% should be investigated.
I right-clicked the “DPC CPU Usage” graph and clicked “Summary Table” but I didn’t see anything interesting in the results around McAfee.
We needed to look at other CPU areas so we went to “CPU Sampling by CPU.” If you don’t see it, you may have to add it from the fly out menu on the left. Right-click and select “Summary Table.”
It’s time to do some more column sorting. Since we know we are looking for DPC from before, we move that all the way to the left followed by the “Module” column. This Module looks suspiciously like a file that is part of McAfee Endpoint Encryption. We need to verify that by using one last trick – the “Image Summary Table.”
If you ever wanted to find out which version a file is, you can use the “Image Summary Table” view. To see this, you’ll need to go to “Process Lifetime,” right-click the area you have selected, and pick “Image Summary Table” as shown below.
If we sort our columns in the following order, we should be able to get some good information about this file.
The order I have is:
· Image Name
· Binary File Version
· Company Name
· Product Version
Let’s find MfeEEAlg.sys in our table and see if it all adds up:
Sure enough it does. I request the highest of fives.
The next steps for the customer were to contact McAfee through their support channel to further drill down into the issue. Our analysis and drill-down was able to provide the customer a very clear direction on where the delay was occurring. This should help to speed up the rest of the troubleshooting process with the other vendor.
I want to send special “thanks” out to fellow PFE Yong Rhee who helped with this. Maybe I’ll get him to post up here one day and really teach us all a thing or two about Xperf.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.