Following on from our post on troubleshooting a basic application crash, it's time to look at what to do when your entire server hangs. Most of us have run into situations where the server is so unresponsive that we cannot access Task Manager, or even reach the network shares on the server. Of course, it always seems to be the mission-critical servers that experience these issues, which means that the IT Administrator responsible for the server is working in panic mode.
When dealing with server hangs, it is important to distinguish between what we call a hard hang and a soft hang. This distinction often helps us narrow down the basic problem based on what we can and cannot do on the machine. For example, if we are unable to ping the server, toggle the NumLock or Caps Lock functions via the keyboard, or get any sort of mouse cursor responsiveness, then we are most likely dealing with a hard hang. These issues are generally hardware related (possibly driver related), but seldom due to a Windows OS configuration issue or memory leak. In the case of a hard hang, the system is hung at a very low level in the kernel and is no longer processing threads. The first step here is to contact your hardware vendor to run a diagnostic on the system. Unless you have a specific reason to suspect certain hardware (for example, recently installed RAM), I wouldn't recommend randomly pulling out or replacing hardware.
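The hard-versus-soft triage above can be summarized as a simple decision rule. Here is a minimal sketch; the function and its symptom parameters are hypothetical, purely for illustration:

```python
def classify_hang(ping_replies: bool, numlock_toggles: bool,
                  can_log_on: bool, shares_accessible: bool) -> str:
    """Rough triage of a hung server based on low-level responsiveness.

    If even ping and the keyboard LEDs are dead, the kernel has stopped
    processing threads -> hard hang (suspect hardware or drivers).
    If those still work but logon or the desktop is dead -> soft hang
    (suspect memory depletion or a deadlock).
    """
    if not ping_replies and not numlock_toggles:
        return "hard hang: contact hardware vendor for diagnostics"
    if not can_log_on or not shares_accessible:
        return "soft hang: suspect resource depletion or deadlock"
    return "responsive: not a full system hang"

# A box that ignores ping and NumLock is classified as a hard hang:
print(classify_hang(False, False, False, False))
```

In practice you would run these checks by hand (ping from another machine, tap NumLock at the console); the code just captures the order in which to interpret the results.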
Turning our attention to soft hangs: when a machine is in a soft hang state it is mostly unresponsive, but the kernel is still functional at a very low level - for example, a ping test or toggling NumLock will work fine. In a soft hang state, you may not be able to log on to the machine either locally or via Terminal Services, or you may see a blank desktop - however, network and printer shares may still be accessible. These are the typical symptoms we see during memory depletion or a process deadlock.
A common hang issue that we see is caused by depletion of paged or non-paged pool memory (we covered the basics of Pool Resources in a previous post). When these resources are depleted, you will see Event ID 2019 or 2020 logged in the System Event Log.
A 2019 error indicates depletion of non-paged pool memory, and a 2020 error indicates that you are out of paged pool memory. If you see either of these events logged before a hang, there is a good chance that solving the depletion issue will also resolve the hang. Our Platforms CPR team published a blog post last year covering troubleshooting the 2019 & 2020 issues, so we won't rehash the same information here.
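If you have many logs to check, filtering for these two events can be scripted. The sketch below scans a CSV export of the System log (Event Viewer and wevtutil can both export logs); the exact column layout and the sample messages are assumptions for illustration:

```python
import csv
import io

# Hypothetical CSV export of the System event log: source, event ID, message.
SAMPLE_LOG = """Source,EventID,Message
Srv,2019,The server was unable to allocate from the system nonpaged pool
Tcpip,4226,TCP/IP has reached the security limit
Srv,2020,The server was unable to allocate from the system paged pool
"""

POOL_EVENTS = {2019: "non-paged pool depleted", 2020: "paged pool depleted"}

def find_pool_depletion(csv_text: str):
    """Return (event_id, diagnosis) for each Srv 2019/2020 event."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        event_id = int(row["EventID"])
        if row["Source"] == "Srv" and event_id in POOL_EVENTS:
            hits.append((event_id, POOL_EVENTS[event_id]))
    return hits

print(find_pool_depletion(SAMPLE_LOG))
# [(2019, 'non-paged pool depleted'), (2020, 'paged pool depleted')]
```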
A slightly harder issue to pinpoint is a hang caused by depletion of System Page Table Entries (PTEs). We covered System PTEs briefly in our previous post on the /3GB switch. PTEs are structures used to track pages in RAM, much as the index of a book tells you the page on which things are located: a PTE tells the system in which physical page of memory data resides. Machines start with a fixed number of PTEs - the more RAM in the system, the more PTEs are required to point to the memory pages. If a system runs out of available page table entries, it can no longer allocate memory, resulting in a hung or unresponsive system.
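To get a feel for the scale involved, here is the back-of-the-envelope arithmetic, assuming the standard 4 KB x86 page size (the helper function itself is just illustrative):

```python
# Each PTE maps one physical page; x86 pages are 4 KB (assumed here).
PAGE_SIZE = 4 * 1024          # bytes per page

def pages_needed(ram_bytes: int) -> int:
    """Number of physical pages - and hence PTEs to map them all."""
    return ram_bytes // PAGE_SIZE

ram = 4 * 1024**3             # 4 GB of RAM
print(pages_needed(ram))      # 1048576 pages -> roughly a million PTEs
```

This is why adding RAM increases PTE pressure: every additional page of physical memory needs an entry pointing at it.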
Unfortunately, when System PTEs are depleted, there are no entries in the Event Log indicating this. However, you can use Performance Monitor to watch the Free System PTEs counter. There is no counter that breaks down PTE usage per process, so pinpointing the culprit of PTE depletion using Performance Monitor alone is not always feasible. You may be able to correlate a continually rising handle count in a process (a handle leak) with PTE depletion; however, unless there is an obvious culprit, a memory dump or live debug will be required.
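The correlation idea above can be roughed out in code. This is a crude sketch over hypothetical Performance Monitor samples - real counter data is far noisier, so treat the strictly-monotonic check as illustrative only:

```python
def is_leak_suspect(handle_samples, free_pte_samples, min_points=3):
    """Flag a process whose handle count climbs while Free System PTEs fall.

    A crude heuristic over counter samples taken at a fixed interval:
    handles strictly rising and free PTEs strictly falling across the
    whole window. Real data needs smoothing; this is a sketch.
    """
    if len(handle_samples) < min_points:
        return False
    rising = all(b > a for a, b in zip(handle_samples, handle_samples[1:]))
    falling = all(b < a for a, b in zip(free_pte_samples, free_pte_samples[1:]))
    return rising and falling

handles = [1200, 4800, 9300, 15100]        # hypothetical leaking process
free_ptes = [180000, 90000, 40000, 12000]  # Free System PTEs dropping
print(is_leak_suspect(handles, free_ptes))
```

If no single process stands out this way, that is exactly the case where a memory dump or live debug becomes necessary.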
So to recap, here are the basic steps to follow regarding a complete system hang:
Is this a hard or soft hang? If this is a hard hang, then the odds are that there is an underlying hardware issue, so contact your hardware vendor.
Check the System Event Log for any events logged at the time of the hang. In the case of pool depletion, you will see Event IDs 2019 or 2020 with Srv as the event source.
Launch Performance Monitor and check the starting value for Free System PTEs under the Memory object. If a system boots up with fewer Free System PTEs than normal (around 15,000 or fewer), that is not a good sign - it means PTEs are being consumed at startup, leaving fewer resources available for normal server operations.
Set up a Performance Monitor log and let it run for a while. At the very minimum, add the counters for Memory, Process, Processor and System. The length of time that you need to let it run will depend on how long the system takes to hang (assuming that this is happening repeatedly). Set the interval so that you can capture at least a hundred samples over the life of the log. Any low memory condition should be readily apparent - especially if it is a steady leak.
Finally, follow the steps in KB Article 244139 to prepare the system to capture a complete memory dump for analysis if needed.
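For the Performance Monitor step above, choosing the sample interval is simple division: the expected time until the hang spread over the number of samples you want. A small helper (hypothetical; the 15-minute cap is an illustrative choice, not a documented rule):

```python
def sample_interval_seconds(expected_hang_hours: float, samples: int = 100) -> int:
    """Interval (in seconds) yielding at least `samples` data points
    before the expected hang, capped at 15 minutes so very slow leaks
    still produce a usable number of points."""
    interval = int(expected_hang_hours * 3600 // samples)
    return max(1, min(interval, 900))

# A server that hangs roughly every 8 hours: sample about every 5 minutes.
print(sample_interval_seconds(8))
```

With a steady leak, that density of samples is usually enough to make the downward trend obvious in the resulting log.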
That brings us to the end of this post on Server Hangs. We will go over the basic debugging of a server hang in a future post, so stay tuned.