When dealing with server hangs, it is important to distinguish between what we call a hard hang and a soft hang. This often helps us diagnose the basic problem from what we can and cannot do on the machine. For example, if we are unable to ping the server, toggle NumLock or Caps Lock from the keyboard, or get any mouse cursor response, then we are most likely dealing with a hard hang. These issues are generally hardware related (possibly driver related), and are seldom due to a Windows OS configuration issue or a memory leak. In a hard hang, the system is hung at a very low level in the kernel and is no longer processing threads. The first step is to contact your hardware vendor to run a diagnostic on the system. Unless you have a specific reason to suspect certain hardware (for example, recently installed RAM), I wouldn't recommend randomly pulling out or replacing hardware.
Turning our attention to soft hangs: when a machine is in a soft hang state it is mostly unresponsive, but the kernel is still functional at a very low level. For example, the ping test or toggling NumLock will work fine. In a soft hang state, you may not be able to log on to the machine either locally or via Terminal Services, or you may see a blank desktop; however, network and printer shares may still be accessible. These symptoms are more typical of memory depletion or a process deadlock.
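The hard versus soft distinction above can be written down as a small triage helper. This is only a sketch of the rule of thumb described in this post; the `HangObservation` type and `classify_hang` function are hypothetical names introduced for illustration, and the three checks are made by hand at the console or from another machine.

```python
from dataclasses import dataclass


@dataclass
class HangObservation:
    """What we could and could not do on the hung machine (checked manually)."""
    responds_to_ping: bool
    keyboard_leds_toggle: bool   # NumLock / Caps Lock still toggles
    mouse_cursor_moves: bool


def classify_hang(obs: HangObservation) -> str:
    """Apply the rule of thumb from the post.

    If nothing responds at all, the kernel is most likely no longer
    processing threads ("hard"). If any low-level sign of life remains,
    treat it as a "soft" hang.
    """
    if not (obs.responds_to_ping or obs.keyboard_leds_toggle or obs.mouse_cursor_moves):
        return "hard"
    return "soft"


def suggested_first_step(kind: str) -> str:
    """Map the classification to the first step the post recommends."""
    if kind == "hard":
        return "Contact the hardware vendor to run diagnostics."
    return "Check for memory depletion or a process deadlock."
```

The result is a starting point for triage, not a diagnosis: a machine classified as "soft" here still needs the event log and performance data checked.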
A common hang issue that we see is caused by depletion of paged or non-paged pool memory (we covered the basics of Pool Resources in a previous post). When these resources are depleted, you will see events similar to the ones below in the System Event Log:
A 2019 error indicates depletion of non-paged pool memory, and a 2020 error indicates that you are out of paged pool memory. If you see either of these events logged before a hang, there is a good chance that solving the depletion issue will also resolve the hang problem. Our Platforms CPR team published a blog post last year that covered troubleshooting the 2019 and 2020 issues, so we won't rehash the same information here.
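If you have exported the System Event Log and want to check quickly whether either event was logged before the hang, a filter along these lines may help. It is a minimal sketch: the `(timestamp, event_id)` tuple format and the `find_pool_depletion` name are assumptions made for illustration, not the format of any particular export tool.

```python
from datetime import datetime

# Event IDs named in the post, mapped to the pool they report as depleted.
POOL_DEPLETION_EVENTS = {2019: "non-paged pool", 2020: "paged pool"}


def find_pool_depletion(events, hang_time):
    """Return sorted (timestamp, pool) pairs for 2019/2020 events at or before hang_time.

    `events` is an iterable of (timestamp, event_id) tuples. This shape is an
    assumption; adapt the unpacking to however you exported the log.
    """
    hits = [
        (timestamp, POOL_DEPLETION_EVENTS[event_id])
        for timestamp, event_id in events
        if event_id in POOL_DEPLETION_EVENTS and timestamp <= hang_time
    ]
    return sorted(hits)


# Made-up sample records, for illustration only.
sample_events = [
    (datetime(2008, 1, 10, 9, 15), 7036),   # unrelated event, ignored
    (datetime(2008, 1, 10, 9, 40), 2019),   # non-paged pool depleted
    (datetime(2008, 1, 10, 11, 5), 2020),   # logged after the hang, ignored
]
hang_started = datetime(2008, 1, 10, 10, 0)

for when, pool in find_pool_depletion(sample_events, hang_started):
    print(f"{when:%Y-%m-%d %H:%M}: {pool} depleted before the hang")
```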
A slightly harder issue to pinpoint is a hang caused by depletion of the System Page Table Entries (PTEs). We covered System PTEs briefly in our previous post on the /3GB switch. PTEs are structures used to track pages in RAM, similar to the way the index of a book tells you which page things are located on. PTEs tell the system in which physical page of memory data resides. Machines start with a fixed number of PTEs, and the more RAM in the system, the more PTEs are required to point to the memory pages. If a system runs out of available page table entries, it can no longer allocate memory, resulting in a hung or unresponsive system.
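To make the book index analogy concrete, here is a toy model of a fixed pool of page table entries. It is deliberately simplified and is not how the Windows memory manager is implemented; `ToyPageTable`, the 4 KB page size, and the flat dictionary are all assumptions chosen for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class ToyPageTable:
    """A fixed pool of entries mapping virtual page numbers to physical frames."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self.entries = {}  # virtual page number -> physical frame number

    def free_entries(self) -> int:
        return self.max_entries - len(self.entries)

    def map_page(self, virtual_page: int, physical_frame: int) -> None:
        # Once the pool is exhausted, no new mapping can be created.
        if virtual_page not in self.entries and self.free_entries() == 0:
            raise MemoryError("no free page table entries")
        self.entries[virtual_page] = physical_frame

    def translate(self, virtual_address: int) -> int:
        # Look up the page in the "index", then add the offset within the page.
        page, offset = divmod(virtual_address, PAGE_SIZE)
        return self.entries[page] * PAGE_SIZE + offset


table = ToyPageTable(max_entries=2)
table.map_page(0, 7)
table.map_page(1, 3)
try:
    table.map_page(2, 9)   # third mapping: the pool of two entries is exhausted
except MemoryError as err:
    print("allocation failed:", err)
```

In the toy model the failure is an exception; on a real server the equivalent condition shows up as allocations failing and the system becoming unresponsive.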
Unfortunately, when System PTEs are depleted, there are no entries indicating this in the Event Log. However, you can use Performance Monitor to monitor the Free System PTEs. There is no counter that breaks down PTE usage per process, so determining the culprit of PTE depletion using Performance Monitor alone is not always feasible. You may be able to correlate a process's continually rising handle count (a handle leak) with PTE depletion; however, unless there is an obvious culprit, a memory dump or live debug will be required.
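If you log the Free System PTEs counter to CSV, a short script can flag low values and spot a handle count that only ever rises. The sketch below assumes a typeperf-style CSV layout (timestamp in the first column, counter value in the second). The warning and critical thresholds are assumptions rather than figures from this post, and `is_steadily_rising` is a hypothetical helper for the handle count correlation described above.

```python
import csv
import io

# Assumed alert thresholds; tune them for your own environment.
WARN_FREE_PTES = 10_000
CRIT_FREE_PTES = 5_000


def parse_counter_csv(text: str) -> list:
    """Return the numeric samples from the second column of a counter CSV.

    Header rows, blank lines, and empty samples are skipped because they
    do not parse as numbers.
    """
    samples = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue
        try:
            samples.append(float(row[1]))
        except ValueError:
            continue
    return samples


def pte_status(free_ptes: float) -> str:
    """Classify one Free System PTEs sample against the assumed thresholds."""
    if free_ptes < CRIT_FREE_PTES:
        return "critical"
    if free_ptes < WARN_FREE_PTES:
        return "warning"
    return "ok"


def is_steadily_rising(counts, min_samples: int = 3) -> bool:
    """True if the samples never decrease and end higher than they started."""
    if len(counts) < min_samples:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(counts, counts[1:]))
    return never_drops and counts[-1] > counts[0]
```

One way to produce such a file is `typeperf "\Memory\Free System Page Table Entries" -si 5 -sc 12`, redirected to a CSV. A rising handle count alongside falling free PTEs is only a hint; as noted above, a memory dump or live debug is usually needed to confirm the culprit.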
So to recap, here are the basic steps to follow regarding a complete system hang:
That brings us to the end of this post on Server Hangs. We will go over the basic debugging of a server hang in a future post, so stay tuned.