About exceptions and capturing them with dumps

Microsoft

Mar 25, 2021

To fix application exceptions, we have to understand them. Sometimes we need more than error messages or log entries, we need more context around them. Enter collecting memory dumps. This article is aimed for the less experienced engineers that need to collect data for analysis around exceptions.

Exceptions happen all the time in production applications hosted with IIS; they are more frequent than we think:

Worst exceptions are crashing the IIS worker process – w3wp.exe – that is executing our web app; but because WAS would restart the process with a warning in Windows events, we quickly end up with a new w3wp.exe (new PID) and potentially degraded user experience (like sessions lost).
Some exceptions are causing the infamous HTTP response code or pages “500, Server-side processing error”; mostly when the app code is not handling them, and they fall to be further handled by the Asp.Net framework.
And some exceptions may even go unnoticed, but contributing to less-than-optimal performance, because the system is trying to recover after them, spending CPU cycles.

So how are things happening with exceptions? How do they cause the above behaviors? And how can we study them?

The exception handling in Windows

Every process has at least one code execution thread; often, more. Exceptions happen in process threads. As the thread is executing code, some unexpected situations happen. That occurrence is treated by Windows much like a hardware interrupt: code execution is suspended, and the exception is first reported, then recovery is attempted.

When a process is started, Windows creates associated ports for exceptions in that process: The Debug port, the Exceptions port. When an exception occurs, a few steps are taken in the attempt for recovery:

First chance

The exception is first reported to the Debug port. A debugger may be attached to that port of the process. The debugger has a first chance to look and possibly handle the exception. At this stage, we call it a First-chance exception.

Stack unwinding

If there is no debugger attached, Windows is inspecting the call stack – the succession of function calls – of the thread where the exception occurred. The aim is to find an exception handler – think of it as the try-catch C# construct – able to treat that exception. If the function at the top of the stack, the last one called, does not have the handler, we try with the previous one – N-1. If N-1 does not have the exception handler either, we try with N-2... and so on, until we find a handler for the exception. This process is called the “stack unwinding” and will consume CPU cycles.

Second chance

If the entire thread’s call stack was “un-winded” and we still haven’t found a handler for our exception, the exception will be reported by Windows in the Exception port. A debugger may be attached to that port. Now, the debugger would have its second chance to handle the exception. At this stage, we call it a Second-chance exception.

Process killed

If a debugger is not treating the now-called second-chance exception, the unexpected event is reported in the Exception port and then... Well, the operating system is handling the exception in its way: since system stability has to be kept and the exception has the potential to break other things, like causing data loss, Windows is terminating the process. The entire virtual memory of the process is evicted, and the process is killed. Traces of the event may be left in Windows events, Application log.

Will it crash or not

Because Windows is looking for an exception handler, we never know for sure if a first-chance exception will break the process or not. With Asp.Net applications, this uncertainty increases. Remember that the app runs on top of the Asp.Net framework. Even if the application’s code is not handling the exception, the framework may handle it. How? Well, if the exception happens in the context of processing an HTTP request, then Asp.Net would handle the exception by wrapping it with an error page, or HTTP error status code: “500, Server-side processing error”.

In fact, with Asp.Net, most of the exceptions will be handled by the framework. And so, these exceptions will not crash the executing process. Which makes sense: one of the primary goals of the framework is to continue serve requests, even if some of them fall victims due to exceptions and some users/clients may get a degraded experience.

If the exception happens in a non-request-executing context – like at the application startup or in the finalizer thread – then it has higher chances to reach the stage of second-chance exception, causing the crash of the process. We almost consider them synonyms: second-chance exception = process crash, abnormal termination.

A simplified view for exception handling and capturing

Options to study exceptions

The Asp.Net framework or the .NET runtime, when handling the exceptions, may report these occurrences in Windows events, in Application log. The events may even contain some context, like exception name, the HTTP request being executed, or the call stack of the faulting thread. What if this doesn’t happen? What if we don’t get these traces, or we need even more context about the exceptions? Well, meet the debuggers...

Remember that a debugger may be attached to the Debug port of a process? And that all exceptions are first reported on that port? Well, with a debugger attached, we can take actions on exceptions.

We won’t attach debuggers like Visual Studio one or “real”, “heavy” debuggers in production; frequently, we don’t have that luxury of installing or executing such tools. But there are simple debuggers able to take simple but powerful actions upon exception occurrences.

With a “simple” debugger we can watch all process exceptions, like displaying them on the console. This is the least invasive action; we could share these with the developers, to let them know what happens with their app when in production.

We can log the call stack of the faulting thread, for more context. Or we can tell the debugger to collect the full memory dump of the process at the precise moment when the exception occurs – and we get call stack, object heaps, down to the very processor registry values at that precise moment. With a memory dump, we get a lot of context. We can analyze the memory dump post-mortem to try understanding what the conditions were causing an exception.

Common tools to capture exceptions and context

When troubleshooting, when we realize we need to know and study an exception, we regularly use SysInternal’s ProcDump or Debug Diagnostics. These free tools from Microsoft act like debuggers. We attach them to processes, then take actions like above upon exception occurrence: record call stacks, trigger memory dump collection, etc.

These tools are a must when we investigate w3wp.exe process crashes – we can’t manually collect a memory dump on a second-chance exception / crash. Both ProcDump and DebugDiag can collect the memory dump just before the process and its memory get evicted due to a crash.

While DebugDiag offers a UI and a bit more flexibility, ProcDump is very small and requires no installation. ProcDump is the only option in Windows installations where we don’t have an UI. DebugDiag was created with IIS troubleshooting as a goal, so it has quite a few features helping on that. Oh, and it comes with the functionality of analyzing memory dumps for common issues; for Asp.Net apps based on the “full” .NET framework, it gets the enough actionable info in most cases.

How to: steps for collecting dumps

I'll place a few illustrated guides, steps on how to use ProcDump and/or DebugDiag to collect memory dumps:

Memory dumps at process termination caused by second-chance exception
Simply have collect the committed memory of a process just before it gets evicted by the OS on killing the process.
Collect call-stack traces for first-chance exceptions occurring in an app
We would do that when we have several annoying exceptions and would need just a bit more context around them; we don’t really want a full “fat” dump for each of the first-chance exceptions.
Collect memory dump(s) for a first-chance exception, when it occurs
When the process is not actually crashing, ever; but we keep getting HTTP responses with “500” status, for same exception, and we need to study the deep context.
Memory dumps at process termination caused by second-chance exception, with optional first-chance dump too
Because we may have some first-chance exceptions building up the “right” context for a second-chance exception to occur.

In addition to collecting memory dumps to study exceptions, some performance-related issues can be studied with dump analysis too. Dumps are very useful when we see memory leaks, when requests are way too slow or the process is downright hanged – requests are piling up in the processing pipeline, and they stay there without a response being generated.
I'll reference here the illustrated procedures for:

Dump series for hang and performance issues, using a "automated" approach
Have DebugDiag use ETW to monitor HTTP requests; when some are slow, trigger memory dump collection
Dump series for hang and performance issues, using a "manual" approach
Steps to collect memory dumps wen we can witness the slowness, or we can easily reproduce it
Collect memory dumps on high-CPU situations
With DebugDiag or ProcDump we can trigger dump collection on a process when the CPU consumption goes high
Collect series of memory dumps to study memory leaks
Have DebugDiag watching the memory size of a process, collect dumps at certain thresholds