Understanding Lock Contention in Windows Performance Analyzer (WPA)


Hi everyone, this is Will Aftring with the Windows Debug team. I was debugging an application performance issue and thought “this is a great example of lock contention”.

 

For this post we will be using the Windows Performance Analyzer (WPA) to review data collected with the Windows Performance Recorder (WPR). To keep this post focused, I won’t go in depth on WPR, but there are plenty of resources on how to get started: Getting Started Windows Performance Recorder | Microsoft Docs

 

Let’s start with some vocabulary regarding thread states.

  • Schedule a thread: To schedule a thread is to ensure that it is the next thread to run on the CPU
  • Preempt a thread: To remove a thread from a processor before it has completed its work or yielded for another thread to run.
  • Ready: This thread is ready to run and is under consideration to be the next to hop on the CPU.
  • Deferred: This thread is waiting to run on a specific processor but hasn’t been scheduled yet.
  • Standby: This thread has been selected to run next on a specific processor. When a specific condition is met, the dispatcher switches in the thread.
    • Important: Only one thread can be in standby state for each processor on the system
  • Running: This thread is currently on the CPU performing work.
  • Waiting: A thread that is waiting for a specific condition to be met. This can be either voluntary (e.g. WaitForSingleObject, as in the short sketch after this list) or involuntary (waiting for memory to be paged in). When the condition is met, the thread is moved back into a ready state.
    • It is important to keep in mind that Windows does not follow a FIFO model for waiting threads.
  • Transition: A thread enters this state if it’s ready for execution, but its kernel stack is paged out of memory. Once its kernel stack is brought back into memory, the thread enters a ready state.
  • Terminated: When a thread finishes executing or is told to terminate (TerminateThread), it enters this state.
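To make the voluntary wait concrete, here is a minimal, illustrative sketch (the event and thread names are hypothetical, not from the post): a thread enters the waiting state inside WaitForSingleObject and is readied when another thread signals the event.

```c
#include <windows.h>

HANDLE g_hEvent;   // auto-reset event used for the voluntary wait

DWORD WINAPI WaiterThread(LPVOID lpParam)
{
    // The thread moves to the waiting state here until the event is signaled.
    WaitForSingleObject(g_hEvent, INFINITE);
    // Once signaled, the thread is readied and eventually runs again.
    return 0;
}

int main(void)
{
    g_hEvent = CreateEventW(NULL, FALSE, FALSE, NULL);
    HANDLE hThread = CreateThread(NULL, 0, WaiterThread, NULL, 0, NULL);

    Sleep(1000);            // the waiter sits in the waiting state during this pause
    SetEvent(g_hEvent);     // signaling the event readies the waiting thread

    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);
    CloseHandle(g_hEvent);
    return 0;
}
```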

Here is a flow diagram that covers the thread interaction:

 

[Figure: thread state transition flow diagram]

 

Now that we have laid our groundwork for thread states, let’s jump into some analysis.

Below I have the CPU Usage (Precise) view open.

[Figure: the CPU Usage (Precise) view in WPA]

 

 

Let’s clarify what these columns mean.

  • New Process: The process being readied.
  • New Thread Id: The thread id within the process specified in New Process.
  • New Thread Stack: The call stack that is running during a specific time.
  • Ready Thread Stack: The call stack that readied the new thread stack.
  • Readying Process: The process owning the ready thread stack.
  • Readying Thread Id: The thread that owns the ready thread stack.
  • Yellow Bar: Indicates a pivot point in the data; moving columns across this bar will change how the information is displayed.
  • Count: Number of calls to this function.
  • CPU Usage (sum): The cumulative time the specified thread / process spent on the CPU before it was switched back out.
  • Ready time (sum): The cumulative time the specified thread / process spent in a ready state.
  • Waits time (sum): The cumulative time the specified thread / process spent in a waiting state.
  • Ready time (max): The maximum single ready time seen for the specified thread / process.
  • Wait time (max): The maximum single wait time seen for the specified thread / process.

 

Understanding waits with an example program

I know I just threw two pages of definitions at you so let’s put this new info into practice with an example.

[Figure: source code of the example program]

 

This program does the following:

  • main
    • Initializes a Critical Section Object
    • Creates 200 threads that will all run the CritThread function
    • Wait for user input
  • CritThread
    • Attempt to acquire a critical section
    • Print to console
    • Leave critical section

While any of the threads running the CritThread function owns the critical section, none of the other threads in this application can enter that critical section.

If we have 5 threads (1,2,3,4,5) and thread 1 calls EnterCriticalSection on g_CS, threads 2-5 will wait in EnterCriticalSection until thread 1 has called LeaveCriticalSection. You can think about this like driving a car: only one person can drive at a time.
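Here is a minimal sketch of what a program like the one shown above might look like (g_CS, CritThread, and the 200-thread count come from the description; the rest is illustrative and may differ from the author's actual source):

```c
#include <windows.h>
#include <stdio.h>

#define THREAD_COUNT 200

CRITICAL_SECTION g_CS;

// Every thread funnels through the same critical section.
DWORD WINAPI CritThread(LPVOID lpParam)
{
    EnterCriticalSection(&g_CS);   // waits here while another thread owns g_CS
    printf("Thread %lu acquired the critical section\n", GetCurrentThreadId());
    LeaveCriticalSection(&g_CS);   // releasing readies one of the waiting threads
    return 0;
}

int main(void)
{
    HANDLE hThreads[THREAD_COUNT];

    InitializeCriticalSection(&g_CS);

    // Create 200 threads that all contend for g_CS.
    for (int i = 0; i < THREAD_COUNT; i++)
        hThreads[i] = CreateThread(NULL, 0, CritThread, NULL, 0, NULL);

    // Wait for user input; for brevity this sketch does not join the workers.
    getchar();

    for (int i = 0; i < THREAD_COUNT; i++)
        CloseHandle(hThreads[i]);
    DeleteCriticalSection(&g_CS);
    return 0;
}
```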

 

Now that we understand how this interaction is supposed to work, what does it look like in WPA?

 

As we follow the cumulative wait times column, we can see that starting in the highlighted frame below, we have a big drop off in the wait time.

[Figure: CPU Usage (Precise) view with the frame showing the drop in cumulative wait time highlighted]

 

Here is a zoomed-in view of the same highlighted frame above.

[Figure: zoomed-in view of the highlighted frame, showing thread 5652]

 

We can see thread 5652 is waiting for a critical section, but what does that mean in the context of the columns?

 

  • Count 110: This function was called 110 times
  • CPU Usage 0.791: This function spent 0.791 ms on the CPU before the thread was switched out
  • Ready time (sum) 852.600: This function spent a cumulative 852.600 microseconds in a ready state
  • Wait times (sum) 4,320,242: This function spent a cumulative 4,320,242 microseconds in a waiting state
  • Ready time (max) 33.700: The longest amount of time that the function spent in a ready state was 33.700 microseconds
  • Wait times (max) 183,466: The longest single wait time this function had was 183,466.800 microseconds

Let’s go even deeper. If we look at the ntdll.dll!RtlEnterCriticalSection frame, we can see that the thread stack can be expanded further to show the readying stack of the thread that readied 5652.

[Figure: New Thread Stack expanded at ntdll.dll!RtlEnterCriticalSection]

 

[Figure: Ready Thread Stack (red box) and readying thread ids (green box) for thread 5652]

In the red box, we can see the call stack of the threads that readied thread 5652.

In the green box, we can see the thread id of each thread that readied thread 5652.

This means that each of those threads, at some point, was running the call stack in the red box when it readied the new thread.

Let’s shift the view to the columns on the right side of the yellow bar; there are a few important things to keep in mind.

[Figure: column values to the right of the yellow bar for each readying thread]

 

Breaking down thread 484, in the purple box there are a total of 3 threads (red, blue, and green) that have readied thread 5652.

Looking at our columns, we can see that each of those instances has its own value for the columns. This makes sense because each of these single instances is an occurrence of the readying thread stack.

 

Thinking through the display, we can see that each of the wait times from the readying thread contributes to the cumulative wait time. Now from the wait times displayed, none of the individual waits are particularly long.

 

With none of the waits being particularly long, the cumulative effect is death by a thousand paper cuts, resulting in lock contention.

 

Important! Readying thread vs lock owner

When considering readying threads and locks, it is understandable to assume that because a readying thread released a lock, it must have owned that lock for the full duration of the wait.

 

While that may be true when there are only two threads, that is not necessarily true when you have more threads.

[Figure: two threads alternately owning the critical section and readying each other]

 

In this scenario, thread 1 will always ready thread 2 and thread 2 will always ready thread 1. If it is taking a long time to acquire the critical section, it is because the other thread must be doing something that takes a long time.

 

But as we increase the number of threads, the complexity increases along with the waiting times.

[Figure: multiple threads contending for the critical section while thread 4 waits]

 

In this scenario, thread 2 readied thread 4, but while thread 4 was waiting, the critical section was owned by:

  • Thread 1
  • Thread 3
  • Thread 2

But only thread 2 readied thread 4. Tricky right?

 

But so what?

We have been able to identify that the waits we are seeing in our application are primarily from many short waits causing the cumulative waiting time for a thread to be large.

 

What can we do about this?  

 

The good news is when dealing with lock contention we get to be creative with our solution and have lots of flexibility to alleviate the issue.

 

To be clear, the wrong solution is to throw more threads at the issue. With each thread we add that contends for an already contended lock, the waits get dramatically worse.

 

Below is a breakdown of time spent waiting for a critical section: as the number of threads waiting for the critical section increases, so does the longest individual wait.

  • 2 threads: longest individual wait 18 ms
  • 4 threads: 86 ms
  • 16 threads: 152 ms
  • 256 threads: 1658 ms

 

Clear as day, we can see that the more threads we throw at a lock under contention, the longer each individual wait will get.
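The table above reflects the author's measurements. As a rough, hypothetical sketch of how the effect could be reproduced (the thread count comes from the command line and each thread measures its own wait with QueryPerformanceCounter; none of this is the author's test code):

```c
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

CRITICAL_SECTION g_CS;
LONG64 g_maxWaitUs = 0;                       // longest single wait seen, in microseconds

// Convert performance-counter ticks to microseconds.
static LONG64 TicksToUs(LONG64 ticks)
{
    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);
    return (ticks * 1000000) / f.QuadPart;
}

DWORD WINAPI CritThread(LPVOID lpParam)
{
    LARGE_INTEGER start, end;

    QueryPerformanceCounter(&start);
    EnterCriticalSection(&g_CS);              // the wait being measured
    QueryPerformanceCounter(&end);

    LONG64 waitedUs = TicksToUs(end.QuadPart - start.QuadPart);
    if (waitedUs > g_maxWaitUs)               // safe to update here: we own g_CS
        g_maxWaitUs = waitedUs;

    Sleep(1);                                 // simulate a little work under the lock
    LeaveCriticalSection(&g_CS);
    return 0;
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 2;   // number of contending threads
    HANDLE *hThreads = malloc(n * sizeof(HANDLE));

    InitializeCriticalSection(&g_CS);
    for (int i = 0; i < n; i++)
        hThreads[i] = CreateThread(NULL, 0, CritThread, NULL, 0, NULL);

    for (int i = 0; i < n; i++) {             // wait one handle at a time to avoid
        WaitForSingleObject(hThreads[i], INFINITE);   // the 64-handle limit of
        CloseHandle(hThreads[i]);                     // WaitForMultipleObjects
    }

    printf("%d threads: longest single wait was %.1f ms\n", n, g_maxWaitUs / 1000.0);
    DeleteCriticalSection(&g_CS);
    free(hThreads);
    return 0;
}
```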

Now let’s jump into what we can do.

 

Considering lock usage

 

How am I using this lock?

A lock is typically used to protect a specific resource while some work is performed on that resource.

But the longer a resource is protected, the higher the likelihood that other threads will wait for the release of that resource.

Let’s use the following scenarios as an example:

 

  1. We lock the entire function that modifies the resource 

[Figure: scenario 1, the critical section is held for the entire function]

 

  2. We lock only the operations that modify the resource

[Figure: scenario 2, the critical section is held only around the operations that modify the resource]

 

These may look very similar, but in scenario 2 the developer is keeping a tight scope on when the critical section is held. This prevents delays in WriteEventLog from blocking other threads that want to read or write pData.
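As a hedged sketch of the difference between the two scenarios (DATA, the function names, and the WriteEventLog stub below are illustrative; only g_CS, pData, and the idea of an event-log write come from the post):

```c
#include <windows.h>
#include <stdio.h>

typedef struct { int value; } DATA;   // illustrative shared data protected by g_CS

CRITICAL_SECTION g_CS;

// Stand-in for a comparatively slow logging call.
static void WriteEventLog(int value)
{
    Sleep(50);                        // simulate slow I/O
    printf("logged value %d\n", value);
}

// Scenario 1: the lock is held for the entire function, so other threads
// wait on g_CS even while the slow event-log write is in progress.
void UpdateDataBroadScope(DATA *pData, int newValue)
{
    EnterCriticalSection(&g_CS);
    pData->value = newValue;
    WriteEventLog(pData->value);
    LeaveCriticalSection(&g_CS);
}

// Scenario 2: the lock covers only the access to the shared data; the slow
// call happens after g_CS is released, so other threads are free to proceed.
void UpdateDataTightScope(DATA *pData, int newValue)
{
    int snapshot;

    EnterCriticalSection(&g_CS);
    pData->value = newValue;
    snapshot = pData->value;          // copy what the log needs before unlocking
    LeaveCriticalSection(&g_CS);

    WriteEventLog(snapshot);
}
```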

 

The shorter the amount of time a lock is held, the less likely other threads are going to wait for that lock.

 

Does this lock meet program needs?

In all discussions so far, we have been talking about locking resources using critical sections.

 

Typically, the usage of a lock is to allow inter-thread synchronization to protect a specific resource to ensure that only one thread is operating on that resource at a time. But the question becomes, does this specific interaction need to be mutually exclusive?

 

Let’s say I have 100 threads. 99 threads read the value of a variable and print it to the console and 1 thread changes that variable’s value.

Is there any harm in having the 99 reader threads access the variable at the same time and only have the resource locked when the writer thread is operating on it?

 

If not, then maybe a Slim Reader/Writer (SRW) lock is the correct lock for your program.

 

A resource that is locked with a reader lock will not contend with any other reader threads.

 

A resource that is under a writer lock will contend with both reader and writer threads.

 

From the SRW documentation:

Reader threads read data from a shared resource whereas writer threads write data to a shared resource. When multiple threads are reading and writing using a shared resource, exclusive locks such as a critical section or mutex can become a bottleneck if the reader threads run continuously but write operations are rare.
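As a minimal, illustrative sketch of the 99-readers / 1-writer idea using an SRW lock (the thread counts, loop counts, and names here are illustrative, not from the post):

```c
#include <windows.h>
#include <stdio.h>

#define READER_COUNT 99

SRWLOCK g_Lock = SRWLOCK_INIT;   // SRW locks need no explicit initialize/delete calls
LONG g_SharedValue = 0;          // the variable the threads read and write

// Any number of readers can hold the shared lock at the same time.
DWORD WINAPI ReaderThread(LPVOID lpParam)
{
    for (int i = 0; i < 100; i++) {
        AcquireSRWLockShared(&g_Lock);
        printf("value: %ld\n", g_SharedValue);
        ReleaseSRWLockShared(&g_Lock);
    }
    return 0;
}

// The single writer takes the exclusive lock, blocking readers only while it writes.
DWORD WINAPI WriterThread(LPVOID lpParam)
{
    for (int i = 0; i < 10; i++) {
        AcquireSRWLockExclusive(&g_Lock);
        g_SharedValue++;
        ReleaseSRWLockExclusive(&g_Lock);
        Sleep(10);                // writes are rare compared to reads
    }
    return 0;
}

int main(void)
{
    HANDLE readers[READER_COUNT], writer;

    for (int i = 0; i < READER_COUNT; i++)
        readers[i] = CreateThread(NULL, 0, ReaderThread, NULL, 0, NULL);
    writer = CreateThread(NULL, 0, WriterThread, NULL, 0, NULL);

    for (int i = 0; i < READER_COUNT; i++) {
        WaitForSingleObject(readers[i], INFINITE);
        CloseHandle(readers[i]);
    }
    WaitForSingleObject(writer, INFINITE);
    CloseHandle(writer);
    return 0;
}
```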

 

It's easy, when thinking about thread synchronization, to treat every problem as a nail and a critical section as the hammer. But it is important to look at all the tools you have in your toolbelt before you start swinging.

 

Can I distribute this work?

It can be convenient when scaling an app to scale vertically and create one very big strong machine. But there are still limitations to what one single machine can do.

 

If the bottleneck isn’t the available resources (CPU utilization, available memory, etc.) but waiting for a single resource, then how big your machine is becomes less relevant.

 

You can often find better performance not by throwing all the work at a single machine but by allowing multiple machines to perform the work.

 

For example, what if an application heavily relies on one domain controller (DC)? After initial DC discovery, the result will be stored in the DCLocator cache, and all subsequent work will be sent to that DC. If that DC is experiencing bottlenecks due to lock contention, then why not send the work to another DC? Or even better, don’t wait for that DC to encounter issues before distributing the work. If the information is available from any DC, why not send the work to any DC?

 

Do I need to be doing this much work in the first place?

This can be a tough pill to swallow but it is important to keep it in mind.

 

The correct scaling solution isn’t to throw work at a machine until it falls over, but rather to find an appropriate performance balance between necessary system functionalities and application workloads.

 

When work is bottlenecking an app, it may be worth considering if that work is necessary.

 

This often comes in the form of:

  • Frequency (Do I need to query this information X times per Y?)
  • Size (Do I need to query all items in database Y?)

The idea of pulling back is usually the quickest way to alleviate the symptoms. This can take the form of:

  • Reducing the number of threads contending for a lock
  • Reducing the time spent while holding a lock

Pulling back is particularly relevant when the lock contention is occurring outside of the application. There are necessary locks within the operating system that multiple applications will leverage. If your application throws unnecessary amounts of work at this lock, everyone else who relies on this lock will suffer.

 

Wrapping up

At the end of the day, lock contention is an issue that arises frequently, and its solution is often a part of a larger conversation.

I understand that the suggestions above are not easy, but they are important. Not only within your application but within a healthy operating system. Applications have internal locks, but they run on the operating system which has its own locks which the application will leverage directly or indirectly. Applications running on an operating system need to behave like good roommates.

  • Clean up after yourself
  • Don’t hog the TV
  • Don’t use all the hot water
  • Etc…

Otherwise, everyone is going to suffer as a result of one bad roommate.

If you enjoyed this post and would like to know more, feel free to check out my posts on the Microsoft CISTech community at https://aka.ms/WillAftring.
