Performance troubleshooting basics

The_Exchange_Team · ‎Sep 28 2005

Since I occasionally troubleshoot performance problems, I thought I’d write up the basics of the process I use. Due to the nature of this topic, this isn’t going to be comprehensive coverage of all possible performance problems. I just want to focus on the main process, and list a few of the things I look at first to break the problem down.

As I see it, there are two main symptoms of a server performance problem:

High average RPC latencies or a high number of outstanding RPC requests
Large or rising queues

This blog will focus on identifying possible server bottlenecks related to the first symptom, although the steps for rising queues are pretty much the same.

In general, the causes of a performance problem fall into two categories:

1. problems due to increased load

2. problems with a resource bottleneck

When troubleshooting, I always attempt to differentiate between these two categories of causes first. It is important to note that an increase in load will cause a resource bottleneck, unless the increase is sufficiently small that no symptom of poor performance is observed. Thus, if an increase in load is found, I attempt to identify the cause for the increase in load first, and then identify the resource which is bottlenecked.

Resource bottlenecks may occur on any of the following resources:

Disk
Memory
Processor
Network (between client and server, or server and any other server)
Active Directory server
Other server resources

Increased load can be caused by a change or increase in user activity, or by other applications using server resources. In this blog, I’ll only talk about load caused by MAPI client load, though the principles apply to any type of load.

Determining if the server has an RPC performance problem

First, find out when and for how long the problem repros. Check to see if you are reproing the symptom now. To verify that the server is exhibiting the reported RPC performance problem, check the following:

See if MSExchangeIS\RPC Averaged Latency is higher than 50 ms (this is considered a problem).
See if the number of outstanding RPC requests (MSExchangeIS\RPC Requests) is higher than 50.
Look at MSExchangeIS\RPC Operations/sec. Is it higher than you expect? This number depends on the number of users. At Microsoft, we usually see about 0.20 operations per second per user on the server.
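
As a rough illustration, the three checks above can be applied to counter values that have already been collected. This is a minimal sketch, not part of any Exchange tool: the function name, the dictionary keys, and the assumption that the values are read from Performance Monitor and passed in by hand are all hypothetical. Only the two limits of 50 come from the list above.

```python
# Thresholds quoted in the text; everything else here is a hypothetical helper.
LATENCY_LIMIT_MS = 50   # RPC latency above this is considered a problem
REQUESTS_LIMIT = 50     # outstanding RPC requests above this is a problem


def check_rpc_counters(latency_ms, rpc_requests, rpc_ops_per_sec, user_count):
    """Summarize one sample of the three RPC counters.

    The values are assumed to have been read from Performance Monitor
    (or any other collector); this function only applies the thresholds.
    """
    # Guard against an empty server so the division cannot fail.
    ops_per_user = rpc_ops_per_sec / user_count if user_count else 0.0
    return {
        "latency_high": latency_ms > LATENCY_LIMIT_MS,
        "requests_high": rpc_requests > REQUESTS_LIMIT,
        "ops_per_user": round(ops_per_user, 3),
    }
```

For example, `check_rpc_counters(72, 12, 310, 1500)` flags the latency, leaves the request count alone, and reports the per-user operation rate so it can be compared against whatever you consider normal for your users.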

If the server isn’t exhibiting a performance issue at the time you are investigating, you may not be able to find out what is going wrong. Nonetheless, even if these counters appear to be healthy, if users are complaining, I usually continue to investigate anyway – but keep in mind it’s best to investigate while the server is unhealthy. If the server is healthy, I use the information I gather as a baseline to compare against when the server exhibits poor performance. Now you’re ready for the next step: identifying high or increased load.

Identify high load

After I find out if the problem is reproing, the next thing I look at is the sources of server load. There are many sources of load such as incoming mail rates, MAPI operations, POP3 requests, or even 3rd-party software running on the server. What you look at first will depend on what you know about the system. For a typical back-end server that hosts Outlook users, I look for high rates of RPC Operations per second. If RPC rates haven’t changed, yet the health of the server has declined, I may also look for other sources of increased load, such as the incoming message rates. If I find an increase in load, I attempt to identify the cause.

In the case of high RPC load, I look to see if more than 20-30% of the load is due to a single user, or if the load is distributed across many users. I also check if the number of logons per user is excessive (greater than 4 per user) for each database. Finally, I see if the average mailbox size is high on that server, or if individual folders are excessively large (more than 5000 items in a user’s inbox, sent items, deleted items or calendar folder). Large folders and mailbox growth can lead to increased CPU and I/O load.
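
The logon and folder-size checks in the paragraph above are simple enough to sketch in a few lines. This is only an illustration: the function names and the input shapes (a dictionary of folder name to item count, and plain totals for logons and users) are assumptions, and how you gather those numbers is left open. The limits of 4 logons per user and 5000 items per folder are the ones quoted above.

```python
# Limits quoted in the text.
MAX_LOGONS_PER_USER = 4
MAX_FOLDER_ITEMS = 5000
# Folders the text singles out; compared case-insensitively.
WATCHED_FOLDERS = ("inbox", "sent items", "deleted items", "calendar")


def oversized_folders(folder_counts):
    """Return the watched folders in one mailbox that exceed the item limit.

    folder_counts is a hypothetical mapping of folder name -> item count.
    """
    return sorted(
        name for name, count in folder_counts.items()
        if name.lower() in WATCHED_FOLDERS and count > MAX_FOLDER_ITEMS
    )


def excessive_logons(logon_count, user_count):
    """True when a database averages more than 4 logons per user."""
    return user_count > 0 and logon_count / user_count > MAX_LOGONS_PER_USER
```

Calling `oversized_folders({"Inbox": 12000, "Calendar": 300, "Archive": 9000})` returns only the Inbox, because the archive folder is not one of the four folders named above.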

As I mentioned, at Microsoft, our users average 0.2 ops/sec (MSExchangeIS RPC Operations per second divided by the number of mailboxes on the server) at our peak busy time (around 9-11 am). If the whole server is even 10% higher than that for a sustained period, I suspect we’ve had an increase in load. Normally a 10% fluctuation wouldn’t be noticeable, but we keep a history of this value, so I know what is normal for our load profile. For other systems, I have to guess, which usually works fine too. I tend to get very concerned when RPC rates are higher than 0.4 ops/sec per user, though problems can occur at lower rates. If I don’t have a baseline for the unhealthy server, I compare with other servers in the same company that are healthy. Are the per-user rates higher on the server that is having trouble? If you don’t know your load profile, you just have to guess. You can use 0.20 ops/sec as an approximate baseline for active users.

If the rates are high, I run Exmon.

Use Exmon to determine if a single user is responsible for more than 20% of the server’s RPC load (on a server with more than 200 users) or 40% of the load (on a server with 200 or fewer users). I usually collect 1 minute of data every 5-10 minutes, to look for users that are consistently consuming a lot of CPU. It’s normal for some common operations to cause a user to hog CPU for a short period of time – ignore this. You’re looking for the guy that is at the top of the list most of the time. This can be more of an art than a science at times… use your best judgment.
If the RPC rate is high (for a single user or for everyone), find out if users have desktop search, 3rd-party client plug-ins or BlackBerry devices. Consider investigating these applications as the source of high load, and trying to reduce the load (by removing plug-ins or verifying that they are being used in an optimal fashion).
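
The "top of the list most of the time" test can be sketched as follows, assuming you have copied per-user operation counts out of several Exmon samples by hand. The input format, the function names, and the reading of "most of the time" as "more than half of the samples" are all assumptions for illustration. The 20% and 40% cutoffs are the ones given above.

```python
from collections import Counter


def share_threshold(user_count):
    """20% on servers with more than 200 users, otherwise 40%."""
    return 0.20 if user_count > 200 else 0.40


def consistent_heavy_users(samples, user_count, min_fraction=0.5):
    """Return users whose share of load exceeds the cutoff in most samples.

    samples is a list of dicts mapping user -> RPC operation count for one
    collection interval (a hypothetical format, not an Exmon export).
    """
    limit = share_threshold(user_count)
    hits = Counter()
    for sample in samples:
        total = sum(sample.values())
        if total == 0:
            continue  # nothing happened in this interval
        for user, ops in sample.items():
            if ops / total > limit:
                hits[user] += 1
    # Keep only users flagged in more than min_fraction of all samples.
    return sorted(u for u, n in hits.items() if n / len(samples) > min_fraction)
```

A user who spikes in one sample and then disappears is ignored, which matches the advice to disregard short bursts from common operations.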

Note: even though I generally sort by the %CPU usage, this doesn’t mean I am expecting a CPU bottleneck. Actually, disk bottlenecks are the most common bottlenecks that Exchange servers encounter. I look at %CPU usage in Exmon because it is fast, and because high CPU usage will usually translate to high I/O. Some people prefer to sort on the Read Pages and PreRead columns as a more precise way to find out which users are causing the most I/O reads, and the Dirty Pages column to find which users are causing the most I/O writes.

Identify resource bottlenecks

To find the bottleneck, I usually look at most of the performance counters that are described in the “Troubleshooting Exchange Server 2003 Performance” whitepaper, though I always start with CPU and disk. For nearly all cases, I use the thresholds from the whitepaper.

Disk bottleneck

With disk, I’m mainly looking for read latencies on the database drives, and write latencies on the transaction log drives. I’m not going to go into all the counters and thresholds because it’s all laid out in the whitepaper. If the latencies are high, or other counters indicate a problem, the server has a disk bottleneck.

Processor bottleneck

Check if the processor is healthy. Mostly, I check that the CPU is below 80%, and that most of the CPU is coming from the store process (on a back-end machine). If CPU is higher, I know we have a CPU bottleneck. If it’s not coming from store.exe, I find out what process is hogging CPU.
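
Here is a minimal sketch of that processor check, assuming you already have the total CPU percentage and a per-process breakdown from Performance Monitor or Task Manager. The function name, the input dictionary, and the reading of "most of the CPU" as "more than half of the measured process total" are assumptions. The 80% limit and the store.exe process name come from the text.

```python
CPU_LIMIT_PCT = 80  # the text treats CPU above this as a bottleneck


def assess_cpu(total_cpu_pct, per_process_pct, store_process="store.exe"):
    """Apply the processor checks to one sample.

    per_process_pct is a hypothetical mapping of process name -> % CPU.
    """
    if per_process_pct:
        top_process = max(per_process_pct, key=per_process_pct.get)
    else:
        top_process = None
    store_pct = per_process_pct.get(store_process, 0.0)
    # "Most of the CPU" is interpreted as more than half of the measured total.
    store_dominant = store_pct > sum(per_process_pct.values()) / 2
    return {
        "cpu_bottleneck": total_cpu_pct > CPU_LIMIT_PCT,
        "store_dominant": store_dominant,
        "top_process": top_process,
    }
```

With `assess_cpu(92, {"store.exe": 20, "backup.exe": 65, "inetinfo.exe": 7})` the result reports a bottleneck that is not coming from the store, and names the process to look at next (the process names other than store.exe are made up for the example).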

Server misconfiguration

There are many other things that can impact performance. I always recommend running the ExBPA tool to ensure the server is well configured. No one can remember the thousands of configuration details to check; let ExBPA do it for you. Here are a few other things you may want to check to verify that the server configuration encourages good performance:

Are any maintenance tasks still running, or have they run recently? Make sure all maintenance tasks run during non-busy hours.
Are the transaction log drives shared with any other resource?
Are the database files, temp file, tmp file, SMTP server or system drives shared with any other resource?
Is RegTrace enabled? (leaving RegTrace enabled can cause performance issues)
Is there less than 10% free disk space on any drive used by the Exchange Server?
Is there less than 20MB free on any drive used by the Exchange Server?
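
The two free-space checks at the end of this list are easy to automate. The sketch below uses Python's standard `shutil.disk_usage`; the function names and the wording of the messages are hypothetical, and the list of drives to check is something you would supply yourself. The 10% and 20 MB limits are the ones above.

```python
import shutil

MIN_FREE_FRACTION = 0.10             # less than 10% free is a warning sign
MIN_FREE_BYTES = 20 * 1024 * 1024    # less than 20 MB free is a warning sign


def low_space_reasons(total_bytes, free_bytes):
    """Return the free-space limits that a drive violates, if any."""
    reasons = []
    if total_bytes > 0 and free_bytes / total_bytes < MIN_FREE_FRACTION:
        reasons.append("less than 10% free")
    if free_bytes < MIN_FREE_BYTES:
        reasons.append("less than 20 MB free")
    return reasons


def check_drive(path):
    """Apply both limits to the drive that holds the given path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return low_space_reasons(usage.total, usage.free)
```

Running `check_drive` on each drive used by the server (database, transaction log, temp, and system drives) returns an empty list for healthy drives and a short explanation for the rest.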

Hardware problem

Occasionally, hardware is unhealthy, and that is the cause of a resource bottleneck. You’re just going to have to make that judgment based on the individual circumstances. Are disk latencies high even though the throughput is low? Maybe something just isn’t performing well. If you suspect hardware isn’t living up to spec, swap it out if you can.

Resolutions

Once I know what’s going on, I can start working on suggested resolutions. I usually get the whole picture before making any changes, because most systems will exhibit many problems simultaneously. It is easy to focus on the first problem that is found, and miss another bigger problem. So, don’t act on these resolutions until you can answer both of these questions:

1) Is the problem caused by increased load?

2) Which resources are my bottlenecks?

Resolving performance problems due to increased load

If a performance problem is due to increased load, you have a few options. First, if you have identified the source of the high load, you might be able to reduce the load – perhaps by asking users to install fixes for some of their client applications, or to stop using certain expensive applications. That’s the most obvious, but it’s not always an option. Next, you may want to restrict mailbox sizes, and instruct users to archive items out of folders – this also reduces load. Finally, you may decide to spread the load between servers by moving users. For example, if some users have a lot of email-intensive applications, you may want to avoid putting them on the same database or same server. On the other hand, sometimes you may do the opposite – move the extra heavy users to their own server and let them duke it out between themselves for server resources. Either way, the original server, and the rest of the users, are happier.

Resolving performance problems due to a bottleneck

If you can’t reduce the load on a server, your options are to improve the capacity of the resource that is bottlenecked, modify the configuration of the server when applicable, replace malfunctioning hardware, or move users off to another server. Increasing the capacity of a resource usually means adding more hardware. Sometimes you can increase the capacity by offloading some server work to another server, such as removing optional applications that are running on the server.

Resolving a disk bottleneck:

If any of the disks are unhealthy, and there hasn’t been an increase in load, I first check to see if the disks are used by anything else (are they shared with another Exchange server, or disk-intensive application like SQL?). Do performance problems only occur when the disks are also being accessed by the other server/program? Exchange doesn’t do well when disks are bottlenecked, and in my experience, disks that are shared often get bottlenecks. The transaction log drives, in particular, should not be shared by any other resource.

If disks are unhealthy, you have a few basic choices: move users to another server or to another database hosted on a different drive system, or increase the number of spindles for the current disk array.

Resolving a CPU bottleneck:

The resolutions for a CPU bottleneck are simple: increase the processor capacity by increasing the number of CPUs or turning on hyper-threading when applicable, move users to another server, or remove any optional applications on the server that are consuming CPU.

Resolving a memory bottleneck:

If the kernel memory is unhealthy, and the number of logons per user is high, I recommend removing users from the server or, if this is an option, reducing the number of logons per user. You can reduce logons per user by turning off 3rd-party plug-ins, or reducing the number of client applications per user.

The process in a nutshell…

There are many more details that I’ve left out of this blog, but I’ve covered the basics.

Here is the process in a nutshell:

1. Identify the symptoms.
2. Find out when the problem occurs.
3. Identify if there has been an increase in load.
   a. Try to identify the cause of the increase.
   b. Reduce the load if you can.
4. Identify the hardware bottlenecks.
5. Check if the bottleneck is due to a misconfiguration.
   a. Fix the misconfiguration if present.
6. Check if the bottleneck is due to malfunctioning hardware.
   a. Replace the hardware if necessary.
7. Finally, if steps 3-6 don’t resolve the problem, remove the resource bottleneck:
   a. Increase the capacity of the bottlenecked hardware, or
   b. Move users to another server that doesn’t have a bottleneck.

Sometimes you will have to iterate through steps 3-6 a few times, as resolving one bottleneck may expose another. And yeah, I hear your pain - a lot of resolutions do involve moving users to other servers. It’s just a fact of life that if the user load increases, you’ve got to move over to make room for it. In this case, it often means more hardware, unless you can convince your users to stop any excessive activity.

Unfortunately, as I mentioned at the start of this blog, there are billions and billions of specific details (this is my Carl Sagan impression) that I have left out, but hopefully this has provided a little structure to the troubleshooting-and-problem-resolution process.

Finally, I’ll take a moment to plug my latest project – I mean, I’d like to mention that there is a new tool that is designed to take away some of the tedium of troubleshooting performance. Look for the Exchange Performance Troubleshooter Analyzer (EXPTA) 1.0 release in a few months. The tool is based on the same technology as ExBPA, and will walk you through the steps of identifying high load and bottlenecks.

- Nicole Allen

Sep 30 2005

If RPC Average Latency is a problem when it's greater than 50 ms, then why is the MOM 2005 Exchange Management Pack set to alert at >200 ms?

Sep 30 2005

The MOM pack uses a number of thresholds that differ from the performance guidelines. MOM thresholds are based on relatively small sample times, and lower thresholds would give many false positives. When I blog about 50 ms thresholds, I am expecting these latencies to be sustained over a period of time (e.g., 30 minutes or more). There are many transient reasons for counters to spike occasionally, and spikes are not always an indication of a performance problem.

Sep 30 2005

Nicole, excellent work 8)
Thanks for this great guideline

Sep 30 2005

Don't forget the App and Sys logs.
I don't know how many times I have been called to a site and found the App and Sys logs FULL of Disk Buffer Overflow errors, read write failures, and latency issues.
FIRST, check the logs.
My brother is mowing the lawn and the mower dies. He wants to buy a new mower. I check the gas tank... lo and behold, add gas and it works.
Keep it simple!

Oct 02 2005

Thanks for the article -- (of course) it cannot cover everything... but it's a great outline for us to follow. I'm sure I'll be "borrowing" this information!

Also looking forward to your new "EXPTA" utility!

Oct 04 2005

So, if Microsoft sees 0.20 operations per second per user on the server, what would be the upper threshold for this value? Is 1.0 too high? Let's say you came across a server with 500 users and 2 operations per second per user...

Oct 04 2005

First, let me thank Adam for pointing out that the App and Sys logs are a great place to start to look for problems.

In reply to Kevin's question: there isn't a hard limit on how many operations per second per user is too much. It's just a question of how much hardware you want to buy, and how many applications the users want to run that access the server. If I saw a server with 500 users and 2 ops per second per user, I'd pull out ExMon to identify if the load was caused by a few users. I've quite often found certain apps to be capable of creating extremely high load. When even one or two users run some applications, it can have a noticeable effect on the server. Occasionally we've found an application was incorrectly configured and were able to reduce the load. On the other hand, sometimes everyone in the organization has multiple applications accessing the server, and the applications are important to their business - in this case, we just make sure the server is beefy enough to handle the load. There's just no single simple answer here. Is there enough interest in this for another blog topic?

Oct 13 2005

I'm definitely interested in any topics around performance and Exchange, the more information I can extract from the Blog team the better!

We do performance logging for Exchange and I know how to look for the big issues but any articles (like this one) that go into steps you would actually take are much appreciated.
