Quick Reference: Troubleshooting, Diagnosing, and Tuning MaxConcurrentApi Issues
Published Sep 20 2018 08:04 AM

First published on TechNet on Jan 12, 2014

 

Hi, this is Brandon Wilson again. In my newest “Quick Reference” (get the joke?), we will be narrowing the scope to troubleshooting and tuning MaxConcurrentApi issues.

 

There is a lot of information out there on the net on this topic, but I thought it would make everyone’s life easier if I got the bulk of that information together, translated it, and added some step-by-step examples. I said to myself, “What better Christmas present than to help everyone become a MCA ninja!” Let’s not get into the fact that it’s a tad bit late and that it takes some semi-extensive reading…

I also welcome you to leave comments and questions, or suggestions for articles that you as the reader would be interested in seeing us do.

 

What is MaxConcurrentAPI anyway?

What are some of the symptoms of MCA?

How do I detect a MCA issue?

Netlogon log

UPDATE!

Netlogon performance counters

Event logs

User notification

How do I change the value of MaxConcurrentAPI?

How do I tune MCA?

Monitoring

Additional References

 

What is MaxConcurrentApi anyway?

Simply put, MaxConcurrentApi (commonly referred to as “MCA”) controls the number of concurrent NTLM authentication attempts on a Windows system, typically at the application server and/or domain controller level. More specifically, it defines the number of threads within lsass.exe available to Netlogon for NTLM or Digest authentication and Kerberos PAC validation functions per secure channel.

 

A MaxConcurrentApi issue, or “bottleneck”, occurs when we fill all of these threads and begin to time out authentication requests (typically due to slow access to domain controllers or a lack of system resources). The timeout occurs 45 seconds from the authentication “entry”; if we do not successfully authenticate a user against a domain controller within that timeframe, the authentication call will time out (remember the phrase “Can’t allocate client API slot”), and we will throw a 0xC000005E (STATUS_NO_LOGON_SERVERS) error for the return call of the authentication. That is a MaxConcurrentApi issue in a nutshell. NOTE: A 0xC000005E error does not automatically indicate that a MaxConcurrentApi issue is occurring. We will discuss this in more detail momentarily.

NOTE: The timeout interval of 45 seconds is not configurable.

 

Fortunately, the setting for MaxConcurrentApi is configurable, which we will touch on shortly. MCA issues can be easily avoided by shifting to utilize Kerberos authentication wherever possible. Note, though, that if Kerberos PAC validation is enabled, you are still subject to MCA limitations.

 

Within this blog, any reference I make specifically to “NTLM authentication” can also be assumed to mean any authentication type covered by the Netlogon service. I managed to find a nice diagram of this out on TechNet here which outlines this point nicely:

 

 

What are some of the symptoms of MCA?

Symptoms for MCA issues are typically seen on domain controllers and application servers servicing applications that perform NTLM authentication. This is especially true of cross-forest authentication, authentication across an external trust, and authentication across WAN links. DNS misconfiguration can also lead to slow or failed authentication attempts as well, so be forewarned…plan smart!

NTLM authentication issues can present themselves in a number of ways. You may see intermittent or repetitive authentication prompts when accessing web applications, greyed out/non-functional mapped printers, failures to save files, and the list goes on….

 

Some of these symptoms (and keep in mind this can be inside of any application using NTLM authentication: Exchange, SQL, IIS, 3rd party, etc.) can include:

 

Possible Symptoms

1. Users may be prompted for authentication even though correct credentials are used.

2. Slow authentication (may be intermittent or consistent); this may mean authentication slows through the day, or is slow from the hours of 8a-9:30a but fine the rest of the time, or any number of scenarios here.

3. Authentication may be sporadic (10 users sitting next to each other may work fine, but 3 other users sitting in the same area may not be able to authenticate, or vice versa)

4. Microsoft and/or 3rd party applications fail to authenticate users

5. Authentication fails for all users

6. Restarting the Netlogon service on the application server or domain controllers may temporarily resolve the issue

NOTE: THIS SHOULD NOT BE DONE AS A WORKAROUND AS YOU ARE MERELY PUSHING THE PROBLEM OFF TO ANOTHER MACHINE!!

NOTE: Any authentication handled by Netlogon (or Kerberos PAC validation) may experience the same or similar behavior.

 

Now let’s take a look at some specific bottleneck scenarios, and where our choke points could be. When reviewing the table, keep in mind that in front-end/back-end configurations, both the front and back ends are potential choke points. This table is meant to cover the possible choke points in a more generalized scope, just to provide pointers on where to look. We will get into a few specific scenarios momentarily:

 

Example Bottleneck Scenarios

Possible Choke Points (aka; Where Do I Collect Data?)

Application server sending credentials for users in the same domain

Scenario details:

Users in DomainB

Application server in DomainB

Domain controllers in DomainB

1. Application server

2. Domain controller from DomainB (same logical and physical site)

3. Domain controller from DomainB (same logical site name; possibly remote physical site)

4. Domain controller from DomainB (different logical site name)

Application server sending credentials for users in a different directly trusted domain (non-transitive external trust OR transitive trust with forest root as in this example)

Scenario details:

Users in DomainA

Application server in DomainB

Domain controllers in DomainB

Domain Controllers in DomainA

1. Application server

2. Domain controller from DomainB (same logical and physical site)

3. Domain controller from DomainB (same logical site; possibly remote physical site)

4. Domain controller from DomainB (different logical site)

5. Domain controller from DomainA (same logical site name)

6. Domain controller from DomainA (different logical site name)

Application server sending credentials for users in a different child domain (within the same forest)

Scenario details:

Users in DomainC

Application server in DomainB

Domain controllers in DomainB

Domain Controllers in DomainC

Domain Controllers in DomainA (forest root)

1. Application server

2. Domain controller from DomainB (same physical site)

3. Domain controller from DomainB (same logical site; possibly remote physical site)

4. Domain controller from DomainB (different logical site)

5. Domain controller from DomainA (same logical site name)

6. Domain controller from DomainA (different logical site name)

7. Domain controller from DomainC (same logical site name)

8. Domain controller from DomainC (different logical site name)

Application server in child domain sending credentials for users in the forest root in a different forest (over a forest trust)

Scenario details:

Users in DomainD

Application server in DomainB

Domain controllers in DomainB

Domain controllers in DomainA (forest root)

Domain controllers in DomainD (trusted forest root)

1. Application server

2. Domain controller from DomainB (same physical site)

3. Domain controller from DomainB (same logical site; possibly remote physical site)

4. Domain controller from DomainB (different logical site)

5. Domain controller from DomainA (same logical site name)

6. Domain controller from DomainA (different logical site name)

7. Domain controller from DomainD (same logical site name)

8. Domain controller from DomainD (different logical site name)

Application server in child domain sending credentials for users in the child domain of a different forest (over a forest trust)

Scenario details:

Users in DomainE

Application server in DomainB

Domain controllers in DomainB

Domain controllers in DomainA (forest root)

Domain controllers in DomainD (trusted forest root)

Domain controllers in DomainE (child domain of trusted forest)

1. Application server

2. Domain controller from DomainB (same physical site)

3. Domain controller from DomainB (same logical site; possibly remote physical site)

4. Domain controller from DomainB (different logical site)

5. Domain controller from DomainA (same logical site name)

6. Domain controller from DomainA (different logical site name)

7. Domain controller from DomainD (same logical site name)

8. Domain controller from DomainD (different logical site name)

9. Domain controller from DomainE (same logical site name)

10. Domain controller from DomainE (different logical site name)

And for a slightly more complex scenario as I mentioned above this table…

A front-end/back-end application server configuration (such as Microsoft Exchange) in child domain sending credentials for users in the child domain of a different forest (over a forest trust)

Scenario details:

Users in DomainE

Application server in DomainB

Domain controllers in DomainB

Domain controllers in DomainA (forest root)

Domain controllers in DomainD (trusted forest root)

Domain controllers in DomainE (child domain of trusted forest)

1. Front-end application server

2. Back-end application server

3. Domain controller from DomainB (same physical site)

4. Domain controller from DomainB (same logical site; possibly remote physical site)

5. Domain controller from DomainB (different logical site)

6. Domain controller from DomainA (same logical site name)

7. Domain controller from DomainA (different logical site name)

8. Domain controller from DomainD (same logical site name)

9. Domain controller from DomainD (different logical site name)

10. Domain controller from DomainE (same logical site name)

11. Domain controller from DomainE (different logical site name)

 

You will notice I use the phrase “site name” a lot. This is important for authentication across trusts, because the initial query performed to identify a domain controller in the trusted forest/domain looks for a site name identical to the site name of the proxying domain controller. If you are performing cross-forest authentication and the logical site names in Active Directory do not match, then you can end up connecting to ANY domain controller in the target forest (including those that may be across extremely slow WAN links, WAN links that span the globe, may be shut down due to local laws/regulations, etc.). This has the net result of slowing authentication down!

Some tidbits on this and how to change the behavior can be found here and here.
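
As a quick sanity check on site coverage, the built-in nltest utility can show you which site a server believes it is in and which domain controller (and site) it is locating for a trusted domain. This is a minimal sketch run from an elevated PowerShell prompt; FakeDomain.local is a placeholder for the trusted domain you are actually authenticating against:

# Show the Active Directory site this machine has been assigned to
nltest /dsgetsite

# Show which domain controller (and which site) is being located for a trusted domain
nltest /dsgetdc:FakeDomain.local

Comparing the “Our Site Name” and “Dc Site Name” values in the /dsgetdc output is a quick way to spot a site name mismatch.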

 

How do I detect a MCA issue?

There are 6 primary ways this issue is detected:

 

Netlogon log

This is the only way to detect a MaxConcurrentApi issue in Windows 2000 and Windows 2003. Note: Netlogon performance counters are available for systems running Windows 2003 SP1 and above with KB928576 installed.

Netlogon performance counters

Available natively in all versions of Windows after Vista, and available for Windows 2003 with a minimum of SP1 and KB928576 installed.

Event logs

SYSTEM event log:

In Windows 2008 R2 with SP1, installing KB2654097 will enable additional event log entries to track NTLM authentication delays and failures via Netlogon event ID 5816, 5817, 5818, or 5819.

NOTE: Windows Server 2012 and above contain these events out of the box and no hotfixes are required.

NTLM/Operational event log:

In Windows Vista/2008 and above, if you have NTLM auditing group policy settings enabled, you can collect data on the authentication requests from these logs as well. The NTLM/Operational event log is outside of the scope of this blog at this time.

User notification

User calls may come into the service desk/help desk for possible authentication issues.

ETW logging

In Windows Vista/2008 and above, you natively have the ability to utilize ETW logging for NTLM and Netlogon related providers to collect data. This is outside of the scope of this blog at this time.

A great resource that talks about ETW logging encompassing NTLM authentication can be found here.

 

Now, the most important thing to remember when reviewing MCA issues (or any other authentication issue) is how the authentication works. In the case of NTLM authentication, we have a specific path we have to follow where the authentication attempt will either be proxied up and down the forest tree over to the destination domain (in the case of a forest trust or authentication within the same forest between different child domains) or proxied directly to the destination domain (in the case of shortcut and external trusts).

 

What this means as far as data collection goes is that you need to collect data from all layers of the authentication chain. Let me see if I can explain better (and hopefully without confusing the matter more).

 

Visualize if you will….

 

 

We have a web server in domain B that services users from domain C and uses NTLM authentication. Both Domain B and Domain C are in the same forest, with Domain A as a parent.

 

In this case, the web server will receive the NTLM authentication request and send it to a domain controller (in the same logical site) in domain B. The domain controller in domain B will then determine the user is not in domain B and will pass the authentication request up to the forest root (domain A). The domain controller (in the same logical site) in the forest root will then make the same determination that it is not an authentication request for domain A, and proxy the authentication request back down to domain C where authentication will finally get a success or failure response (from a domain controller in the same logical site). Once success or failure is noted, the response then has to be sent back along the same chain to complete the authentication. This happens each time a resource is accessed!

 

In the above scenario, the primary potential bottlenecks are:

 

1. The web server in domain B

2. The domain controller(s) in domain B

a. If there is no domain controller within the same logical site as the system sending the authentication request, then this request can be sent to any domain controller in domain B.

3. The domain controller(s) in the forest root (domain A)

a. If there is no domain controller within the same logical site as the system sending the authentication request, then this request can be sent to any domain controller in domain A.

4. The domain controller(s) in domain C

a. If there is no domain controller within the same logical site as the system sending the authentication request, then this request can be sent to any domain controller in domain C.

NOTE: Since both the resource and user domain in this scenario are in the same forest, the authentication path can be shortened by using a shortcut trust. If you have a delay between domains in the same forest, then you may be able to better constrain, or possibly even eliminate, any delays in the authentication path.

 

NOTE 2: Note the use of the word “primary” above. In reality, the only thing missing from this list is the workstation. Although not typical, if you know of a consistent user and machine with a problem, then you can also review verbose Netlogon logs from the client/workstation to get the full picture from end to end. Don’t count on the client-side Netlogon log providing you a root cause though; it can happen, but it’s fairly rare in my experience (think system resource issue at the client as an example).

 

Now let’s make it a bit more complex and look at authentication across a cross-forest trust and the potential bottlenecks:

 

 

Let’s say we have a web server in domain B again that services users from domain E and uses NTLM authentication. Domain B is a child domain of Domain A (forest root), and Domain A holds a forest trust with Domain D, whose forest contains Domain E (the child domain holding the user accounts).

 

In this case, the web server will receive the NTLM authentication request and send it to a domain controller (in the same logical site) in Domain B. The domain controller in Domain B will then determine the user is not in Domain B and will pass the authentication request up to the forest root (Domain A). The domain controller (in the same logical site) in the forest root will then make the same determination that it is not an authentication request for Domain A, and proxy the authentication request across the forest trust to Domain D. Domain D will then also determine the user is not in Domain D and will in turn proxy the authentication to Domain E (the destination child domain holding the users) where authentication will finally get a success or failure response (from a domain controller in the same logical site). Once success or failure is noted, the response then has to be sent back along the same chain to complete the authentication. This happens every time a resource is accessed!

 

In the above scenario, the primary potential bottlenecks are:

1. The web server in domain B

2. The domain controller(s) in domain B

a. If there is no domain controller within the same logical site as the system sending the authentication request, then this request can be sent to any domain controller in domain B.

3. The domain controller(s) in the forest root (domain A)

a. If there is no domain controller within the same logical site as the system sending the authentication request, then this request can be sent to any domain controller in domain A.

4. The domain controller(s) in domain D

a. If there is no domain controller within the same logical site as the system sending the authentication request, then this request can be sent to any domain controller in domain D.

5. The domain controller(s) in domain E

a. If there is no domain controller within the same logical site as the system sending the authentication request, then this request can be sent to any domain controller in domain E.

 

And now just for a nice round view of things, let’s simplify things to an external trust point of view:

 

 

Let’s say we have a web server in Domain A that services users from Domain D and uses NTLM authentication. Domain A and Domain D have an external trust with each other for this scenario. Note: This is also the same method that would apply for shortcut trusts.

 

In this case, the web server will receive the NTLM authentication request and send it to a domain controller (in the same logical site) in domain A. The domain controller in domain A will then determine the user is not in domain A and will pass the authentication request across the external trust to Domain D, where a domain controller (in the same logical site) will provide a success or failure response. Once success or failure is noted, the response then has to be sent back along the same chain to complete the authentication. This happens every time a resource is accessed!

 

In the above scenario, the primary potential bottlenecks are:

1. The web server in domain A

2. The domain controller(s) in domain A

a. If there is no domain controller within the same logical site as the system sending the authentication request, then this request can be sent to any domain controller in domain A.

3. The domain controller(s) in domain D

a. If there is no domain controller within the same logical site as the system sending the authentication request, then this request can be sent to any domain controller in domain D.

 

So let’s take a closer look at these various methods to see how to detect (and trend) the problem…

Netlogon log-

 

NOTE: Before we start, please be sure to review the “UPDATE” in this section so you can take a look at the newest way to review the Netlogon log with Message Analyzer v1.1!

 

First let’s visit the Netlogon log, which, by the way, is the easiest way to get granular-level details for trending the problem. Detection of MCA issues via the Netlogon log is relatively straightforward; however, trending the data can be more confusing. You must be sure to review both the Netlogon.log and, if it exists, the Netlogon.bak file. For a quick validation of whether MCA issues are occurring via the log file (and that is the focus of this section, vs. a thorough “how to read the log file”), you need to search for the string “Can’t allocate Client API slot”.

 

Here are a couple of methods to do this:

 

1. Open Netlogon.log (or Netlogon.bak) using notepad.exe:

a. Click the Edit menu and select Find

b. In the “Find what” text area, type the string Can’t allocate Client API slot and ensure the “Match case” checkbox is not selected

c. Click the “Find Next” button

d. If there are any matches, then you are having, or have recently had, a MaxConcurrentApi (MCA) issue. Depending on the operating system, the lines will appear like this:

6/3 14:17:43 [CRITICAL] FakeDomain: NlpUserValidateHigher: Can't allocate Client API slot.

Or like this:

6/3 14:17:43 [CRITICAL] [123]FakeDomain: NlpUserValidateHigher: Can't allocate Client API slot.

i. If there are no matches, and the authentication error being returned is 0xC000005E, then you need to look at the next level of the authentication chain. It is important to remember that the “no logon servers available”, or 0xC000005E, error is not the red flag that you have a MaxConcurrentApi issue, as there are many other possible causes for that error that are not due to authentication timeouts. To be clear, it’s not the red flag because this error code can be thrown any time there is an issue contacting a domain controller or domain, which does not necessarily equate to a MaxConcurrentApi issue. The red flag is the string “can’t allocate client API slot”.

e. Repeat steps a-d through the rest of the authentication chain to identify any and all bottlenecks. This is extremely important!!!

 

2. Using the command prompt or a cmd script:

NOTE: This is the same method used for granular analysis as well, which we will dive into shortly. I will make an assumption that the Netlogon.log and Netlogon.bak files are in the c:\temp directory for the purposes of this blog.

a. To identify MCA issues, use the following syntax:

Find /I "Can't allocate client API slot" c:\temp\netlogon.log > c:\temp\MCA-detect-sample.txt

Find /I "Can't allocate client API slot" c:\temp\netlogon.bak >> c:\temp\MCA-detect-sample.txt

b. If you have a MCA issue locally, then the output file will appear something like this:

 

c. On the flip side of that, if you don’t have a MCA issue, then you will not see any output other than the file name:

 

3. And of course, you can always use any tool of your choice capable of parsing text files (Log Parser, PowerShell, VBS, etc.)….
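
For instance, a rough PowerShell equivalent of the find commands above (assuming the same hypothetical c:\temp file locations; drop Netlogon.bak from the list if it does not exist yet) might look like this:

# Search both the current and rolled-over Netlogon logs for the MCA red-flag string
$logs = 'C:\temp\netlogon.log', 'C:\temp\netlogon.bak'
Select-String -Path $logs -Pattern "Can't allocate Client API slot" -SimpleMatch |
    Out-File C:\temp\MCA-detect-sample.txt

# An empty output file means no MCA timeouts were recorded in these logs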

 

For reference, you will see the “Can’t allocate Client API slot” lines coupled with a timeout:

 

06/03 14:16:58 [LOGON] SamLogon: Network logon of FAKEDOMAIN\User1 from WORKSTATION1 Entered

06/03 14:17:43 [CRITICAL] FakeDomain: NlAllocateClientApi timed out: 0 258

06/03 14:17:43 [CRITICAL] FakeDomain: NlpUserValidateHigher: Can't allocate Client API slot.

06/03 14:17:43 [LOGON] SamLogon: Network logon of FAKEDOMAIN\User1 from WORKSTATION1 Returns 0xC000005E

 

Did you notice the 45 second gap between the authentication entering and returning the timeout? Keep in mind that in the real world, the majority of the time the actual timed out request won’t be this straightforward. That 45 second gap would have MANY lines in between (potentially hundreds to thousands).

For trending purposes, including identifying whether the problem is with a single machine and how many and which exact users (and from what domains) are being impacted, we have to do a more granular search. While this can be done manually with Notepad, I would strongly suggest parsing the log file using a script. You could also import the text into Excel and filter through it that way pretty quickly if you prefer that method. For my fingertips’ sake though, I am not going into that filtering method. Instead, I will provide a simple sample of trending for a specific user account.

 

Now, since we’ve already validated that we have a problem (by filtering for and confirming “can’t allocate client API slot”), the strings we are searching for (using the command above) need to change. MCA issues should have return codes of 0xC000005E, which means we can filter for those events to identify users and/or machines with problems. At this point, it’s safe to assume that the 5Es are in fact due to a MCA issue.

 

Let’s use a member server with an MCA issue as an example. Using the below commands, we get the below output:

 

Find /I "C000005E" "C:\temp\netlogon.log" > "C:\temp\MCA-detect-sample.txt"

find /I "C000005E" "C:\temp\netlogon.bak" >> "C:\temp\MCA-detect-sample.txt"

 

NOTE: Remember that a 0xC000005E error does not always indicate a MaxConcurrentApi issue. The way to verify is to run through the validation steps searching for “can’t allocate client API slot” or looking at the Netlogon\Semaphore Timeouts counter. The “5E” error is more useful in trending problem accounts and problem frequency to gauge the actual overall impact (whether the problem reported by users or not).

 

Notice how we get a timestamp, a domain name, a user name, and a machine name along with our error code. This is the ammo you need to start with for trending the issue.

 

Is it happening around 8-9am when users are coming in, and at 12:30-1:30p as users are returning from lunch?

Is the issue occurring all the time?

What users have been impacted?

 

This is only our first step to trending any specific timeframes where this might happen, where requests are coming from, etc.

 

For reference, 0xC000005E translates to STATUS_NO_LOGON_SERVERS, or in plain English, “no logon servers are available”. Other iterations of the error “translation” also exist that you may see in your event logs, but they all have the same error code and meaning.

 

Now, let’s say I want to trend further on a specific user. Let’s use “User3” as an example. For this, we only need to change the search string in the command; we keep reading from the full Netlogon logs so that we see the user’s “Entered” lines as well as the returns, which narrows the scope of the problem down to that account. For this, we use these commands to get the below output:

 

Find /I "User3" "c:\temp\netlogon.log" > "c:\temp\MCA-detect-sample-UserLevel.txt"

Find /I "User3" "c:\temp\netlogon.bak" >> "c:\temp\MCA-detect-sample-UserLevel.txt"

 

 

From this output, we can see a number of authentications occurring, and that it’s happening throughout the course of the logging. For every 45 second gap, we have timeouts. If we filter another user, we can likely see the same behavior. On that note, if you see authentication requests returning immediately with a 0xC000005E error, that is not typically caused by MaxConcurrentApi. Note that for this trend, you will need to review the “entered” line and the “returns 0xC000005E” lines to determine the amount of time the authentication took. Keep in mind that the proper authentication entry may not be the entry immediately above the return line.
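
If you want to script that elapsed-time check for a specific user, here is a rough PowerShell sketch. It assumes the zero-padded MM/dd HH:mm:ss timestamp format shown in the excerpts above and the same hypothetical c:\temp paths, and it uses a naive pairing (each return is matched to the most recent preceding “Entered” line for that user), so, per the caveat above, treat the output as a starting point rather than an exact accounting:

# Pull the "Entered" and timed-out "Returns" lines for a specific user
$hits = Select-String -Path 'C:\temp\netlogon.log', 'C:\temp\netlogon.bak' -Pattern 'User3' |
    Where-Object { $_.Line -match 'Entered|Returns 0xC000005E' }

$entered = $null
foreach ($hit in $hits) {
    # The first 14 characters of each line are the month/day and time stamp
    $stamp = [datetime]::ParseExact($hit.Line.Substring(0, 14), 'MM/dd HH:mm:ss', $null)
    if ($hit.Line -match 'Entered') {
        $entered = $stamp
    }
    elseif ($entered) {
        '{0:MM/dd HH:mm:ss} -> {1:MM/dd HH:mm:ss} : {2} seconds' -f $entered, $stamp, ($stamp - $entered).TotalSeconds
        $entered = $null
    }
}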

 

NOTE: As an easy determination of your success/failure ratio via the logs, you can filter the Netlogon.log and Netlogon.bak file using the find command looking for the strings “returns 0x0” (successful authentication) and “returns 0xc” (which will return ANY authentication that was not successful regardless of the error code). Compare the number of successes to the number of failures and you can get a failure ratio for that machine from the log file if desired.
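
A quick PowerShell take on that ratio check (again assuming the c:\temp paths used throughout this post) could be as simple as:

# Count successful vs. failed authentication returns across both log files
$logs    = 'C:\temp\netlogon.log', 'C:\temp\netlogon.bak'
$success = @(Select-String -Path $logs -Pattern 'Returns 0x0' -SimpleMatch).Count
$failure = @(Select-String -Path $logs -Pattern 'Returns 0xC' -SimpleMatch).Count

"Successes: $success  Failures: $failure"
if (($success + $failure) -gt 0) {
    'Failure ratio: {0:P1}' -f ($failure / ($success + $failure))
}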

 

Now, just to point out a difference; if we were reviewing the Netlogon log on a domain controller that had the problem, the log would look a little different, as it will include the proxying machine name (the application server or proxying domain controller). Here is an example:

 

 

And on that note, I suppose we should talk about identifying the domain controllers. If you have the Netlogon performance counters, then good news: you can track the most recent domain controller connections using the [PERF] lines in the log, narrowed down to the timestamps of the problem window. This will typically be reviewed on the application server, the domain controller, and any trusted domain controllers (each showing the next hop in the chain).

 

One example of the [PERF] lines would be:

 

12/25 01:39:03 [PERF] NlSetServerClientSession: Not changing connection (000000000A10FA48): "\\DC01.FAKEDOMAIN.LOCAL"

 

There are others as well; but in the case of the PERF entries, filtering on the [PERF] string should point you in the right direction fairly quickly (or you can always take a look at perfmon real quick!), although it will contain other performance information as well that we may not yet be interested in.
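
For example, to pull just the [PERF] entries out of the log (same assumed c:\temp path as before):

# Extract the [PERF] lines, which include the domain controller the secure channel is set to
Select-String -Path 'C:\temp\netlogon.log' -Pattern '[PERF]' -SimpleMatch |
    Out-File C:\temp\netlogon-perf-lines.txt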

 

Without the [PERF] lines, it can be a little trickier, as you have to identify the domain controller using other methods. An example would be hunting down the LDAP ping responses. However, I would suggest either installing the hotfix for the Netlogon performance counters for singling out domain controllers (if running Windows Server 2003), or even better, enabling verbose Netlogon logging on all domain controllers in the same site as the application server to expand our view and take a broader look at the big picture.
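
For reference, the usual way to flip verbose Netlogon logging on and off is with nltest, which sets the standard 0x2080FFFF debug flag (the log is written to %windir%\debug\netlogon.log; keep an eye on free disk space while it is enabled, and note that on older operating systems the flag may not take effect until the Netlogon service is restarted, which has the side effects discussed earlier):

# Enable verbose Netlogon logging
nltest /dbflag:0x2080ffff

# Disable it again once data collection is complete
nltest /dbflag:0x0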

 

If the authentication is cross-forest, you will also want to enable Netlogon logging on the domain controllers residing in the same site name in the target forest—cross forest site name matching is important for speeding up authentication! An excellent reference for how domain controllers are located across trusts can be found here.

 

Now, you might be saying to yourself, “oh man, that can end up being a LOT of logging being enabled”, and yes, that is true; but Netlogon is very dynamic in nature, and the system could end up establishing a secure channel with another domain controller (especially if you happened to have restarted the Netlogon service on a problem machine attempting to “fix” the problem), so if you single out one domain controller, you could potentially end up with no useful data. More important, though, is the fact that the problem could be occurring on multiple domain controllers, and we want to be very sure this isn’t the case so as to not face the same headache in the near future (or ever).

 

NOTE: If site names in cross-forest authentication scenarios do not match, then you risk being authenticated by ANY domain controller in the target forest, including those over slow links (which in turn is a common root cause of MCA issues).

 

And by now, you might be asking “But Brandon, I checked my application server and I see the “can’t allocate” errors, but when I checked my domain controllers, I don’t see any of the authentication attempts. Why is that?” And my answer would have to be to tell you good job on checking the next level of the authentication chain, and congratulations because you have found the bottleneck is at the application server. You do not see the authentication attempts because they never successfully made it to the domain controller. This indicates that the root cause is likely at or between the application server and its local domain controller and we know where to begin our deeper dive.

 

Now, WHY IS THE APPLICATION SERVER THE BOTTLENECK???!@#^$

 

Here’s where we hit home on why increasing the MaxConcurrentApi value is what I refer to as a bandage. While yes, you can increase the value and open more channels and severely lessen or even eliminate the impact, you are really just hiding the underlying problem. In most cases, the underlying problem really boils down to a few questions:

1. Am I accessing the local domain’s domain controller quickly? If so;

2. Am I accessing the forest root domain controller quickly (if applicable)? If so;

3. Am I accessing the target domain’s forest root domain controller quickly (if applicable)? If so;

4. Am I accessing the target/resource domain’s domain controller quickly?

 

At all of these levels, you can break it down into more detailed questions:

1. Is the local system healthy for the application server(s) and domain controller(s) in the authentication chain (plenty of RAM, CPU not spiked, winsock ports not exhausted, etc.)? If so;

a. In the example I used above, this would be a possible culprit, and at the application server level, because we see the timeouts, but do not see the authentication attempt appear at the domain controller.

2. What is the average time taken to authenticate a user (average semaphore hold time)?

3. Are there any timeouts (semaphore timeouts or “can’t allocate client api slot” strings)?

4. Am I accessing a domain controller in the same logical site, or the expected domain controller(s) if using SiteCoverage (Netlogon performance counters, [PERF] entries and/or DC Locator process in Netlogon logs)?

a. Do we see any straying to unexpected locations/domain controllers?

 

An authentication attempt should occur within a very short interval of time (typically less than a second); authentication delays indicate problems performing the authentication. MCA will open more channels, but the problem itself still exists.

 

There are various factors that can slow this authentication chain. Traversing the WAN; spiked network bandwidth at the router; router ACLs; DNS/name resolution latency (aka, DNS design issues); incorrect site/subnet associations in Active Directory; mismatched site names in the source and destination forests; port blockage…I could keep going on, but I already wrote a good deal on this in another blog called “Quick Reference: Troubleshooting Netlogon Error Codes” under the 0xC000005E error section. Again, another outstanding reference for how domain controllers are located across trusts can be found here. This is information that is extremely important to know when planning or troubleshooting your environment that can be a direct cause of MCA issues.

 

The bottom line here, if it’s one of those days and you haven’t caught the hint yet, is that you need to get to the underlying problem!

 

UPDATE:

There is a new way to review the Netlogon.log and Netlogon.bak files that was first introduced with Message Analyzer 1.1. Included with the installation of Message Analyzer 1.1 is the Netlogon parser. Among other things, it is capable of automatically detecting the presence of a MaxConcurrentApi issue. In a nutshell, it will detect the “can’t allocate client API slot” key phrase and report back to you that “A MaxConcurrentApi issue has been detected”. Trending and other troubleshooting still remain the same as when reviewing the Netlogon log directly…

 

 

You can read more on the Netlogon parser in the Introduction to the Netlogon Parser and you can read more about basic troubleshooting in the Troubleshooting Basics for the Netlogon Parser blog.

 

Netlogon performance counters-

The Netlogon performance counters give us a great at-a-glance view of whether or not there is an issue, and are necessary to properly tune MaxConcurrentApi to an appropriate setting.

 

The available counters are:

 

 

I will talk more about these counters and what specifically they mean in the section “How do I tune MCA?”. For fast detection purposes, the Semaphore Timeouts, Semaphore Holders, and Semaphore Waiters counters are of the most interest here.

 

Netlogon Performance Counter

Desired Value

What Does It Mean?

Semaphore Timeouts

0

Any value above 0 indicates authentication timeouts are occurring.

Semaphore Waiters

0

Values above 0 in conjunction with Semaphore Holders being less than the configured MCA setting typically indicates the bottleneck is located at another level of the authentication chain.

Semaphore Holders

Less than (or equal to) MaxConcurrentApi setting

If the value is equal to the currently configured MCA setting, then you are using all available threads for authentication and a bottleneck could be at this level.

 

If you happen to look at the “Instances of selected object” area of my screenshot above, you can see we can quickly single out a domain controller the machine holds a secure channel with! Note that for domain controllers contacting other domains, you may see more than one domain controller listed here…this can be a starting point for investigating the entire authentication chain. As I mentioned when discussing detection via the Netlogon logs though, if you want to be thorough, you should expand your view of the enterprise to encompass domain controllers in specific sites.
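
If you would rather query those counters from PowerShell than eyeball Performance Monitor, a sketch like the following (using the counter names discussed in this post) gives the same at-a-glance view, including the per-secure-channel instances:

# Point-in-time look at the Netlogon semaphore counters for every instance (including _Total)
Get-Counter -Counter '\Netlogon(*)\Semaphore Waiters',
                     '\Netlogon(*)\Semaphore Holders',
                     '\Netlogon(*)\Semaphore Timeouts' |
    Select-Object -ExpandProperty CounterSamples |
    Sort-Object Path |
    Format-Table Path, CookedValue -AutoSize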

 

Event logs-

In Windows Server 2012 and above (as well as Windows Server 2008 R2 with SP1 plus KB2654097), additional event log entries become available to track NTLM authentication delays and failures via Netlogon event ID 5816, 5817, 5818, or 5819. These events are EXTREMELY useful for an at-a-glance view. I’m not going to deep dive into these here since they are outlined beautifully in KB2654097, so instead I will say, “Go give that KB a read to get a more in-depth review in this area”. I will however give a breakdown of the events so you can gauge how useful they are!

 

Netlogon Event ID

Event Description

5816

Netlogon has failed an authentication request of account <username> in domain <user domain FQDN>. The request timed out before it could be sent to domain controller <directly trusted domain controller FQDN> in domain <directly trusted domain name>. This is the first failure. If the problem continues, consolidated events will be logged about every <event log frequency in minutes> minutes. Please see http://support.microsoft.com/kb/2654097 for more information.

5817

Netlogon has failed an additional <count> authentication requests in the last <event log frequency in minutes> minutes. The requests timed out before they could be sent to domain controller <directly trusted domain controller FQDN> in domain <directly trusted domain name>. Please see http://support.microsoft.com/kb/2654097 for more information.

5818

Netlogon took more than <warning event threshold> seconds for an authentication request of account <username> in domain <user domain FQDN>, through domain controller <directly trusted domain controller FQDN> in domain <directly trusted domain name>. This is the first warning. If the problem persists, a recurring event will be logged every <event log frequency in minutes> minutes. Please see http://support.microsoft.com/kb/2654097 for more information on this error.

5819

Netlogon took more than <warning event threshold> seconds for <count> authentication requests through domain controller <directly trusted domain controller FQDN> in domain <directly trusted domain name> in the last <event log frequency in minutes> minutes. Please see http://support.microsoft.com/kb/2654097 for more information.

 

The 5816 and 5817 events indicate you have failed/timed out authentication attempts whereas the 5818 and 5819 events tell you that you have authentication slowing down.

 

The event logging frequency and the warning threshold for the number of seconds can be defined in the registry. Please see KB2654097 for more details on how to do this.
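
Once those events are available, a quick way to see whether any of them have fired recently is a filtered query of the System event log from PowerShell. This is a sketch; it assumes the NETLOGON event source and the event IDs described above and in KB2654097, and it looks back 7 days:

# Pull the Netlogon authentication delay/failure events from the last 7 days
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'NETLOGON'
    Id           = 5816, 5817, 5818, 5819
    StartTime    = (Get-Date).AddDays(-7)
} -ErrorAction SilentlyContinue |
    Format-Table TimeCreated, Id, Message -Wrap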

 

Kerberos PAC validation failures due to MaxConcurrentApi will show up as event ID 7 from the Kerberos source (note that MCA is not the only potential cause of this error). An example of the text is below:

Event ID: 7

Category: None

Source: Kerberos

Type: Error

Description: The Kerberos subsystem encountered a PAC verification failure. This indicates that the PAC from the client in realm had a PAC which did not verify or was modified. Contact your system administrator.

 

User notification-

This is the most unfortunate way to identify there is an issue. User calls may come into the service desk/help desk for a persistent or intermittent failure to authenticate with applications or services, missing mapped drives, an inability to print or scan (or “greyed out” devices), or any number of other symptoms. These calls can then expand exponentially to hundreds or thousands of users, depending on the role of the problem system and the scope of its use.

 

How do I change the value of MaxConcurrentAPI?

Before we talk about how to change these values, let’s discuss the defaults, the possible ranges, and what to watch out for. And always remember, if you are changing MCA, it should not be done without:

a) proof that the issue exists, and

b) proper tuning of MaxConcurrentApi, to prevent arbitrarily incrementing the value to levels that may not correct the issue or, worse, may cause system issues.

 

Proof is obtained via the Netlogon log and/or the Netlogon performance counters (Semaphore Timeouts). MaxConcurrentApi is not a catch-all solution; what I mean by that is that it is meant to be tuned on a server-by-server basis. Experiencing a MCA issue on one web server does not dictate that all web servers use that same value.

 

The default settings can vary between operating system levels and the role of the machine. *Most* of the time, any change in the value of MaxConcurrentApi will be at the application server or domain controller level. In my experience, it is rare for a workstation to need to change the value from the defaults.

 

NOTE: One thing to keep in mind is that the defaults are expecting expedient communications between the authenticating system (the one sending the authentication request, such as an app server) and the domain controller in the target domain.

(Table Updated on 5/30/14)

Operating System/Role

Default Threads (per secure channel)

Maximum Threads

Windows 2000 Domain Controllers

1

10

Windows 2000 Member Servers

2

10

Windows 2000 Workstations

1

10

Windows 2003/R2 Domain Controllers

1

10

Windows 2003/R2 Member Servers

2

10

Windows XP Workstations

1

10

Windows 2008 Domain Controllers

1

10 (w/o KB975363)

150 (with SP2 and KB975363)

Windows 2008 Member Servers

2

10 (w/o KB975363)

150 (with SP2 and KB975363 )

Windows Vista Workstations

1

10 (w/o KB975363 )

150 (with SP2 and KB975363)

Windows 2008 R2 Domain Controllers

1

10 (pre-SP1)

150 (with SP1 or KB975363)

Windows 2008 R2 Member Servers

2

10 (pre-SP1)

150 (with SP1 or KB975363)

Windows 7 Workstations

1

10 (pre-SP1)

150 (with SP1 or KB975363)

Windows 2012/R2 Domain Controllers

10

150 (maximum supported)

IMPORTANT: The need to utilize high values up to the maximum supported value (150) may be indicative of an underlying problem. You must identify the root cause(s); which can include but are not limited to: network configuration, network latency, DNS configuration, packet loss, and site/subnet configuration.

Windows 2012/R2 Member Servers

10

150 (maximum supported)

Please see above important note for Windows 2012/R2 Domain Controllers

Windows 8/8.1 Workstations

1

150 (maximum supported)

NOTE: It is unlikely you will need to increment MaxConcurrentApi on a workstation beyond the default value of 1.

 

Those pre-Windows Server 2012 defaults sound scary, I know, but in reality, in a properly configured and well-connected infrastructure, authentication times are within milliseconds and requests are processed quickly enough to not typically be a problem. Of course, Kerberos is still the way to go, because although the initial overhead of grabbing a ticket is a tad bit more work, once that ticket is cached, we don’t need to go through that process again for 10 hours (by default). Of course, if you are utilizing Kerberos PAC validation, you will utilize the Netlogon RPC channel and thus potentially be subject to MaxConcurrentApi bottlenecks. For reference, NTLM has to be authenticated for each call, which is a lot of overhead, and that is one major reason Kerberos is a better way to go.

 

Be aware that raising the value of MaxConcurrentAPI can have performance implications. As such, it is vital that you validate performance on the system after the change. Performance impacts, which may be negligible, are typically in the memory and disk areas; specifically with overall memory usage by lsass.exe, and if verbose Netlogon logging is enabled, IO write times on the hard disk containing the Netlogon log.

 

I would strongly suggest having a performance baseline of at least the basics during peak operational times prior to the change. For our purposes here, by “the basics” I mean the following counters:

 

Performance Counter Set

Why?

Memory

We need to track the overall system memory to ensure we’re not overtaxing the system as a whole

Physical Disk or Logical Disk

We need to track the disk IO to ensure we aren’t overtaxing the disks. This is important when Netlogon logging is enabled (if you are tracking this problem, it should be)!

Process (lsass.exe minimum)

For the purposes of this baseline, lsass.exe is of the most interest because this is where Netlogon operates, however it never hurts to have a view of the processes in case a problem does arise after increasing MaxConcurrentAPI (coincidences are after all possible).

Processor

We need to monitor the processor to ensure we do not introduce pressure that may cause issues on the system.

Network Interface

Optional, but recommended

Netlogon

Tracking Netlogon enables us to get a holistic view of Netlogon. This includes delays in authentication, as well as the timeouts themselves.

This is a quick way outside of the Netlogon logs to see if you have authentication timeouts occurring.

Note that for proper trending and analysis you do still need to utilize the Netlogon logs, which will allow you to dive deeper into the problem (determine how many users were impacted, how often they were impacted, the exact error codes, the source of the “bad” authentication, etc.).
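
One way (of many) to grab that baseline without building a data collector set by hand is a timed Get-Counter capture exported to a BLG. This is just a sketch covering the counter sets above; the specific counter paths, sample interval, duration, and output path are all assumptions to adjust for your environment:

# Capture a lightweight baseline (5-second samples for roughly 30 minutes) to a BLG
$counters = '\Memory\Available MBytes',
            '\LogicalDisk(*)\Avg. Disk sec/Write',
            '\Process(lsass)\% Processor Time',
            '\Process(lsass)\Private Bytes',
            '\Processor(_Total)\% Processor Time',
            '\Network Interface(*)\Bytes Total/sec',
            '\Netlogon(*)\*'

Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 360 |
    Export-Counter -Path 'C:\temp\MCA-baseline.blg' -FileFormat blg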

 

You do not want to arbitrarily raise the value of MaxConcurrentApi. This is a setting that should be, at least for the most part, considered a bandage; it should be tuned on a machine-by-machine basis, and only applied to machines that exhibit MCA-related issues. Arbitrarily raising it can waste system resources that could be used for other purposes, and can cause problems when using older or lower-end hardware specs (think x86 and small amounts of RAM here especially). To that end, we will also cover tuning MaxConcurrentApi within this blog to get you armed. Remember the saying “if it ain’t broke, don’t fix it”….it applies here.

 

There are certain exceptions to the “consider it a bandage” frame of mind. For instance, a heavily used LOB application running on Windows Server that is not capable of Kerberos authentication could be a candidate for monitoring and evaluation, because we expect a high NTLM authentication load. In this situation, if we find a MCA issue and tune for it, we may need to keep this value in place until an application capable of Kerberos authentication can replace it, which may be some time down the road.

 

With today’s server hardware with umpteen-gazillion GBs of RAM, super-fast processors, and better storage tech, you might just find that MCA has negligible impact when the system is under heavy load, even at its highest level. In that situation, you are likely safe to keep the value higher. In doing so though, you are utilizing resources that could be used for other purposes, but more importantly, you can mask the problem if you do not have proper monitoring in place. If you mask the problem (without ample monitoring) with the values maxed out, then by the time you know you have a problem, you will have a really big problem! So again, I do not recommend using values that have not been tuned to the specific server.

 

Now, with all that being said, we can change the value of MaxConcurrentApi using the registry. There is no direct/native group policy counterpart to configure this setting (remember, it’s not intended to be used broadly across systems, although it can be).

 

To change the value:

1. Open Regedit

2. Browse to HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters

3. Create a new DWORD value named “MaxConcurrentAPI” (no quotes)

4. Double click the MaxConcurrentApi value and set the data to the desired value (based on the tuning performed) in decimal

Valid range reminder:

i. Windows 2000: 1-10

ii. Windows XP, Windows Server 2003/2003 R2: 1-10

iii. Windows Vista, Windows 7, Windows Server 2008/2008 R2: 1-150 (certain conditions apply)

iv. Windows 8/8.1, Windows Server 2012 and above: 1-150 (maximum supported) ** Please see the important note in the default and maximum threads table!

5. Restart the Netlogon service
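
If you prefer to script the change rather than click through Regedit (for example, to keep a record of exactly what was set and where), a PowerShell equivalent of the steps above would look something like this (the value of 5 is only an example; use YOUR tuned number):

# Set MaxConcurrentApi to a tuned value
$params = 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters'
New-ItemProperty -Path $params -Name 'MaxConcurrentApi' -PropertyType DWord -Value 5 -Force

# The Netlogon service must be restarted for the new value to take effect
Restart-Service -Name Netlogon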

 

How do I tune MCA?

Tuning MCA can be a bit tricky, and it MUST be done during peak operational times in order to obtain valid values, and I can’t stress that enough! KB928576 covers the basics of the Netlogon performance counters themselves, while KB2688798 covers tuning MaxConcurrentApi, which we will also be covering here.

 

Be advised that there is a known issue in Windows Server 2008 R2 SP1 where the Netlogon Semaphore Holders and/or Netlogon Semaphore Waiters performance counters may be incorrectly displayed in performance monitor. A hotfix (KB2685888) has been created for this issue that requires Windows Server 2008 R2 SP1 to be installed.

 

If the projected tuned MCA value is higher than the maximum value available on the operating system, and you have diagnosed and repaired any problems between the application server and domain controller, then it is an indicator that you need to add additional resources (think virtual machines or hardware). This could mean load balancing your web front end, or adding another domain controller….

So first off, let’s take a closer look at the Netlogon performance counters, and whether or not they will be used in tuning MCA (the basic technical explanation is also covered in KB928576):

 

Performance Counter

Technical Explanation

The Grocery Store Scenario (an easier way to think of it…)

Used for Tuning?

Semaphore Holders

The number of threads that are holding the semaphore. This number can be any value up to the currently configured value of MaxConcurrentApi.

The number of people currently checking out at the cash registers in the grocery store

NO

Semaphore Waiters

The number of threads that are waiting to obtain the semaphore

The number of people waiting in line to checkout

NO

Semaphore Timeouts

The TOTAL number of times that a thread has timed out while waiting for the semaphore over the lifetime of the secure channel (or since system startup).

The TOTAL number of people who ran out of time to wait to checkout so abandoned the line (if the line was quicker, or there were more lines, this may not have happened!).

YES

Average Semaphore Hold Time

The average time (measured in seconds) that the semaphore is held over the last sample

The average time it took a person to get through the line and checkout

YES

Semaphore Acquires

The TOTAL number of times the semaphore has been obtained over the lifetime of the secure channel (or since system startup)

The TOTAL number of people that successfully went through the checkout

YES

 

The most precise method of determining what the value of MaxConcurrentAPI should be for your servers is to utilize the Netlogon performance counters. For an accurate assessment, this must be done from all servers in the authentication chain. This means the application server(s), domain controllers in the same domain/site as the application server, and any domain controllers being contacted in the trusted domain(s).

 

Netlogon counters and logs can be used to refine which domain controllers in the local and trusted domains to target if you need to streamline data collection. If the Netlogon performance counters are on the system (Windows 2003 and up), the domain controller connections can be easily tracked with the [PERF] lines within the Netlogon log; however, Performance Monitor offers you an at-a-glance view of the domain controller the system currently holds a secure channel with (very handy indeed).

 

If you are streamlining the data collection and targeting individual domain controllers, I also strongly suggest collecting a network trace during the problem, if at all possible. There is a possibility of missing the data you need when you target specific domain controllers, but more importantly, the problem could also exist on other machines that have thus far gone unnoticed. It’s for this reason that I recommended data collection from all domain controllers in the same site. The same could be said for enabling logging on all application servers (all nodes in a cluster; all servers in a load balanced configuration; etc.).

 

If you are running Windows 2008 R2 or above, you can use netsh to collect a network trace with the following command (this is one variation of many that will work):

 

Netsh trace start capture=yes tracefile=c:\temp\MCAdetect.etl

 

You stop the trace by running the command while logged on with the same account that started the trace:

 

Netsh trace stop

 

This sequence of commands will generate an ETL file with the name you provided (MCAdetect.etl in this example) in the path you provided; along with a cab file that contains other network information (as well as another copy of the ETL). This ETL can be opened and viewed with Message Analyzer or Network Monitor 3.x when used in conjunction with the full Windows parsers.

 

The counters of interest when tuning MaxConcurrentApi are the TOTALS for Netlogon semaphore acquires, semaphore time-outs, and average semaphore hold time. You also want to track the number of seconds you are monitoring the data in Performance Monitor. This should not be a long sample interval; 90-120 seconds should be sufficient. Remember that the Performance Monitor report view is your friend as well, and it is used to determine your average semaphore hold time.

 

The formula for determining your ideal setting (per machine) is:

 

(semaphore acquires + semaphore timeouts) * average semaphore hold time / collection time in seconds = <new MaxConcurrentApi setting>

 

As an example formula, I will first refer to the example provided in KB2688798. Let’s say we find the following values from our performance measurements:

 

Semaphore acquires = 8286

Semaphore time-outs = 883

Average semaphore hold time = .5 seconds

Performance monitoring duration = 90 seconds

 

Our formula in the above scenario becomes: (8286 + 883) * .5 / 90 = 50.938, which rounds up to a MaxConcurrentApi setting of 51.

 

Remember!: If the value of the formula is approaching, equal to, or greater than the MaxConcurrentAPI maximum for the operating system, then this would indicate that more application servers or domain controllers are necessary to support the legacy authentication load (depending on where the bottleneck is detected).

 

The KB contains an outstanding example, and it gives us a nice round setting to use. But if you look at the average semaphore hold time…a half a second…now that’s a pretty good delay in authenticating. So although we now have a tuned number, we also know that authentication is occurring much more slowly than desired. As a result, we need to be digging deeper…why is the authentication request so slow?

1. Do we have a site/subnet configuration problem (missing site; missing subnet; improper subnet-to-site affiliation; etc.)?

2. Are we taking a slow network path?

3. Are domain controllers registered in the site the problem machine belongs to?

4. If crossing a trust for authentication, does the site name that the proxying domain controller belongs to exist in the target domain/forest?

5. Do we have port blockage?

6. ……

 

As you can see, this list can go on, as there are numerous potential causes for a slow authentication attempt. In the meantime, we can have a bandage in place to work around the problem until we find the root cause.

 

Of course, since we’re talking about the real world here…things don’t always work out that way. If for instance you run across a root cause that cannot be circumvented; let’s say an extremely slow WAN link that cannot be upgraded until the next budget review/approval cycle the following year. This is an example of a situation where MaxConcurrentApi should be tuned and left in place until such time that the WAN link upgrade is complete or until domain controllers from the trusted domains can be added to an area that contains a fast network link (aka, the same LAN segment).

 

To provide a real world example for the detection, let’s look at a BLG from an application server that runs IIS, FTP, and file and print server functions, where data collection was enabled hours into the problem and the system was not being monitored prior to the issue.

 

Now, let’s take a look at the number of timeouts in this almost 2 minute duration. If you take the maximum timeouts (which is the right end of the graph) minus the minimum timeouts (which is the left end of the graph), you get the total number of timeouts. In this case, there were 2253 timeouts in less than 2 minutes. Now that’s certainly a large value for a counter that should be at zero! Houston, we have a problem! And based on the number of timeouts at the minimum level, it’s a BIG problem….

 

 

Now that we know we have a problem, maybe we want a quick view of how many authentications are waiting in line. For this we can take a look at the Semaphore Waiters counter, which provides us a quick view of potential impact. We can expect at least a subset of these “waiters” to time out as well, with timeouts occurring at the frequency the performance log shows. For this example, we can see a maximum of 2157 “waiters”; this means 2157 authentications queued up and just waiting to be authenticated…

 

 

Just from a glance that would take us just a minute to look at in all, we can see that we have a fairly large MaxConcurrentApi issue that we need to address.

 

So, let’s take this scenario and figure out what we need to tune to. In the example here, let’s say I have a BLG that spans 12 hours to start with. First, let’s narrow that view down significantly, to a window within peak operational timeframes. I will knock it down to 90 seconds for our purposes here and ensure I am viewing the Netlogon counters (_Total) for Semaphore Acquires, Semaphore Timeouts, and Average Semaphore Hold Time. These are the counters we need for tuning MCA, as seen in the screenshot below.

 

 

As of right now, we can put in the collection time into our formula, because we already know that. As a result, our formula becomes:

 

(semaphore acquires + semaphore timeouts) * average semaphore hold time / 90 = <new MaxConcurrentApi setting>

 

Now let’s take a look at our next easy to find value, the Average Semaphore Hold Time. To see this, we need to go to the Report view in Performance Monitor to get a view like this:

 

 

So at this point, now we know 2 of our variable numbers and can adjust the formula to be:

 

(semaphore acquires + semaphore timeouts) * .098 / 90 = <new MaxConcurrentApi setting>

 

Cool! Now we only need to determine the semaphore acquires and semaphore timeouts and we can do some math!

 

Both the semaphore acquires and semaphore timeouts values are cumulative, which means in order to determine the values to use in our formula, we need to do a bit of subtraction.

 

Let’s look at semaphore acquires. The first image below is the starting value (the value we will be subtracting), and the second image is the ending value (the value the starting value will be subtracted from).

 

 

In this case, for the Semaphore Acquires counter we have a difference of 1833 (729272 – 727439 = 1833), which provides us a new value for our formula:

 

(1833 + semaphore timeouts) * .098 / 90 = <new MaxConcurrentApi setting>

 

The last item we need to figure out is the timeouts. We make this determination the exact same way we do for Semaphore Acquires. Let’s take a look at the starting and ending values (images 1 and 2 below).

 

 

For the Semaphore Timeouts counter we have a difference of 1983 (260087 – 258104 = 1983), which provides us a new, and the last, value for our formula, allowing us to calculate our tuned value:

 

(1833 + 1983) * .098 / 90 = 4.1552

 

Since this value is slightly above 4, our safest bet in this instance would be to tune MaxConcurrentApi to a setting of 5 while we try to determine the true root cause to resolve the issue. You *might* be able to slide by with 4 in this case, but it’s better to err on the side of caution.
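
To tie the measurement and the math together, here is a rough PowerShell sketch of the whole calculation. It takes two snapshots of the _Total Semaphore Acquires and Semaphore Timeouts counters 90 seconds apart, grabs the Average Semaphore Hold Time at the end of the window, and then runs the KB2688798 formula. As stressed above, run it during peak load, on every machine in the authentication chain, and treat the result as a starting point rather than an absolute answer:

$acq     = '\Netlogon(_Total)\Semaphore Acquires'
$tmo     = '\Netlogon(_Total)\Semaphore Timeouts'
$hold    = '\Netlogon(_Total)\Average Semaphore Hold Time'
$seconds = 90

# Starting snapshot of the cumulative counters
$start = (Get-Counter -Counter $acq, $tmo).CounterSamples

Start-Sleep -Seconds $seconds

# Ending snapshot, plus the average hold time over the last sample
$end = (Get-Counter -Counter $acq, $tmo, $hold).CounterSamples

$acqStart = ($start | Where-Object { $_.Path -like '*semaphore acquires' }).CookedValue
$tmoStart = ($start | Where-Object { $_.Path -like '*semaphore timeouts' }).CookedValue
$acqEnd   = ($end   | Where-Object { $_.Path -like '*semaphore acquires' }).CookedValue
$tmoEnd   = ($end   | Where-Object { $_.Path -like '*semaphore timeouts' }).CookedValue
$holdTime = ($end   | Where-Object { $_.Path -like '*average semaphore hold time' }).CookedValue

$tuned = (($acqEnd - $acqStart) + ($tmoEnd - $tmoStart)) * $holdTime / $seconds
"Acquires: $($acqEnd - $acqStart)  Timeouts: $($tmoEnd - $tmoStart)  Avg hold: $holdTime sec"
"Suggested MaxConcurrentApi (rounded up): $([math]::Ceiling($tuned)) (raw value: $tuned)"

Get-Counter also accepts a -ComputerName parameter if you would rather sample a remote domain controller than run the sketch locally on each machine.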

 

NOTE: There is also a script out there that can review and determine if a MCA issue is occurring (it can be found here), and at this point in the blog, you should have a grasp on scripting this detection yourself as well.

 

Monitoring

Hopefully during the course of this blog, the thing mostly left unsaid has sunk in: proactive monitoring is the best way to avoid costly outages due to legacy authentication mechanisms (and Kerberos PAC validation). Regardless of the monitoring method you use (SCOM or a 3rd party product), I would suggest expanding it to look for the issues we’ve covered here. The SCOM management pack has been expanded to include MaxConcurrentApi issue detection.

 

 

Brandon “long-winded” Wilson

 
