Monitoring and troubleshooting in Service Management Automation
Published Feb 15 2019 09:32 PM
First published on TECHNET on Mar 18, 2014

This is a blog post that I hope you won’t have to use much, but as we all know, there are times when our systems do not run as expected. When that happens, we need to figure out what is going on and get the systems back to a healthy state. In this post I’ll show you how to monitor and diagnose issues that might come up in the Service Management Automation (SMA) feature of Orchestrator.

I will cover three scenarios that could arise when managing the SMA infrastructure, and discuss how you can receive notifications about these issues and come up with possible solutions. I rely heavily on the SMA Management Pack (MP) that shipped alongside System Center 2012 R2 to monitor SMA, and then dive into troubleshooting steps for the issues that the MP may identify.

Let’s start with a basic scenario so that we can get familiar with the Operations Manager SMA Management Pack and its capabilities. One thing that can happen from time to time is that, during maintenance of the SQL Server instance hosting the SMA database (perhaps during a backup or recovery), the SMA database is taken offline or SQL Server itself is not started. There are rules built into the MP that will quickly identify this and show you what is happening.

As you can see in the image above, we have an alert fired in the Alerts view that indicates that a connection to the database is failing for both the SMA Runbook Worker and the SMA web service. The monitor has built-in knowledge on what the possible causes are and how to remedy the situation. You can then work to get the database back online, after which the monitor will automatically resolve when it performs the next test.
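When this alert fires, a quick manual check from the Runbook Worker or web service host can confirm whether the database is reachable. The following is only a minimal sketch, not something the MP runs for you. The function name Test-SmaDatabaseConnection, the server name sql01, and the database name SMA are assumptions to replace with your own values, and it relies on the .NET System.Data.SqlClient classes available in Windows PowerShell with integrated authentication.

```powershell
# Hypothetical helper (not part of the SMA module): tries to open a SQL
# connection and reports whether it succeeded.
function Test-SmaDatabaseConnection {
    param(
        [Parameter(Mandatory)] [string] $ServerInstance,  # e.g. 'sql01' (placeholder)
        [string] $Database = 'SMA',                        # assumed database name
        [int] $TimeoutSeconds = 5
    )
    $connectionString = "Server=$ServerInstance;Database=$Database;" +
                        "Integrated Security=True;Connect Timeout=$TimeoutSeconds"
    $connection = New-Object System.Data.SqlClient.SqlConnection $connectionString
    try {
        $connection.Open()      # throws if the server or database is unavailable
        return $true
    }
    catch {
        Write-Warning "Connection failed: $($_.Exception.Message)"
        return $false
    }
    finally {
        $connection.Dispose()
    }
}

Test-SmaDatabaseConnection -ServerInstance 'sql01'
```

A result of False here, combined with the MP alert, points at the database or SQL Server rather than at SMA itself.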

Before I go into the next scenario, I highly recommend that you read the SMA Management Pack guide so that you become familiar with all of the views, rules, and alert overrides that you will want to understand in order to tune the MP to your specific environment.

For the next scenario, let’s walk through what might happen if you see runbook jobs queuing on your runbook workers, and then discuss actions you could take. When the SMA Message Queue Length goes over the specified threshold (the default is 20 messages in the queue, but you can override this), an alert is fired as shown below.

If you look at the Alert Details in the image below you will see information about why this alert happened, the causes for it, and the possible resolutions. It also shows the configuration of the alert so you can override these, if needed, for your specific environment.

As you can see above, a possible resolution for this is to increase the number of worker roles in your environment as this will allow you to process more jobs concurrently and therefore not build up a queue of jobs. Of course, we don’t just want to start adding Runbook Workers without understanding if we really need them, as this alert might just be an anomaly (a sudden spike in jobs for a particular reason) and once these jobs are processed from the queue the runbook worker will go back to normal operation.

You can determine this by investigating the historical performance of the Runbook Worker. If you expand the Microsoft Service Management Automation folder and click on the Performance Counters view you will be able to select the counters that you are interested in to understand what is going on.

As you can see in the image above, there has been a spike in the Message Queue Length, which I created by running this PowerShell loop using the SMA cmdlets:

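# $web is assumed to already hold the SMA web service endpoint URL.
# Queue the Sleep5 runbook 71 times (0 through 70) to build up a backlog.
for ($i = 0; $i -le 70; $i++) {
    Start-SmaRunbook -WebServiceEndpoint $web -Name Sleep5
}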

The spike only lasted for a few minutes and then went back to normal. In this scenario, you can safely leave the runbook workers as they are, since this is just a rare occurrence and it would not make sense right now to add additional workers.

However, if you start to see this happening often, with the queue length continuing to grow and rarely, if ever, returning to zero, it is something you should be concerned about and will need to act on. The solution recommended in the alert, adding Runbook Workers, is probably the right approach and should get you back to a healthy state.
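The distinction between a one-off spike and a sustained backlog can be made concrete with a small helper. This is only a sketch: Get-QueueTrend is a hypothetical function (it is not in the SMA module or the MP), the samples are queue-length values you would copy from the performance view, and the 50% cut-off is an assumption to tune for your environment. Only the threshold of 20 comes from the MP default mentioned earlier.

```powershell
# Hypothetical helper: classifies a series of queue-length samples.
function Get-QueueTrend {
    param(
        [Parameter(Mandatory)] [double[]] $Samples,  # queue lengths, oldest first
        [double] $Threshold = 20,                     # MP default alert threshold
        [double] $SustainedFraction = 0.5             # assumed cut-off, tune as needed
    )
    # Count samples above the alert threshold and samples where the queue drained.
    $over = @($Samples | Where-Object { $_ -gt $Threshold }).Count
    $zero = @($Samples | Where-Object { $_ -eq 0 }).Count

    if ($over -eq 0) { return 'Healthy' }

    # Sustained: mostly over threshold and the queue never returned to zero.
    $fractionOver = $over / $Samples.Count
    if ($fractionOver -ge $SustainedFraction -and $zero -eq 0) { return 'Sustained' }

    return 'Spike'
}

Get-QueueTrend -Samples 0,2,71,40,5,0      # Spike
Get-QueueTrend -Samples 25,40,55,70,90     # Sustained
```

A result of Spike matches the first case above, where the workers can be left alone; Sustained is the case where tuning or adding workers is worth investigating.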

Before you do that, you might want to see if you are actually using your existing Runbook Workers to their full potential. As Beth Cooper, a fellow PM on the team, blogged about in Configure Service Management Automation for Optimum Performance, it is possible to tune the Runbook Workers to process more or fewer jobs depending on how the system is holding up.

If you look at the performance view screenshot above, you can see that I also added the “% Processor Time” counter so I could see how the Runbook Worker is performing overall. I can then add the “Memory Consumption” counter to see if there are any constraints there and get an overall view of how this Runbook Worker is doing under normal and peak conditions. If I notice that the runbook jobs in my system are not really memory or processor intensive (for example, they mostly monitor and call into other systems to orchestrate tasks), I could then modify the configuration so the worker can handle more concurrent jobs.
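If you want to look at the same kind of data outside the Operations Manager console, you could sample standard Windows counters on the worker directly. This sketch uses the built-in Get-Counter cmdlet. The computer name smaworker01 is a placeholder, and '\Memory\Available MBytes' is a standard Windows counter substituted here for the MP's memory view, not the MP counter itself.

```powershell
$worker   = 'smaworker01'   # placeholder Runbook Worker name
$counters = '\Processor(_Total)\% Processor Time',
            '\Memory\Available MBytes'

# Take 12 samples, 5 seconds apart (one minute in total).
$samples = Get-Counter -ComputerName $worker -Counter $counters -SampleInterval 5 -MaxSamples 12

# Average each counter over the sampling window.
$samples.CounterSamples |
    Group-Object -Property Path |
    ForEach-Object {
        [pscustomobject]@{
            Counter = $_.Name
            Average = ($_.Group | Measure-Object -Property CookedValue -Average).Average
        }
    }
```

Running this once during normal load and once during a queue spike gives a rough comparison of how much headroom the worker has.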

The below is taken from Beth’s blog:

SQL speed was consistently the largest bottleneck throughout our testing, specifically in stress conditions. Be aware of the number of jobs being stored and how frequently you are writing to the database. We also found that it is best to limit the total number of concurrent jobs on any particular worker and have set the default value in the SMA configuration file Orchestrator.Settings.config.

The values in the file are:

  • MaxRunningJobs – The number of jobs that can run concurrently in a Sandbox.
  • TotalAllowedJobs – The total number of jobs that a Sandbox can process during its lifetime. When this limit is hit, the Sandbox is no longer assigned new jobs and the existing jobs are allowed to complete. After that, the Sandbox is disposed.
  • MaxRunningJobsPerWorker – The number of concurrent jobs that can run in all the existing Sandboxes on a Runbook Worker at a time.
  • MaxConcurrentSandboxes – The number of Sandboxes that can run on a Runbook Worker at once. A new Sandbox is created to handle new module versions or to handle the case when the existing Sandbox has reached the limit set on TotalAllowedJobs.

The suggested limit on the number of concurrent jobs that can be run on any particular worker (MaxRunningJobsPerWorker) defaults to 120. You can modify this number, although we don’t recommend increasing it unless you know that your workload consists mostly of non-resource-intensive runbooks such as monitoring jobs that don’t consume many resources but that run for long periods of time.
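To see how these settings interact, it can help to work out the effective ceiling on concurrent jobs for one worker. The sketch below reflects one reading of the definitions above: the Sandboxes together can run at most MaxRunningJobs × MaxConcurrentSandboxes jobs, and MaxRunningJobsPerWorker caps the total. Get-WorkerJobCeiling is a hypothetical helper, and the values 30 and 10 are illustrative inputs rather than documented defaults. Only the 120 comes from the text above.

```powershell
# Hypothetical helper: effective concurrent-job ceiling for one Runbook Worker.
function Get-WorkerJobCeiling {
    param(
        [Parameter(Mandatory)] [int] $MaxRunningJobs,          # per Sandbox
        [Parameter(Mandatory)] [int] $MaxConcurrentSandboxes,  # per worker
        [int] $MaxRunningJobsPerWorker = 120                    # default cited above
    )
    # Capacity if every allowed Sandbox were running at its per-Sandbox limit.
    $sandboxCapacity = $MaxRunningJobs * $MaxConcurrentSandboxes

    # The per-worker limit caps whatever the Sandboxes could run in total.
    [Math]::Min($MaxRunningJobsPerWorker, $sandboxCapacity)
}

Get-WorkerJobCeiling -MaxRunningJobs 30 -MaxConcurrentSandboxes 10   # 120
Get-WorkerJobCeiling -MaxRunningJobs 10 -MaxConcurrentSandboxes 5    # 50
```

In the first call the per-worker limit is the binding constraint, so raising the Sandbox settings alone would not increase throughput. In the second, the Sandbox settings are the limit.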

For the last scenario, I’ll show you how you can monitor the runbook jobs themselves and what actions you can take when you see alerts for them. These alerts for runbook jobs do not fire by default. You can configure overrides if you want them to generate alerts, or you can simply look at the Jobs Failed, Jobs Stopped, or Jobs Suspended views if you just want to see how the jobs are doing overall.

In the image above, we have received a “Job status suspended by exception” alert, along with information on the specific issue that caused the job to suspend. Similar alerts can be made to fire for stopped and suspended jobs. Given this information, you can then decide if this is something that you need to address or if it is a transient issue that will resolve over time (perhaps the system the runbook is monitoring is down during maintenance).

These alerts are based on ETW (Event Tracing for Windows) events that are written to the Operational log on the Runbook Worker and web service.

You can remote into a Runbook Worker and look at this specific log if you are doing active troubleshooting on a particular computer. One thing to notice here is that we have a GUID associated with our SMA provider as shown below:

<Provider Name="Microsoft-ServiceManagementAutomation" Guid="{2225E960-DE42-45EA-9940-DB3C9DC96AAF}" />

This allows us to uniquely log events against this provider. If you want to get additional information about what is happening on an SMA Runbook Worker or web service you can collect even more logs (who doesn’t want more logs to process!) by enabling tracing on a specific worker or web service.

In the above command prompt, I am using the built-in Windows Logman tool to start a specific data collector by specifying the GUID for SMA as shown in the Event Log view. I can then reproduce the issue I’m seeing on this particular host, and afterwards stop the trace.

This will produce dumpfile.xml and summary.txt files that have extensive details on what is going on throughout SMA during the time you are experiencing issues.

You can look through this file for a particular issue to help you get a complete view of what is going on. Hopefully you will never need to go into these traces, but if you get to a point where something strange is going on, it may prove a useful source of information. Remember to stop these trace log collectors when you are done so you don’t add any additional load.
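For reference, a typical trace session of the kind described above looks something like the following sketch. The session name and output path are assumptions, so pick whatever suits your host. The provider GUID is the SMA one shown earlier, and tracerpt is the standard Windows tool for converting an .etl trace into readable files.

```powershell
# Run from an elevated PowerShell prompt on the Runbook Worker or web service host.
$smaProvider = '{2225E960-DE42-45EA-9940-DB3C9DC96AAF}'   # SMA provider GUID from the event log
$sessionName = 'SmaTrace'                                   # assumed session name
$etlPath     = 'C:\Temp\SmaTrace.etl'                       # assumed path; the folder must exist

# Start an event trace session for the SMA provider.
logman start $sessionName -p $smaProvider -o $etlPath -ets

# ... reproduce the issue on this host ...

# Stop the session so it does not keep adding load.
logman stop $sessionName -ets

# Convert the binary trace; by default tracerpt writes dumpfile.xml and
# summary.txt to the current directory.
tracerpt $etlPath
```

The GUID is kept in a quoted string because PowerShell would otherwise treat the braces as a script block.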

One last thing I’d like to mention is that as I was using the management pack, although we have an alert for jobs suspended due to exception, I didn’t see a view for these events.

Thankfully, this is easily resolved in Operations Manager given its flexibility, so I just created a new view looking for event number 3186 (suspended jobs due to an exception and not stopped by a user).
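If you are already logged on for active troubleshooting, the same events can also be pulled straight from a worker's event log. This is a minimal sketch using the built-in Get-WinEvent cmdlet. The computer name smaworker01 is a placeholder, and the provider name is the one shown in the event XML earlier.

```powershell
$worker = 'smaworker01'   # placeholder Runbook Worker name

# Most recent 20 "job suspended by exception" events from the SMA provider.
# Get-WinEvent reports an error if no matching events exist.
Get-WinEvent -ComputerName $worker -MaxEvents 20 -FilterHashtable @{
    ProviderName = 'Microsoft-ServiceManagementAutomation'
    Id           = 3186
} | Select-Object TimeCreated, Id, Message
```

Filtering on the provider name rather than a log name avoids having to know the exact path of the Operational log.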

That basically covers the types of monitoring and diagnostics scenarios you need to be aware of when managing SMA, and how to go about resolving issues and sizing your environment correctly. As you can imagine, there are a lot of other things that the Management Pack for SMA covers that I didn’t discuss here, so please download the MP as it is sure to be an invaluable tool.

Until next time, happy monitoring and troubleshooting SMA!

Last update: Mar 11 2019 10:04 AM