As a savvy customer who creates automation solutions for managing the services enabled with the Windows Azure Pack (WAP), you are of course getting very familiar with the new Service Management Automation (SMA) capability. SMA enables you to automate the repetitive and complex tasks in your datacenter and to integrate with other systems in order to save you time and money and reduce errors. SMA provides you with a complete solution for creating and managing the PowerShell workflows that power your automation.
Once you start creating workflows and their associated modules and settings, and you start using them on a daily basis, you will want an operator's view into what resources are in the system, when the workflows are running, what their current states are, and how they are performing. Simply put, you will want to know how all of your automation is doing at all times. And if some unexpected issue arises, you will want to be able to troubleshoot and debug it quickly and get the automation back online.
SMA helps you out here. SMA provides features for both monitoring the state of your workflows and for troubleshooting your automation jobs. In this post, let’s take a tour through these features. (In a future post we will cover the SMA management pack and monitoring of the overall health of the SMA infrastructure.)
When you first open the WAP service admin portal and navigate to the Automation resource you will be presented with the all-up automation dashboard.
In this dashboard, there are three key sections that inform you about the system:
1. At the top of the dashboard is a chart that shows you the status of each automation job during the time period you choose (from the last hour up to the last 30 days). In this chart you can quickly see how many jobs are running right now, how many have completed, and, most importantly, any jobs that require your attention: those that are suspended or failed. At the top of the chart is a row of icons that represent each possible job state; you can click these icons to toggle particular status lines on and off so that you can focus on the job states of particular interest.
2. Just below the chart is the jobs table. This table contains an entry for each job represented in the chart, and shows the name of the runbook (workflow), the time the job was last updated, and the current status of the job. Thus, if there are any jobs that require your attention, you can use this table to quickly identify the exact jobs and then drill in to troubleshoot.
3. Just to the right of the jobs table is the quick glance section. This section contains useful static information about the system, such as the number of runbooks, modules, and settings. It also indicates how many runbooks are currently in an authoring state.
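The way the chart and jobs table surface jobs that need attention can be sketched in a few lines. This is only a rough model in plain Python, not the SMA API: the record fields and every runbook name other than Delete-Subscription are made up for illustration.

```python
from collections import Counter

# Hypothetical job records. The field names are assumptions for illustration,
# not the schema that SMA actually uses.
jobs = [
    {"runbook": "Delete-Subscription", "status": "Suspended"},
    {"runbook": "Delete-Subscription", "status": "Suspended"},
    {"runbook": "New-Subscription", "status": "Completed"},
    {"runbook": "Sync-Inventory", "status": "Running"},
    {"runbook": "Sync-Inventory", "status": "Failed"},
]

# The job states that call for an operator's attention.
NEEDS_ATTENTION = {"Suspended", "Failed"}


def summarize(jobs):
    """Count jobs per status, like the status lines on the dashboard chart."""
    return Counter(job["status"] for job in jobs)


def needing_attention(jobs):
    """Return the jobs an operator would want to drill into."""
    return [job for job in jobs if job["status"] in NEEDS_ATTENTION]


status_counts = summarize(jobs)
attention = needing_attention(jobs)
print(dict(status_counts))
print([job["runbook"] for job in attention])
```

The point of the sketch is the split between the two views: the counts answer "how is everything doing?", while the filtered list answers "what do I need to look at first?".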
As a manager of this automation system, from this dashboard you immediately notice that there are several jobs with the state of Suspended (yellow line in the screenshot above). Because automation jobs are expected to run to completion (unless the runbook author intended them to suspend), this is a problem, and you will want to troubleshoot the issue. One thing you notice is that it is always the same runbook, Delete-Subscription, that is being suspended. At this point you could click one of the Job Last Update links in the jobs table and drill directly into the job details; however, in this case you want to see when the jobs started suspending, so in the jobs table you click the Delete-Subscription runbook name and navigate to the dashboard for that runbook to get a historical view of its jobs.
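The question you are asking of the historical view, "when did this runbook start suspending?", boils down to a filter and a minimum. Here is a minimal sketch in plain Python, assuming a made-up job history (the timestamps and the New-Subscription runbook are invented sample data, not taken from a real system):

```python
from datetime import datetime

# Hypothetical job history: (runbook name, status, last-update time).
# All values are invented sample data.
history = [
    ("Delete-Subscription", "Completed", datetime(2014, 1, 6, 9, 0)),
    ("Delete-Subscription", "Suspended", datetime(2014, 1, 7, 9, 0)),
    ("Delete-Subscription", "Suspended", datetime(2014, 1, 8, 9, 0)),
    ("New-Subscription", "Completed", datetime(2014, 1, 8, 10, 0)),
]


def first_suspension(history, runbook):
    """Return the earliest time a job of `runbook` was suspended, or None."""
    times = [
        when
        for name, status, when in history
        if name == runbook and status == "Suspended"
    ]
    return min(times) if times else None


started_suspending = first_suspension(history, "Delete-Subscription")
print(started_suspending)
```

Knowing the first suspension time gives you something to correlate against, such as a configuration change made around the same time.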
The runbook dashboard looks very similar to the automation dashboard, with a chart, a jobs table, and a quick glance section. However, the information in the runbook dashboard is scoped entirely to a single runbook.
In the quick glance section, you can see when this runbook was last published and who published it. You can also see its authoring status: whether it is currently being edited or is done and published. And you can see whether the runbook has been configured to run on any schedules.
Because you are trying to figure out why this runbook has been suspended the last five times it has run, you need more information, so in the jobs table you click the Job Last Update time for each job and navigate to the associated Job Dashboard.
In the Job Dashboard you are presented with a summary of the particular job. You can see the name of the runbook and the current job status, plus who started the job, when it started, and when it was last updated (when it became suspended). You can also see the names and values of any input parameters to the runbook and any output from the runbook.
This is useful, but it hasn’t given us enough information to get to the bottom of the issue. Therefore, let’s click the History tab to view detailed information about each step of the runbook execution.
For every PowerShell workflow that runs, the workflow engine emits several streams that contain useful information. These streams are the Progress, Output, Warning, Error, Verbose, and Debug streams. By default in SMA the Progress, Verbose, and Debug streams are not stored for each job (because the data storage can become large, especially for the Progress streams); however, you can enable them in the runbook configuration page if you need the information for debugging and troubleshooting.
The History page for a job contains a list of all of the streams that were stored, sorted by the creation time of the stream. Thus, you can use this page to quickly drill down and see what happened in each step of the runbook as it ran. Because SMA can store this information and then retrieve it for you, consider authoring your runbooks to emit information into these streams to assist with troubleshooting.
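To make the relationship between stream storage and the per-job list concrete, here is a small sketch in plain Python. It is a model of the behavior described above, not SMA code: the record layout, the message text, and the `extra_streams` parameter are assumptions made for illustration. It treats Output, Warning, and Error as the streams kept by default, which follows from Progress, Verbose, and Debug being the ones that are not.

```python
from datetime import datetime

# Streams kept by default in this model. Progress, Verbose, and Debug are
# only kept when they are enabled for the runbook.
DEFAULT_STORED = {"Output", "Warning", "Error"}

# Hypothetical stream records emitted by one job, in arbitrary order.
# The message text is invented for illustration.
records = [
    {"stream": "Error", "time": datetime(2014, 1, 8, 9, 0, 5),
     "text": "Connection to the VMM server failed"},
    {"stream": "Verbose", "time": datetime(2014, 1, 8, 9, 0, 1),
     "text": "Looking up subscription"},
    {"stream": "Output", "time": datetime(2014, 1, 8, 9, 0, 2),
     "text": "Subscription found"},
    {"stream": "Progress", "time": datetime(2014, 1, 8, 9, 0, 3),
     "text": "50% complete"},
]


def job_history(records, extra_streams=()):
    """Keep only the stored streams and sort them by creation time."""
    stored = DEFAULT_STORED | set(extra_streams)
    kept = [record for record in records if record["stream"] in stored]
    return sorted(kept, key=lambda record: record["time"])


default_view = job_history(records)
verbose_view = job_history(records, extra_streams=["Verbose"])
print([record["stream"] for record in default_view])
print([record["stream"] for record in verbose_view])
```

With the defaults, the Verbose message that explains what the runbook was doing just before the error is missing from the list, which is why enabling extra streams can be worth the storage cost while you are debugging.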
If you want even more detailed information you can choose to view the details of any stream or view the source code to remind yourself what may be happening in the code to cause the problem. Even if the runbook has changed since this job was run, the source code used in this specific job will be shown.
As a savvy Automation operator, you have viewed the History page for each of the suspended jobs and have seen that the error is due to an invalid connection with the VMM server. You know from a quick view of the code that the Delete-Subscription runbook uses the VmmConnection setting, and then you remember that the password for this connection was recently changed. Now you understand what the problem is! The fix is easy: you edit the VmmConnection connection setting and update the password.
With that fix in place, you can re-start each of the suspended jobs, and they will start from the last checkpoint (more on checkpointing in a future blog post) and finish the work they were doing.
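If the idea of resuming from a checkpoint is new to you, the following sketch may help. It is a simplified model in plain Python, not how SMA or the workflow engine is implemented: the step functions, the `state` dictionary, and the `SuspendJob` exception are all hypothetical, and the sketch checkpoints after every step for simplicity, whereas where a real runbook checkpoints depends on how it was authored.

```python
class SuspendJob(Exception):
    """Raised by a step to suspend the job (a stand-in for a runbook error)."""


def run_job(steps, state):
    """Run steps in order, starting at the last checkpoint stored in `state`.

    state["checkpoint"] holds the index of the next step to run. It advances
    only after a step succeeds, so a resumed job skips completed work and
    retries the step that suspended it.
    """
    for index in range(state["checkpoint"], len(steps)):
        try:
            steps[index](state)
        except SuspendJob:
            state["status"] = "Suspended"
            return state
        state["checkpoint"] = index + 1  # record progress after each step
    state["status"] = "Completed"
    return state


# Hypothetical runbook steps; the second one needs a valid password.
def find_subscription(state):
    state["log"].append("found subscription")


def connect_to_vmm(state):
    if not state["password_valid"]:
        raise SuspendJob()
    state["log"].append("connected to VMM")


def delete_resources(state):
    state["log"].append("deleted resources")


steps = [find_subscription, connect_to_vmm, delete_resources]
state = {"checkpoint": 0, "status": "New", "log": [], "password_valid": False}

run_job(steps, state)             # suspends at the VMM connection step
status_after_first_run = state["status"]
state["password_valid"] = True    # the operator fixes the connection setting
run_job(steps, state)             # resumes from the checkpoint
print(state["status"], state["log"])
```

Notice that the first step runs only once: the resumed job picks up at the step that failed, not at the beginning.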
In Service Management Automation the tools are in place for you to completely manage your automation for WAP services. From creating new runbooks, to running jobs, to monitoring and troubleshooting issues, you can control the entire experience.