What Did Managed Availability Just Do To This Service?

The_Exchange_Team · Jun 13 2013

We in the Exchange product group get this question from time to time. The first thing we ask in response is always, “What was the customer impact?” In some cases, there is customer impact; these may indicate bugs that we are motivated to fix. However, in most cases there was no customer impact: a service restarted, but no one noticed. We have learned while operating the world’s largest Exchange deployment that it is fantastic when something is fixed before customers even notice. This is so desirable that we are willing to have a few extra service restarts as long as no customers are impacted.

You can see this same philosophy at work in our approach to database failovers since Exchange 2007. The mantra we have come to repeat is, “Stuff breaks, but the user experience doesn’t!” User experience is our number one priority at all times. Individual service uptime on a server is a less important goal, as long as the user experience remains satisfactory.

However, there are cases where Managed Availability cannot fix the problem. In cases like these, Exchange provides a huge amount of information about what the problem might be. Hundreds of things are checked and tested every minute. Usually, Get-HealthReport and Get-ServerHealth will be sufficient to find the problem, but this blog post will walk you through getting the full details from an automatic recovery action to the results of all the probes by:

Finding the Managed Availability Recovery Actions that have been executed for a given service.
Determining the Monitor that triggered the Responder.
Retrieving the Probes that the Monitor uses.
Viewing any error messages from the Probes.

Finding Recovery Actions

Every time Managed Availability takes a recovery action, such as restarting a service or failing over a database, it logs an event in the Microsoft-Exchange-ManagedAvailability/RecoveryActionResults crimson channel. Event 500 indicates that a recovery action has begun. Event 501 indicates that the action has completed. These events can be viewed in the MMC Event Viewer, but we usually find it more useful to use PowerShell. All of these Managed Availability recovery actions can be collected in PowerShell with a simple command:

$RecoveryActionResultsEvents = Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionResults

We can use the events in this format, but it is easier to work with the event properties if we use PowerShell’s native XML format:

$RecoveryActionResultsXML = ($RecoveryActionResultsEvents | Foreach-object -Process {[XML]$_.toXml()}).event.userData.eventXml
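To see what that conversion yields, here is a minimal sketch that applies the same property path to a single, hand-written XML document. The sample XML is hypothetical and heavily trimmed (real events carry XML namespaces and many more fields); only the element names Id, State, and ResourceName come from this article.

```powershell
# Hypothetical, trimmed stand-in for the output of one event's ToXml() call.
$sampleEventXml = @'
<Event>
  <System><EventID>501</EventID></System>
  <UserData>
    <EventXML>
      <Id>RestartService</Id>
      <State>Finished</State>
      <ResourceName>MSExchangeRepl</ResourceName>
    </EventXML>
  </UserData>
</Event>
'@

# Same path as the command above: Event -> UserData -> EventXML.
# Child elements of EventXML then behave like ordinary properties.
$action = ([xml]$sampleEventXml).Event.UserData.EventXML

$action.Id            # the action that was taken
$action.ResourceName  # the object the action affected
```

When the same path is applied to a whole collection, as in the command above, PowerShell 3.0 and later enumerate the members automatically, so the result is one EventXML element per event.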

Some of the useful properties for this Recovery Action event are:

Id: The action that was taken. Common values are RestartService, RecycleAppPool, ComponentOffline, or ServerFailover.
State: Whether the action has started (event 500) or finished (event 501).
ResourceName: The object that was affected by the action. This will be the name of a service for RestartService actions, or the name of a server for server-level actions.
EndTime: The time the action completed.
Result: Whether the action succeeded or not.
RequestorName: The name of the Responder that took the action.

So for example, if you wanted to know why MSExchangeRepl was restarted on your server around 9:30PM, you could run a command like this:

$RecoveryActionResultsXML | Where-Object {$_.State -eq "Finished" -and $_.ResourceName -eq "MSExchangeRepl" -and $_.EndTime -like "2013-06-12T21*"} | ft -AutoSize StartTime,RequestorName

This results in the following output:

`StartTime`	`RequestorName`
`---------`	`-------------`
`2013-05-12T21:49:18.2113618Z`	`ServiceHealthMSExchangeReplEndpointRestart`

The RequestorName property indicates the name of the Responder that took the action. In this case, it was ServiceHealthMSExchangeReplEndpointRestart. Often, the responder name will give you an indication of the problem. Other times, you will want more details.
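If the channel holds a lot of events, it may be cheaper to narrow the query when the events are read instead of filtering afterwards. The sketch below builds a -FilterHashtable argument for Get-WinEvent that asks only for event 501 within a time window. The helper name New-RecoveryActionFilter is made up for illustration. Note one assumption that matters: the StartTime and EndTime keys of the filter apply to the time the event was logged, which is not necessarily identical to the EndTime property inside the event.

```powershell
# Hypothetical helper: builds the filter only, so it runs on any machine.
function New-RecoveryActionFilter {
    param(
        [Parameter(Mandatory)] [datetime] $Start,
        [Parameter(Mandatory)] [datetime] $End
    )
    @{
        LogName   = 'Microsoft-Exchange-ManagedAvailability/RecoveryActionResults'
        Id        = 501      # 501 = recovery action completed
        StartTime = $Start   # compared against the event's logged time
        EndTime   = $End
    }
}

$filter = New-RecoveryActionFilter -Start '2013-06-12 21:00' -End '2013-06-12 22:00'

# Against a real server (<Server> is a placeholder, as elsewhere in this post):
# Get-WinEvent -ComputerName <Server> -FilterHashtable $filter
```

The events returned this way can be converted to XML and inspected exactly as shown earlier.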

Finding the Monitor that Triggers a Responder

Monitors are the central part of Managed Availability. They are the primary means, through Get-ServerHealth and Get-HealthReport, by which an administrator can learn the health of a server. Recall that a Health Set is a grouping of related Monitors. This is why much of our troubleshooting documentation is focused on these objects. It will often be useful to know what Monitors and Health Sets are repeatedly unhealthy in your environment.

Every time the Health Manager service starts, it logs events to the Microsoft-Exchange-ActiveMonitoring/ResponderDefinition crimson channel. We can use these events to look up the properties of the Responder whose name appeared in the RequestorName property in the last step. First, we need to collect the Responders that are defined:

$DefinedResponders = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml

One of these Responder Definitions will match the Recovery Action’s RequestorName. The Monitor that controls the Responder we are interested in is defined by the AlertMask property of that Definition. Here are some of the useful Responder Definition properties:

TypeName: The full code name of the recovery action that will be taken when this Responder executes.
Name: The name of the Responder.
TargetResource: The object this Responder will act on.
AlertMask: The Monitor for this Responder.
WaitIntervalSeconds: The minimum amount of time to wait before this Responder can be executed again. There are other forms of throttling that will also affect this Responder.

To get the Monitor for the ServiceHealthMSExchangeReplEndpointRestart Responder, you run:

$DefinedResponders | ? {$_.Name -eq "ServiceHealthMSExchangeReplEndpointRestart"} | ft -a Name,AlertMask

This results in the following output:

`Name`	`AlertMask`
`----`	`---------`
`ServiceHealthMSExchangeReplEndpointRestart`	`ServiceHealthMSExchangeReplEndpointMonitor`
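When several recovery actions are of interest, the same lookup can be done for all of them at once. The following is a minimal sketch, not part of Exchange: the function name Get-MonitorForRecoveryAction is hypothetical, and it assumes that AlertMask holds the Monitor name directly, as it does in the example above. The sample objects only mimic the properties used in this article.

```powershell
# Hypothetical helper: pairs each recovery action with the Monitor that
# drives its Responder, using RequestorName -> Name -> AlertMask.
function Get-MonitorForRecoveryAction {
    param(
        [Parameter(Mandatory)] [object[]] $RecoveryActions,
        [Parameter(Mandatory)] [object[]] $ResponderDefinitions
    )
    # Index definitions by name. Definitions are logged on every Health
    # Manager start, so a later duplicate simply overwrites an earlier one.
    $byName = @{}
    foreach ($definition in $ResponderDefinitions) {
        $byName[$definition.Name] = $definition
    }
    foreach ($action in $RecoveryActions) {
        $monitor = $null
        if ($action.RequestorName -and $byName.ContainsKey($action.RequestorName)) {
            $monitor = $byName[$action.RequestorName].AlertMask
        }
        [pscustomobject]@{
            ResourceName = $action.ResourceName
            Responder    = $action.RequestorName
            Monitor      = $monitor   # $null when no definition matched
        }
    }
}

# Stand-in objects built from the values shown in this article.
$sampleActions = @([pscustomobject]@{
    ResourceName  = 'MSExchangeRepl'
    RequestorName = 'ServiceHealthMSExchangeReplEndpointRestart'
})
$sampleResponders = @([pscustomobject]@{
    Name      = 'ServiceHealthMSExchangeReplEndpointRestart'
    AlertMask = 'ServiceHealthMSExchangeReplEndpointMonitor'
})
$pairs = Get-MonitorForRecoveryAction -RecoveryActions $sampleActions -ResponderDefinitions $sampleResponders
```

On a real server you would pass $RecoveryActionResultsXML and $DefinedResponders in place of the sample objects.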

Many Monitor names will give you an idea of what to look for. In this case, the ServiceHealthMSExchangeReplEndpointMonitor Monitor does not tell you much more than the Responder name did. The TechNet article on troubleshooting the DataProtection Health Set lists this Monitor and suggests running Test-ReplicationHealth. However, you can also get the exact error messages of the Probes for this Monitor with a couple more commands.

Finding the Probes for a Monitor

Monitors have their definitions written to the Microsoft-Exchange-ActiveMonitoring/MonitorDefinition crimson channel, so you can retrieve them in the same way as the Responder definitions in the last step. You can run:

$DefinedMonitors = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[xml]$_.toXml()}).event.userData.eventXml

Some useful properties of a Monitor definition are:

Name: The name of this Monitor. This is the same name reported by Get-ServerHealth.
ServiceName: The name of the Health Set for this Monitor.
SampleMask: The substring that all Probes for this Monitor will have in their names.
IsHaImpacting: Whether this Monitor should be included when HaImpactingOnly is specified by Get-ServerHealth or Get-HealthReport.

To get the SampleMask for the identified Monitor, you can run:

($DefinedMonitors | ? {$_.Name -eq 'ServiceHealthMSExchangeReplEndpointMonitor'}).SampleMask

This results in the following output:

ServiceHealthMSExchangeReplEndpointProbe

Now that we know what Probes to look for, we can search the Probes’ definition channel. Useful properties for Probe Definitions are:

Name: The name of the Probe. This will begin with the SampleMask of the Probe’s Monitor.
ServiceName: The Health Set for this Probe.
TargetResource: The object this Probe is validating. This is appended to the Name of the Probe when it is executed to become the ResultName of a Probe Result.
RecurrenceIntervalSeconds: How often this Probe executes.
TimeoutSeconds: How long this Probe should wait before failing.

To get definitions of this Monitor’s Probes, you can run:

(Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "ServiceHealthMSExchangeReplEndpointProbe*"} | ft -a Name, TargetResource

This results in the following output:

`Name`	`TargetResource`
`----`	`--------------`
`ServiceHealthMSExchangeReplEndpointProbe/ServerLocator`	`MSExchangeRepl`
`ServiceHealthMSExchangeReplEndpointProbe/RPC`	`MSExchangeRepl`
`ServiceHealthMSExchangeReplEndpointProbe/TCP`	`MSExchangeRepl`

Remember, not all Monitors use synthetic transactions via Probes. See this blog post for the other ways Monitors collect their information.

This Monitor has three Probes that can cause it to become Unhealthy. Each Probe’s name begins with the Monitor’s SampleMask, followed by a suffix that distinguishes it from the others. When getting the Probe Results in the next step, the Probes will also have the TargetResource appended to their ResultName.
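As a small illustration of that naming rule, the sketch below composes the name under which a Probe’s results are logged from its definition’s Name and TargetResource. The function name Get-ProbeResultName is hypothetical, and the handling of an empty TargetResource is an assumption, not something this article documents.

```powershell
# Hypothetical helper: Probe definition Name + "/" + TargetResource.
function Get-ProbeResultName {
    param(
        [Parameter(Mandatory)] [string] $ProbeName,
        [string] $TargetResource
    )
    # Assumption: a Probe without a target resource keeps its plain name.
    if ([string]::IsNullOrEmpty($TargetResource)) { return $ProbeName }
    "$ProbeName/$TargetResource"
}

$rpcResultName = Get-ProbeResultName -ProbeName 'ServiceHealthMSExchangeReplEndpointProbe/RPC' -TargetResource 'MSExchangeRepl'
$rpcResultName
```

For the RPC Probe listed above, this yields ServiceHealthMSExchangeReplEndpointProbe/RPC/MSExchangeRepl.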

We now know all the Probes that could have failed, but we don’t yet know which did or why.

Getting Probe Error Messages

There are many Probes and they execute often, so the channel where they are logged (Microsoft-Exchange-ActiveMonitoring/ProbeResult) generates a lot of data. The channel will often hold only a few hours of data, but the Probes we are interested in will probably have a few hundred Result entries. Here are some of the Probe Result properties you may be interested in for troubleshooting:

ServiceName: The Health Set of this Probe.
ResultName: The Name of this Probe, including the Monitor’s SampleMask, an identifier of the code this Probe executes, and the resource it verifies. The target resource is appended to the Probe’s name we found in the previous step. In this example, we append /MSExchangeRepl to get ServiceHealthMSExchangeReplEndpointProbe/RPC/MSExchangeRepl.
Error: The error returned by this Probe, if it failed.
Exception: The callstack of the error, if it failed.
ResultType: An integer that indicates one of these values:

1: Timeout
2: Poisoned
3: Succeeded
4: Failed
5: Quarantined
6: Rejected

ExecutionStartTime: When the Probe started.
ExecutionEndTime: When the Probe completed.
ExecutionContext: Additional information about the Probe’s execution.
FailureContext: Additional information about the Probe’s failure.

Some Probes may use some of the other available fields to provide additional data about failures.
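Because ResultType is stored as a bare integer, it can help to translate it when reading results. This is a minimal sketch using only the six values listed above; the function and variable names are made up for illustration, and any value outside the list is reported as unknown instead of being guessed.

```powershell
# ResultType values as listed in this article.
$ProbeResultTypeNames = @{
    1 = 'Timeout'
    2 = 'Poisoned'
    3 = 'Succeeded'
    4 = 'Failed'
    5 = 'Quarantined'
    6 = 'Rejected'
}

# Hypothetical helper: accepts the integer or its string form, since
# values read from event XML arrive as strings.
function Get-ProbeResultTypeName {
    param([Parameter(Mandatory)] $ResultType)
    $key = [int]$ResultType
    if ($ProbeResultTypeNames.ContainsKey($key)) {
        $ProbeResultTypeNames[$key]
    } else {
        "Unknown ($ResultType)"
    }
}

Get-ProbeResultTypeName '4'
```

With Probe Results already loaded as XML, a calculated property such as @{n='Outcome'; e={Get-ProbeResultTypeName $_.ResultType}} in a select statement would show the name next to each result.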

We can use XPath to filter the large number of events down to just the ones we are interested in: those with the ResultName we identified in the last step and with a ResultType of 4, indicating that they failed:

$replEndpointProbeResults = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='ServiceHealthMSExchangeReplEndpointProbe/RPC/MSExchangeRepl'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
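To repeat this query for the other Probes, the XPath string can be generated instead of typed by hand. A minimal sketch follows; New-ProbeResultXPath is a hypothetical name, and the function assumes the ResultName contains no single quotes, since it does no escaping.

```powershell
# Hypothetical helper: builds the same XPath shape used above for any
# ResultName. Defaults to ResultType 4 (Failed).
function New-ProbeResultXPath {
    param(
        [Parameter(Mandatory)] [string] $ResultName,
        [int] $ResultType = 4
    )
    "*[UserData[EventXML[ResultName='$ResultName'][ResultType='$ResultType']]]"
}

$xpath = New-ProbeResultXPath -ResultName 'ServiceHealthMSExchangeReplEndpointProbe/TCP/MSExchangeRepl'
$xpath

# Against a real server (<Server> is a placeholder):
# Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath $xpath
```

Passing a different -ResultType, for example 1 for Timeout, reuses the same filter for other outcomes.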

To get a nice graphical view of the Probe’s errors, you can run:

$replEndpointProbeResults | select -Property *Time,Result*,Error*,*Context,State* | Out-GridView

In this case, the full error message for both Probe Results suggests making sure the MSExchangeRepl service is running. This actually is the problem, as for this scenario I restarted the service manually.

Summary

This article is a detailed look at the large amount of information Exchange makes available about the health of your servers. Hopefully, you will not often need it! In most cases, the alerts will be enough notification and the included cmdlets will be sufficient for investigation.

Managed Availability is built and hardened at scale, and we continuously analyze these same events collected in this article so that we can either fix root causes or write Responders to fix more problems before users are impacted. In those cases where you do need to investigate a problem in detail, we hope this post is a good starting point.

Abram Jackson

Jun 14 2013

Thanks for the info

I hope somebody will write a gui for this:) like the gui for RBAC

Jul 22 2013

When do you plan to update your troubleshooting documentation? It's lacking information on a lot of Health Sets right now.

Jul 22 2013

Awesome article Abram! Keep up the good work.

Dame

thelifestrategist.wordpress.com
