We in the Exchange product group get this question from time to time. The first thing we ask in response is always, “What was the customer impact?” In some cases, there is customer impact; these may indicate bugs that we are motivated to fix. However, in most cases there was no customer impact: a service restarted, but no one noticed. We have learned while operating the world’s largest Exchange deployment that it is fantastic when something is fixed before customers even notice. This is so desirable that we are willing to have a few extra service restarts as long as no customers are impacted.
You can see this same philosophy at work in our approach to database failovers since Exchange 2007. The mantra we have come to repeat is, “Stuff breaks, but the user experience doesn’t!” User experience is our number one priority at all times. Individual service uptime on a server is a less important goal, as long as the user experience remains satisfactory.
However, there are cases where Managed Availability cannot fix the problem. In cases like these, Exchange provides a huge amount of information about what the problem might be: hundreds of things are checked and tested every minute. Usually, Get-HealthReport and Get-ServerHealth will be sufficient to find the problem, but this blog post will walk you through tracing an automatic recovery action all the way down to the results of the individual Probes that triggered it.
Every time Managed Availability takes a recovery action, such as restarting a service or failing over a database, it logs an event in the Microsoft.Exchange.ManagedAvailability/RecoveryActionResults crimson channel. Event 500 indicates that a recovery action has begun; event 501 indicates that it has completed. These events can be viewed in the MMC Event Viewer, but we usually find it more useful to work with them in PowerShell. All of the Managed Availability recovery actions can be collected with a simple command:
$RecoveryActionResultsEvents = Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionResults
We can use the events in this format, but it is easier to work with the event properties if we use PowerShell’s native XML format:
$RecoveryActionResultsXML = ($RecoveryActionResultsEvents | Foreach-object -Process {[XML]$_.toXml()}).event.userData.eventXml
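With the events in XML form, a quick way to see which Responders have been taking actions on the server is to group the completed actions by requestor. This is a sketch using the same properties the commands below rely on:

```powershell
# Summarize completed recovery actions by the Responder that requested them
$RecoveryActionResultsXML |
    Where-Object {$_.State -eq "Finished"} |
    Group-Object RequestorName |
    Sort-Object Count -Descending |
    Format-Table -AutoSize Count,Name
```

A Responder that appears at the top of this list repeatedly is a good candidate for the deeper investigation that follows.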
Some of the useful properties of these Recovery Action events are State, ResourceName, StartTime, EndTime, and RequestorName.
So for example, if you wanted to know why MSExchangeRepl was restarted on your server around 9:30PM, you could run a command like this:
$RecoveryActionResultsXML | Where-Object {$_.State -eq "Finished" -and $_.ResourceName -eq "MSExchangeRepl" -and $_.EndTime -like "2013-06-12T21*"} | ft -AutoSize StartTime,RequestorName
This results in the following output:
[Table output showing the StartTime of the recovery action and a RequestorName of ServiceHealthMSExchangeReplEndpointRestart]
The RequestorName property indicates the name of the Responder that took the action. In this case, it was ServiceHealthMSExchangeReplEndpointRestart. Often, the responder name will give you an indication of the problem. Other times, you will want more details.
Monitors are the central part of Managed Availability. They are the primary means, through Get-ServerHealth and Get-HealthReport, by which an administrator can learn the health of a server. Recall that a Health Set is a grouping of related Monitors; this is why much of our troubleshooting documentation is focused on these objects. It will often be useful to know which Monitors and Health Sets are repeatedly unhealthy in your environment.
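Before digging into the crimson channels, the built-in cmdlets can surface the currently unhealthy Monitors at a glance. A minimal sketch, assuming you run it from the Exchange Management Shell:

```powershell
# List every Monitor currently reporting Unhealthy, grouped by Health Set
Get-ServerHealth -Identity <Server> |
    Where-Object {$_.AlertValue -eq "Unhealthy"} |
    Sort-Object HealthSetName |
    Format-Table -AutoSize HealthSetName,Name,AlertValue
```

Running this on a schedule and keeping the output is a lightweight way to spot Health Sets that flap repeatedly.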
Every time the Health Manager service starts, it logs the definition of each Responder to the Microsoft.Exchange.ActiveMonitoring/ResponderDefinition crimson channel. We can use these definitions to look up the Responder we found in the last step by its RequestorName. First, we need to collect the Responders that are defined:
$DefinedResponders = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml
One of these Responder Definitions will have a Name matching the Recovery Action's RequestorName. The Monitor that controls the Responder we are interested in is given by the AlertMask property of that definition, so Name and AlertMask are the two most useful Responder Definition properties here.
To get the Monitor for the ServiceHealthMSExchangeReplEndpointRestart Responder, you run:
$DefinedResponders | ? {$_.Name -eq "ServiceHealthMSExchangeReplEndpointRestart"} | ft -a Name,AlertMask
This results in the following output:
Name                                       AlertMask
----                                       ---------
ServiceHealthMSExchangeReplEndpointRestart ServiceHealthMSExchangeReplEndpointMonitor
Many Monitor names will give you an idea of what to look for. In this case, the ServiceHealthMSExchangeReplEndpointMonitor Monitor does not tell you much more than the Responder name did. The TechNet article on troubleshooting the DataProtection health set lists this Monitor and suggests running Test-ReplicationHealth. However, you can also get the exact error messages of this Monitor's Probes with a couple more commands.
Remember that Monitors have their definitions written to the Microsoft.Exchange.ActiveMonitoring/MonitorDefinition crimson channel, so you can get them in the same way you got the Responder definitions in the last step:
$DefinedMonitors = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[xml]$_.toXml()}).event.userData.eventXml
Two useful properties of a Monitor definition are Name and SampleMask, which identifies the Probes that feed the Monitor.
To get the SampleMask for the identified Monitor, you can run:
($DefinedMonitors | ? {$_.Name -eq 'ServiceHealthMSExchangeReplEndpointMonitor'}).SampleMask
This results in the following output:
ServiceHealthMSExchangeReplEndpointProbe
Now that we know which Probes to look for, we can search the Probes' definition channel. Useful Probe Definition properties include Name and TargetResource.
To get definitions of this Monitor’s Probes, you can run:
(Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like "ServiceHealthMSExchangeReplEndpointProbe*"} | ft -a Name, TargetResource
This results in the following output:
[Table output listing the Name and TargetResource of the three matching Probe definitions, each named beginning with ServiceHealthMSExchangeReplEndpointProbe, including the one for the RPC endpoint of MSExchangeRepl]
Remember, not all Monitors use synthetic transactions via Probes. See this blog post for the other ways Monitors collect their information.
This Monitor has three Probes that can cause it to become Unhealthy. Notice that each Probe's name begins with the Monitor's SampleMask and is then differentiated. When you retrieve the Probe Results in the next step, the Probes will also have the TargetResource in their ServiceName.
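The whole chain can be scripted end to end. This sketch, assuming the same channels and properties shown above, takes a Responder name from a Recovery Action and walks Responder, then Monitor, then Probe definitions:

```powershell
# Hypothetical helper: from a Responder name, find its Monitor and the Probes that feed it
$Server = "<Server>"
$Responder = "ServiceHealthMSExchangeReplEndpointRestart"

# Responder definition -> AlertMask gives the controlling Monitor
$monitorName = ((Get-WinEvent -ComputerName $Server -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition |
    % {[xml]$_.ToXml()}).event.userData.eventXml |
    ? {$_.Name -eq $Responder}).AlertMask

# Monitor definition -> SampleMask gives the Probe name prefix
$sampleMask = ((Get-WinEvent -ComputerName $Server -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition |
    % {[xml]$_.ToXml()}).event.userData.eventXml |
    ? {$_.Name -eq $monitorName}).SampleMask

# Probe definitions matching the SampleMask
(Get-WinEvent -ComputerName $Server -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition |
    % {[xml]$_.ToXml()}).event.userData.eventXml |
    ? {$_.Name -like "$sampleMask*"} |
    ft -a Name,TargetResource
```

This is just the three lookups from the previous steps strung together, so you can reuse it for any Responder you find in the RecoveryActionResults channel.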
We now know all the Probes that could have failed, but we don't yet know which of them did, or why.
There are many Probes and they execute often, so the channel where they are logged (Microsoft.Exchange.ActiveMonitoring/ProbeResult) generates a lot of data. There will often be only a few hours of data, but the Probes we are interested in will probably have a few hundred Result entries. The Probe Result properties you are most likely to be interested in for troubleshooting include ResultName, ResultType (a value of 4 indicates a failure), Error, Exception, ExecutionContext, and the execution start and end times.
Some Probes may use some of the other available fields to provide additional data about failures.
We can use XPath to filter the large number of events to just the ones we are interested in; those with the ResultName we identified in the last step and with a ResultType of 4 indicating that they failed:
$replEndpointProbeResults = (Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='ServiceHealthMSExchangeReplEndpointProbe/RPC/MSExchangeRepl'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml
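Because this channel is so busy, it can also help to bound the query by time. Event log XPath supports a timediff function (milliseconds relative to now), so a sketch that limits the same filter to the last hour looks like this:

```powershell
# Same failed-Probe filter, additionally limited to events from the last hour
# (timediff(@SystemTime) is measured in milliseconds)
$xpath = "*[UserData[EventXML[ResultName='ServiceHealthMSExchangeReplEndpointProbe/RPC/MSExchangeRepl'][ResultType='4']]] and *[System[TimeCreated[timediff(@SystemTime) <= 3600000]]]"
Get-WinEvent -ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath $xpath
```

Narrowing the window this way keeps the query fast and the result set focused on the incident you are investigating.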
To get a nice graphical view of the Probe’s errors, you can run:
$replEndpointProbeResults | select -Property *Time,Result*,Error*,*Context,State* | Out-GridView
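If you only want the most recent failure's error text rather than the full grid, a quick sketch (assuming the ExecutionStartTime property noted above):

```powershell
# Show the error message from the most recent failed Probe Result
$replEndpointProbeResults |
    Sort-Object ExecutionStartTime -Descending |
    Select-Object -First 1 -ExpandProperty Error
```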
In this case, the full error message for both Probe Results suggests making sure the MSExchangeRepl service is running. That is indeed the problem: for this scenario, I restarted the service manually.
This article has been a detailed look at the incredible amount of information you have access to about the health of Exchange servers. Hopefully, you will not often need it! In most cases, the alerts will be enough notification and the included cmdlets will be sufficient for investigation.
Managed Availability is built and hardened at scale, and we continuously analyze the same events described in this article so that we can either fix root causes or write Responders that fix more problems before users are impacted. In those cases where you do need to investigate a problem in detail, we hope this post is a good starting point.
Abram Jackson