Exchange 2013/2016 Monitoring Mailboxes
Note 10/2022: This article also applies to Exchange Server 2019.

Introduction

Exchange Server 2013 introduced a new feature called Managed Availability, a built-in monitoring system with self-recovery capabilities. Managed Availability performs continuous tests (probes) that simulate end-user actions to detect possible problems with Exchange components or their dependencies. If probes fail, it performs gradual, simple recovery actions to bring the affected component back to a healthy state. It uses a special type of mailbox, called a monitoring mailbox (or health mailbox), to run its end-user-style tests. The life cycle of monitoring mailboxes is handled entirely by Managed Availability components. In this post, we'll see how Managed Availability takes care of monitoring mailboxes, what the best practices are for keeping monitoring mailboxes happy, and some related troubleshooting.

Monitoring Mailboxes

Managed Availability is implemented by the Microsoft Exchange Health Manager service running on every Exchange Server 2013 server, regardless of role. The Microsoft Exchange Health Manager service is responsible for creating and maintaining monitoring mailboxes. Let's take a look at how the Health Manager creates them!

How do we create monitoring mailboxes?

The Microsoft Exchange Health Manager service runs as two processes, MSExchangeHMHost.exe and MSExchangeHMWorker.exe (let's call the latter the HM worker). At startup, the HM worker process checks for the availability of monitoring mailboxes and creates them as needed. Starting with Exchange Server 2013 Cumulative Update 1, the accounts for monitoring mailboxes are created under the following container in the domain where the Exchange server resides:

<ADdomain>\Microsoft Exchange System Objects\Monitoring Mailboxes

The logic the HM worker process uses to detect and create monitoring mailboxes depends on the Exchange Cumulative Update (CU) installed, the Exchange roles installed on the server, and the mailbox databases present.
The following logic was used to create monitoring mailboxes on Exchange Server 2013 servers from RTM through Cumulative Update 5: one monitoring mailbox per mailbox database copy, plus one per CAS server. Here's an example of the monitoring mailboxes created on an Exchange Server 2013 SP1 server that hosts both the CAS and Mailbox roles:

[PS] C:\>Get-Mailbox -Monitoring | ft displayname,whencreated

DisplayName                                      WhenCreated
-----------                                      -----------
HealthMailboxb285a119be6649b3a89574f078e947f5    11/10/2014 9:07:29 AM
HealthMailbox60d8a8d1285e41bfa5ce1ef1fb93d14e    11/10/2014 9:07:36 AM

The display name of the monitoring mailbox created for a database copy contained the GUID of the mailbox database for which it was created, and the display name of the monitoring mailbox created for a CAS server contained the GUID of the Exchange server for which it was created.

The following logic is used to create monitoring mailboxes for Exchange Server 2013 Cumulative Update 6 and later: one monitoring mailbox for each mailbox database copy hosted on the Mailbox role, plus ten monitoring mailboxes for each CAS server.

The following naming convention is used for the display name of the monitoring mailbox created for a database: "HealthMailbox-" + host name of server + "-" + database name. At startup, the HM worker checks the names of the databases present on the server and then checks for the presence of monitoring mailboxes with display names as explained above. If it doesn't find a monitoring mailbox for a specific database, it creates a new one. That means that if you rename DB1 to DB2, the HM worker will create a new monitoring mailbox for DB2 at its next startup.

The following naming convention is used for the display names of the monitoring mailboxes created for a CAS server: "HealthMailbox-" + host name of the CAS server + "-001" through "-010". We attempt to distribute the monitoring mailboxes created for CAS servers across the available mailbox databases.
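As an illustration only (the real logic lives inside the HM worker), the CU6-and-later naming convention above can be sketched in PowerShell; the server and database names here are made up:

```powershell
# Hypothetical server and database names, for illustration only
$server = "EXCH1"
$databases = @("DB1", "DB2")

# Display names the HM worker would look for per database (CU6+ convention)
$dbMailboxNames = $databases | ForEach-Object { "HealthMailbox-$server-$_" }

# Display names of the ten CAS monitoring mailboxes on this server
$casMailboxNames = 1..10 | ForEach-Object { "HealthMailbox-$server-{0:D3}" -f $_ }
```

This is only a sketch of the naming pattern, not the creation logic itself; the HM worker also handles account creation, password generation, and distribution across databases.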
Let's make this real: imagine that you have only one Exchange server running Exchange 2013 CU6 or newer in the organization (named EXCH1), hosting both the CAS and Mailbox roles with a single mailbox database. In that case, 11 monitoring mailboxes will be created: one for the database plus ten for the CAS. The Health Manager service will create more mailboxes according to the logic explained above as you add more databases or server roles.

Password resets

The HM worker is responsible for maintaining the passwords for monitoring mailboxes. It uses a complex algorithm to generate the password used for a monitoring mailbox. The password for a monitoring mailbox is reset under the following conditions:

- A new health mailbox is being created
- The HM worker process starts and is not able to retrieve the existing password for the monitoring mailbox
- Any other scenario where the HM worker is not able to get hold of the existing password for the monitoring mailbox

Best practices

Here are some best practices regarding management of the user accounts associated with monitoring mailboxes, as well as the mailboxes themselves:

- Do not apply third-party customized password policies to the user accounts of monitoring mailboxes
- Exclude monitoring mailboxes from user account lockout policies
- Do not move the user accounts out of the Monitoring Mailboxes container
- Do not change the user account properties, such as restricting password changes
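To sanity-check the count in your own environment, you can compare the monitoring mailboxes whose display names reference a given server against the expectation above (one per database copy, plus ten for the CAS). This is a sketch that assumes an Exchange Management Shell session and CU6 or later naming; EXCH1 is a placeholder server name:

```powershell
# Placeholder server name; run from the Exchange Management Shell
$server = "EXCH1"

# Monitoring mailboxes whose display name references this server (CU6+ naming)
$mailboxes = Get-Mailbox -Monitoring | Where-Object { $_.DisplayName -like "*-$server-*" }
$mailboxes | Format-Table DisplayName, Database -AutoSize

# Expected count: database copies hosted on the server, plus ten for the CAS
$mailboxes.Count
```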
- Do not change AD permission inheritance
- Do not move the monitoring mailboxes between mailbox databases
- Do not apply mailbox quotas to monitoring mailboxes
- If applying a retention policy, ensure the data within the monitoring mailbox is retained for a minimum of 30 days before being deleted

Since the HM worker handles password resets for monitoring mailboxes, in a large environment it is normal to see increased password reset traffic for monitoring mailbox accounts; note that doing any of the things above might increase the frequency of those resets.

If you see the mailbox size increasing significantly for a specific monitoring mailbox, you can simply disable that mailbox. The HM worker will create a new one at its next startup.

Common tasks with monitoring mailboxes

How to list monitoring mailboxes

The Get-Mailbox cmdlet provides a special parameter, -Monitoring, to list only the monitoring mailboxes. Here are some examples.

To list all monitoring mailboxes present in the organization:

Get-Mailbox -Monitoring

To list monitoring mailboxes present on a database:

Get-Mailbox -Monitoring -Database <database name>

However, the mailboxes listed by the second command may not be associated with the server on which the database is hosted. As explained in the creation logic, the name of the Client Access server or the mailbox database is what matters, both when searching for the associated monitoring mailbox and when creating one.
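Here is a hedged sketch of the "disable it and let it be re-created" approach mentioned above. The mailbox name is a placeholder (identify the real one with Get-Mailbox -Monitoring first), and restarting the Health Manager service makes the HM worker run its startup checks immediately instead of waiting for the next service start:

```powershell
# Placeholder monitoring mailbox name; find yours with Get-Mailbox -Monitoring
Get-Mailbox -Monitoring "HealthMailbox-EXCH1-DB1" | Disable-Mailbox -Confirm:$false

# Restart the Health Manager service so the HM worker re-creates the mailbox now
Restart-Service MSExchangeHM
```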
Use the following command to list the monitoring mailboxes associated with a specific server:

Get-Mailbox -Monitoring | ?{$_.DisplayName -like "*-<servername>-*"}

Example:

Get-Mailbox -Monitoring | ?{$_.DisplayName -like "*-exch2-*"} | ft name,displayname,database

Use the following command to list the monitoring mailbox associated with a specific database:

Get-Mailbox -Monitoring | ?{$_.DisplayName -like "*-<name of database>"}

Example:

Get-Mailbox -Monitoring | ?{$_.DisplayName -like "*-db2exch2"} | ft name,displayname,database,servername

Troubleshooting tips

Here are some troubleshooting methods for monitoring mailboxes.

How to re-create monitoring mailboxes (NOT considered regular maintenance!)

Removing a mailbox database doesn't clean up the AD user accounts associated with its monitoring mailboxes, which can result in orphaned AD user accounts. This happens because of a deny permission inherited on the Monitoring Mailboxes container. KB article 3046530 has details on this, as well as the workaround to resolve it. If there are too many orphaned monitoring mailbox accounts, you may want to re-create them. Steps:

1) Make sure the "Monitoring Mailboxes" container is present:
   - Open Active Directory Users & Computers
   - Click View and select "Advanced Features"
   - Browse to Microsoft Exchange System Objects
   - Verify the presence of the "Monitoring Mailboxes" container

If the Monitoring Mailboxes container is missing, make sure you have Exchange Server 2013 CU1 or later installed, and perform PrepareAD with the Exchange Server 2013 version installed.

2) Stop the "Microsoft Exchange Health Manager" service on all Exchange Server 2013 servers.

3) Open the Exchange Management Shell and use the following command to disable the existing health mailboxes:

Get-Mailbox -Monitoring | Disable-Mailbox

4) Go back to Active Directory Users & Computers, right-click the domain, and search for "HealthMailbox".

5) Delete the health mailbox user accounts.
6) Wait for AD replication, or force AD replication.

7) Start the "Microsoft Exchange Health Manager" service on all Exchange Server 2013 servers.

Bhalchandra Atre

Updates 6/2/15: Updated best practices to include information on aging out data via retention policies.

Managed Availability Responders
Responders are the final critical part of Managed Availability. Recall that Probes are how Monitors obtain accurate information about the experience your users are receiving. Responders are what the Monitors use to attempt to fix the situation. Once they pass throttling, they launch a recovery action such as restarting a service, resetting an IIS app pool, or anything else the developers of Exchange have found often resolves the symptoms. Refer to the Responder Timeline section of the Managed Availability Monitors article for information about when Responders are executed.

Definitions and Results

Just like Probes and Monitors, Responders have an event log channel for their definitions and another for their results. The definitions can be found in Microsoft-Exchange-ActiveMonitoring/ResponderDefinition. Some of the important properties are:

- TypeName: The full code name of the recovery action that will be taken when this Responder executes.
- Name: The name of the Responder.
- ServiceName: The Health Set this Responder is part of.
- TargetResource: The object this Responder will act on.
- AlertMask: The Monitor for this Responder.
- ThrottlePolicyXml: How often this Responder is allowed to execute. I'll go into more detail in the next section.

The results can be found in Microsoft-Exchange-ActiveMonitoring/ResponderResult. Responders output a result on a recurring basis whether or not the Monitor indicates they should take a recovery action. If a ResponderResult event has a RecoveryResult of 2 and IsRecoveryAttempted of 1, the Responder attempted a recovery action. Usually, you will want to skip the Responder results and go straight to Microsoft-Exchange-ManagedAvailability/RecoveryActionResults, but let's first discuss the events in the Microsoft-Exchange-ManagedAvailability/RecoveryActionLogs event log channel.

Throttling

When a recovery action is attempted by a Responder, it is first checked against throttling limits.
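A hedged way to browse these definition events from PowerShell (the channel name is as given above; the exact property set you get back depends on your Exchange version):

```powershell
# Read Responder definitions from the crimson channel and unwrap the event XML
$definitions = Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition |
    ForEach-Object { ([xml]$_.ToXml()).Event.UserData.EventXML }

# Show the properties discussed above for each Responder
$definitions | Format-Table Name, TypeName, ServiceName, TargetResource, AlertMask -AutoSize
```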
This will result in one of two events in the RecoveryActionLogs channel: 2050, throttling has allowed the operation, or 2051, throttling has rejected the operation. Here's a sample of a 2051 event. In the details, you will see:

ActionId            RestartService
ResourceName        MSExchangeRepl
RequesterName       ServiceHealthMSExchangeReplEndpointRestart
ExceptionMessage    Active Monitoring Recovery action failed. An operation was rejected during local throttling. (ActionId=RestartService, ResourceName=MSExchangeRepl, Requester=ServiceHealthMSExchangeReplEndpointRestart, FailedChecks=LocalMinimumMinutes, LocalMaxInDay)
LocalThrottleResult
<LocalThrottlingResult IsPassed="false" MinimumMinutes="60" TotalInOneHour="1" MaxAllowedInOneHour="-1" TotalInOneDay="1" MaxAllowedInOneDay="1" IsThrottlingInProgress="true" IsRecoveryInProgress="false" ChecksFailed="LocalMinimumMinutes, LocalMaxInDay" TimeToRetryAfter="2015-02-11T14:29:57.9448377-08:00">
  <MostRecentEntry Requester="ServiceHealthMSExchangeReplEndpointRestart" StartTime="2015-02-10T14:29:55.9920032-08:00" EndTime="2015-02-10T14:29:57.9448377-08:00" State="Finished" Result="Succeeded" />
</LocalThrottlingResult>
GroupThrottleResult              <not attempted>
TotalServersInGroup              0
TotalServersInCompatibleVersion  0

Hopefully, you recognize the first few fields. This is the RestartService recovery action, which restarts a service. The ResourceName is used by the recovery action to pick a target; for the RestartService recovery action, it is the name of the service to restart. The RequesterName is the name of the Responder, as listed in the ResponderDefinition or ResponderResult channels.

The LocalThrottleResult property is more interesting. Recovery actions are throttled per server, where the same recovery action cannot run too often on the same server, and per group, where the same recovery action cannot run too often in the same DAG (for the Mailbox role) or AD site (for the Client Access role).
If a value is -1, that level of throttling is not used; for example, MaxAllowedInOneHour is not interesting if only 1 action is allowed per day. In this example, the MSExchangeRepl resource was already the target of a recovery action within the last 60 minutes, and so the recovery action did not pass the LocalMinimumMinutes throttling. As this recovery action attempt was blocked by local throttling, group throttling was not attempted. The following describes each of the limits mentioned in this event, listing the ThrottlingResult attributes together with the corresponding local and group throttle config attribute names:

- IsPassed: True if throttling will allow the recovery action; otherwise, false.
- MinimumMinutes, LocalMinimumMinutes, GroupMinimumMinutes (config attributes LocalMinimumMinutesBetweenAttempts / GroupMinimumMinutesBetweenAttempts): The time that must elapse before this recovery action may act upon the same resource on this server or in this group.
- TotalInOneHour: The number of times this recovery action has acted upon this resource on this server or in this group in the last hour.
- MaxAllowedInOneHour, LocalMaxInHour (config attribute LocalMaximumAllowedAttemptsInOneHour; no group equivalent): The number of times this recovery action is allowed to act upon this resource on this server in one hour.
- TotalInOneDay: The number of times this recovery action has acted upon this resource on this server or in this group in the last 24 hours.
- MaxAllowedInOneDay, LocalMaxInDay, GroupMaxInDay (config attributes LocalMaximumAllowedAttemptsInADay / GroupMaximumAllowedAttemptsInADay): The number of times this recovery action is allowed to act upon this resource on this server or in this group in 24 hours.
- IsRecoveryInProgress, RecoveryInProgress, GroupRecoveryInProgress: Whether this recovery action is already acting upon this resource and has not completed. If true, the new action will be aborted.
- TimeToRetryAfter: The time after which this recovery action would be allowed to act on this resource on this server or in this group.
The GroupThrottleResult has the same fields, and also gives details about the recovery actions that have taken place on the other servers in the group.

If the action is not throttled, event 500 is logged in the Microsoft-Exchange-ManagedAvailability/RecoveryActionResults channel, indicating that the recovery action is beginning. If it succeeds, event 501 is logged. This is the most common case, and where you'll usually want to start. These events also have details about the recovery action that was taken and the throttling it passed. Recovery actions that start and then fail are still counted against throttling limits. For more information about recovery actions, read the What Did Managed Availability Just Do to This Service? article.

Viewing Throttling Limits

So what is the best way to find out what recovery action throttling is in place? You could wait for a Responder to begin a recovery action and view the throttling settings in the RecoveryActionLogs channel, but there are two places that will be more timely. The first is the Microsoft-Exchange-ManagedAvailability/ThrottlingConfig event log channel. The second is the Microsoft-Exchange-ActiveMonitoring/ResponderDefinition channel, introduced in the first section of this article. The advantage of the ThrottlingConfig channel is that you can see all the Responders that can take a particular recovery action grouped together, instead of having to check every Responder definition.
Here's a sample event from the ThrottlingConfig event log channel:

Identity          RestartService/Default/*/*/msexchangefastsearch
RecoveryActionId  RestartService
ResponderCategory Default
ResponderTypeName *
ResponderName     *
ResourceName      msexchangefastsearch
PropertiesXml     <ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="60" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="4" GroupMinimumMinutesBetweenAttempts="-1" GroupMaximumAllowedAttemptsInADay="-1" />

The Identity of a throttling configuration is a concatenation of the next five fields, so let's discuss each. The RecoveryActionId is the Responder's throttling type; you can find this as the name of the ThrottleEntries node in the Responder definition's ThrottlePolicyXml property. The ResponderCategory is unused and is always Default right now. The ResponderTypeName is the Responder's TypeName property. The ResourceName is the object the Responder acts on.

In this example, Responders that use the RestartService recovery action to restart the MSExchangeFastSearch service are allowed to run on any given server up to 4 times a day, as long as it has been 60 minutes since this recovery action last restarted it on that server. Group throttling is not used.

The second method to view throttling limits is via the Microsoft-Exchange-ActiveMonitoring/ResponderDefinition events. This will include any overrides you have in place. Here is the value of the ThrottlePolicyXml property from a ResponderDefinition event:

<ThrottleEntries>
  <RestartService ResourceName="MSExchangeFastSearch">
    <ThrottleConfig Enabled="True" LocalMinimumMinutesBetweenAttempts="60" LocalMaximumAllowedAttemptsInOneHour="-1" LocalMaximumAllowedAttemptsInADay="4" GroupMinimumMinutesBetweenAttempts="-1" GroupMaximumAllowedAttemptsInADay="-1" />
  </RestartService>
</ThrottleEntries>

You can see that these attribute names and values match the ThrottlingConfig event's PropertiesXml values.
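A hedged sketch of pulling these throttling entries for one recovery action with Get-WinEvent; the channel name is as above, and the property names assume the event schema shown in the sample:

```powershell
# Unwrap ThrottlingConfig events and keep only RestartService entries
Get-WinEvent -LogName Microsoft-Exchange-ManagedAvailability/ThrottlingConfig |
    ForEach-Object { ([xml]$_.ToXml()).Event.UserData.EventXML } |
    Where-Object { $_.RecoveryActionId -eq "RestartService" } |
    Format-Table ResourceName, Identity, PropertiesXml -AutoSize
```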
Changing Throttling Limits

There may be times when you want recovery actions to occur more frequently or less frequently. For example, you have a customer report of an outage and you find that a service restart would have fixed it but was throttled, or you have a third-party application that does particularly poorly with application pool resets. To change the throttling configuration, you can use the same Add-ServerMonitoringOverride and Add-GlobalMonitoringOverride cmdlets that work for other Managed Availability overrides. The Customizing Managed Availability article gives a good summary of using these cmdlets. For the PropertyName parameter, the cmdlets support a special syntax for modifying the throttling configuration: instead of specifying the entire XML blob as the override (which will work, but will be harder to read later), you can use ThrottleAttributes.LocalMinimumMinutesBetweenAttempts, or one of the other properties, as the PropertyName. Here's an example:

Add-GlobalMonitoringOverride -ItemType Responder -Identity Search\SearchIndexFailureRestartSearchService -PropertyName ThrottleAttributes.LocalMinimumMinutesBetweenAttempts -PropertyValue 240 -ApplyVersion "15.00.1044.025"

To only allow app pool resets by the ActiveSyncSelfTestRestartWebAppPool Responder every 2 hours instead of every hour, you could use the command:

Add-GlobalMonitoringOverride -ItemType Responder -Identity ActiveSync.Protocol\ActiveSyncSelfTestRestartWebAppPool -PropertyName ThrottleAttributes.LocalMinimumMinutesBetweenAttempts -PropertyValue 120 -ApplyVersion "15.00.1044.025"

If you want your servers to reboot when the MSExchangeIS service crashes and cannot start, at a rate of all of your servers once a day but no more than one in the DAG every 60 minutes, you could use the commands:

Add-GlobalMonitoringOverride -ItemType Responder -Identity Store\StoreServiceKillServer -PropertyName ThrottleAttributes.GroupMinimumMinutesBetweenAttempts -PropertyValue 60 -ApplyVersion
"15.00.1044.025"

Add-GlobalMonitoringOverride -ItemType Responder -Identity Store\StoreServiceKillServer -PropertyName ThrottleAttributes.GroupMaximumAllowedAttemptsInADay -PropertyValue -1 -ApplyVersion "15.00.1044.025"

The LocalMaximumAllowedAttemptsInADay value is already 1, so each server would still reboot at most once per day. If the override was entered correctly, the ResponderDefinition event's ThrottlePolicyXml value will be updated, and there will be a new entry in the ThrottlingConfig channel.

These may be poor examples, but it is hard to pick good ones, as the Exchange developers pick values for the throttling configuration based on our experience running Exchange in Office 365. We don't expect that changing these values is something you'll want to do very often, but it is usually a better idea than disabling a monitor or a recovery action altogether. If you do have a scenario where you need to keep a throttling limit override in place, we would love to hear about it.

Abram Jackson
Program Manager, Exchange Server

Managed Availability Probes
Probes are one of the three critical parts of the Managed Availability framework (monitors and responders are the other two). As I wrote previously, monitors are the central components, and you can query monitors to find an up-to-the-minute view of your users' experience. Probes are how monitors obtain accurate information about that experience. There are three major categories of probes: recurrent probes, notifications, and checks.

Recurrent Probes

The most common probes are recurrent probes. Each one runs every few minutes and checks some aspect of service health. They may transmit an e-mail to a monitoring mailbox using Exchange ActiveSync, connect to an RPC endpoint, or establish CAS-to-Mailbox server connectivity. All of these probes are defined in the Microsoft-Exchange-ActiveMonitoring/ProbeDefinition event log channel each time the Exchange Health Manager service is started. The most interesting properties of these events are:

- Name: The name of the Probe. This will begin with the SampleMask of the Probe's Monitor.
- TypeName: The code object type of the probe that contains the probe's logic.
- ServiceName: The name of the Health Set for this Probe.
- TargetResource: The object this Probe is validating. This is appended to the Name of the Probe when it is executed to become a Probe Result ResultName.
- RecurrenceIntervalSeconds: How often this Probe executes.
- TimeoutSeconds: How long this Probe should wait before failing.

On a typical Exchange 2013 multi-role server, there are hundreds of these probes defined. Many probes are per-database, so this number increases quickly as you add databases. In most cases, the logic in these probes is defined in code and not directly discoverable. However, two probe types are common enough to describe in detail, based on the TypeName of the probe:

- Microsoft.Exchange.Monitoring.ActiveMonitoring.ServiceStatus.Probes.GenericServiceProbe: Determines whether the service specified by TargetResource is running.
- Microsoft.Exchange.Monitoring.ActiveMonitoring.ServiceStatus.Probes.EventLogProbe: Logs an error result if an event specified by ExtensionAttributes.RedEventIds has occurred in the ExtensionAttributes.LogName. Success results are logged if one of the ExtensionAttributes.GreenEventIds is logged. These probes will not work if you override them to watch for a different event.

The basics of a recurrent probe are as follows: start every RecurrenceIntervalSeconds and check (or probe) some aspect of component health. If the component is healthy, the probe passes and writes an informational event to the Microsoft-Exchange-ActiveMonitoring/ProbeResult channel with a ResultType of 3. If the check fails or times out, the probe fails and writes an error event to the same channel: a ResultType of 4 means the check failed, and a ResultType of 1 means it timed out. Many probes will re-run if they time out, up to their MaxRetryAttempts property. The ProbeResult channel gets very busy, with hundreds of probes running every few minutes and logging an event, so there can be a real impact on the performance of your Exchange server if you perform expensive queries against this event channel in a production environment.

Notifications

Notifications are probes that are not run by the health manager framework, but by some other service on the server. These services perform their own monitoring and then feed data into the Managed Availability framework by directly writing probe results. You will not see these probes in the ProbeDefinition channel, as that channel only describes probes that are run within the Managed Availability framework. For example, the ServerOneCopyMonitor Monitor is triggered by probe results written by the MSExchangeDagMgmt service. This service performs its own monitoring, determines whether there is a problem, and logs a probe result.
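Whatever their origin, probe results land in the ProbeResult channel. Here is a hedged sketch of pulling recent failed results (ResultType 4, per the codes above) with an XPath filter so the busy channel is filtered server-side; run it sparingly on production servers for the reason just mentioned:

```powershell
# Server-side XPath filter: only failed probe results (ResultType = 4)
$failed = Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult `
    -FilterXPath "*[UserData[EventXML[ResultType='4']]]" -MaxEvents 50

# Unwrap the event XML and show which probes failed and why
$failed | ForEach-Object { ([xml]$_.ToXml()).Event.UserData.EventXML } |
    Format-Table ServiceName, ResultName, Error -AutoSize
```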
Most Notification probes have the capability to log both a red event that turns the Monitor Unhealthy and a green event that makes the Monitor Healthy once more.

Checks

Checks are probes that only log events when a performance counter passes above or below a defined threshold. They are really a special type of Notification probe, as there is a service monitoring the performance counters on the server and logging events to the ProbeResult channel when the configured threshold is met. To find the counter and threshold that is considered unhealthy, you can look at Monitor definitions with a Type property of:

- Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueAboveThresholdMonitor, or
- Microsoft.Office.Datacenter.ActiveMonitoring.OverallConsecutiveSampleValueBelowThresholdMonitor

Either of these means that the probe the Monitor watches is a Check probe.

How this works with Monitors

From the Monitor's perspective, all three probe types are the same, as they each log to the ProbeResult channel. Every Monitor has a SampleMask property in its definition. As the Monitor executes, it looks for events in the ProbeResult channel that have a ResultName matching the Monitor's SampleMask. These events could be from recurrent probes, notifications, or checks. If the Monitor's thresholds are reached or exceeded, it becomes Unhealthy.

It is worth noting that a single probe failure does not necessarily indicate that something is wrong with the server. Monitors are designed to distinguish a real problem that needs fixing from a transient issue that resolves itself or is anomalous. This is why many Monitors have thresholds of multiple probe failures before becoming Unhealthy. Even many of these problems can be fixed automatically by Responders, so the best place to look for problems that require manual intervention is the Microsoft-Exchange-ManagedAvailability/Monitoring crimson channel.
These events sometimes also include the most recent probe error message (if the developers of that Health Set view it as relevant when they get paged with that event's text in Office 365). There are more details on how Monitors work, and how they can be overridden to use different thresholds, in the Managed Availability Monitors article.

Abram Jackson
Program Manager, Exchange Server

Exchange 2013 database schema updates
Recently, we have seen some questions about what the Update-DatabaseSchema cmdlet in Exchange 2013 is for, so I thought I would share some additional information on the subject.

The Update-DatabaseSchema cmdlet is part of the infrastructure that we've built into Exchange 2013 to safely upgrade the database schema in a DAG deployment. Unlike previous releases, a database schema upgrade in Exchange 2013 can only occur after all DAG members are upgraded to a version of software that supports the schema version, and there is control over when the schema upgrade occurs (setting RequestedDatabaseSchemaVersion to a value higher than CurrentSchemaVersion, up to the MaximumSupportableDatabaseSchemaVersion supported by all members of the DAG). The RequestedDatabaseSchemaVersion of a database cannot be incremented to a value higher than the minimum MaximumSupportableDatabaseSchemaVersion supported by any DAG member. This design prevents issues like those encountered when upgrading Exchange 2010 DAG members to post-RTM service packs, where mounting a database on an upgraded server automatically upgraded its schema version, so the database could no longer be mounted on a server that had not yet been upgraded.

The initial database schema version is based on the server version(s) deployed in the DAG. The Exchange 2013 RTM database schema version is 0.121 and can be displayed using Get-MailboxDatabase or Get-MailboxDatabaseCopyStatus in CU2 and later. MaximumSupportableDatabaseSchemaVersion has incremented in each CU release, so databases created with server versions after RTM may be created with a schema version higher than 0.121. Prior to CU3, the Update-DatabaseSchema cmdlet could be used to manually set the RequestedDatabaseSchemaVersion value higher than CurrentSchemaVersion (the version at database creation).
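As a sketch of that manual step (check the cmdlet's reference for your CU before running it; DB1 is a placeholder database name, and the version numbers mirror the 0.121 to 0.126 range discussed in this post):

```powershell
# Placeholder database name; requests a schema upgrade to version 0.126,
# which is applied serially at the next database mount
Update-DatabaseSchema -Identity "DB1" -MajorVersion 0 -MinorVersion 126

# Confirm the current and requested schema versions
Get-MailboxDatabase "DB1" -Status | Format-List CurrentSchemaVersion, RequestedSchemaVersion
```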
In CU3, setup (during a build-to-build upgrade) was modified to automatically request an upgrade of the database schema version on existing databases to the MaximumSupportableDatabaseSchemaVersion (0.126) for databases created with a lower schema version. By design, the attempt to set RequestedDatabaseSchemaVersion to 0.126 only succeeds when the last member of the DAG is upgraded to CU3. All database schema upgrades are serial and are performed at mount time after a RequestedDatabaseSchemaVersion value is set, so an upgrade from 0.121 (RTM) to 0.126 (CU3) involves 5 distinct schema upgrades (transactions).

It should be noted that database schema upgrades only involve the global tables in a database. There is also a schema associated with the tables belonging to each mailbox, and a mailbox schema upgrade can modify any table associated with a mailbox. After a database schema upgrade is performed during database mount, the corresponding mailbox schema upgrades are performed on subsequent logons to each mailbox. The schema version of a mailbox can be displayed using Get-MailboxStatistics and will match the database schema version after the first logon following the database schema upgrade.

We internally have the ability to explicitly control the MaximumSupportedDatabaseSchemaVersion for each target environment (test, dogfood, production service, on-premises) where an Exchange server can be deployed, and we only increment the value supported in an environment in a given build after we have built high confidence in that change. We progressively built that confidence in our test, dogfood, and Exchange Online environments, and completed in-place database schema upgrades in Exchange Online prior to shipping CU3. It was this validation in our production service that led to the decision to enable this automated upgrade for our on-premises customers, so that they could begin to reap the benefits enabled by these schema changes.
This same validation will be performed for any schema upgrades included with future CU/SP releases.

You might ask yourself at this point: what are those benefits? Since the release of Exchange 2013, we have used database schema upgrades to help tweak performance at the database level, and we envision that we will continue to do so in the future. Another thing to note is that we will not automatically increment the version at every release (cumulative update or service pack), but will change the schema only when there is a specific benefit to be had.

The following shows the cmdlets that can be used to display the schema versions supported by the servers hosting each database copy, the schema version of each database, and the schema version of each mailbox:

[PS] D:\data\scripts>$identity = "forest noll"
[PS] D:\data\scripts>$m = get-mailbox $identity
[PS] D:\data\scripts>Get-MailboxDatabaseCopyStatus $m.database | FL Identity,status,*schema*

Identity                              : D12 MBX Store 18\15M31
Status                                : Mounted
MinimumSupportedDatabaseSchemaVersion : 0.121
MaximumSupportedDatabaseSchemaVersion : 0.126
RequestedDatabaseSchemaVersion        :

Identity                              : D12 MBX Store 18\D15M41
Status                                : Healthy
MinimumSupportedDatabaseSchemaVersion : 0.121
MaximumSupportedDatabaseSchemaVersion : 0.126
RequestedDatabaseSchemaVersion        :

Identity                              : D12 MBX Store 18\15M30
Status                                : Healthy
MinimumSupportedDatabaseSchemaVersion : 0.121
MaximumSupportedDatabaseSchemaVersion : 0.126
RequestedDatabaseSchemaVersion        :

Identity                              : D12 MBX Store 18\D15M40
Status                                : ServiceDown
MinimumSupportedDatabaseSchemaVersion :
MaximumSupportedDatabaseSchemaVersion :
RequestedDatabaseSchemaVersion        :

[PS] D:\data\scripts>Get-MailboxDatabase $m.database -status | FL *schema*

CurrentSchemaVersion   : 0.126
RequestedSchemaVersion : 0.126

[PS] D:\data\scripts>Get-MailboxStatistics $m | FL *schema*

CurrentSchemaVersion : 0.126

Hopefully this helps you understand what this is for!

Todd Luttinen

Customizing Managed Availability
Exchange Server 2013 introduces a new feature called Managed Availability, which is a built-in monitoring system with self-recovery capabilities. If you’re not familiar with Managed Availability, it’s a good idea to read these posts:

Lessons from the Datacenter: Managed Availability
What Did Managed Availability Just Do To This Service?

As described in the above posts, Managed Availability performs continuous probing to detect possible problems with Exchange components or their dependencies, and it performs recovery actions to make sure the end-user experience is not impacted by a problem with any of these components. However, there may be scenarios where the out-of-box settings are not suitable for your environment. This blog post guides you through examining the default settings and modifying them to suit your environment.

Managed Availability Components

Let’s start by finding out which health sets are on an Exchange server:

Get-HealthReport -Identity Exch2

This produces output similar to the following:

Next, use Get-MonitoringItemIdentity to list the probes, monitors, and responders related to a health set. For example, the following command lists the probes, monitors, and responders included in the FrontendTransport health set:

Get-MonitoringItemIdentity -Identity FrontendTransport -Server exch1 | ft name,itemtype –AutoSize

This produces output similar to the following:

You might notice multiple probes with the same name for some components. That’s because Managed Availability creates a probe for each resource. In the following example, you can see that OutlookRpcSelfTestProbe is created multiple times (one for each mailbox database present on the server).
Use Get-MonitoringItemIdentity to list the monitoring item identities along with the resource for which they are created:

Get-MonitoringItemIdentity -Identity Outlook.Protocol -Server exch1 | ft name,itemtype,targetresource –AutoSize

Customize Managed Availability

Managed Availability components (probes, monitors and responders) can be customized by creating an override. There are two types of override: local and global. As their names imply, a local override is available only on the server where it is created, while a global override is used to deploy an override across multiple servers. Either kind of override can be created for a specific duration or for a specific version of servers.

Local Overrides

Local overrides are managed with the *-ServerMonitoringOverride set of cmdlets. They are stored under the following registry path:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides\

The Microsoft Exchange Health Manager service reads this registry path every 10 minutes and loads configuration changes. Alternatively, you can restart the service to make a change effective immediately. You would usually create a local override to:

Customize a managed availability component that is server-specific and not available globally; or
Customize a managed availability component on a specific server.

Global Overrides

Global overrides are managed with the *-GlobalMonitoringOverride set of cmdlets. They are stored in the following container in Active Directory:

CN=Overrides,CN=Monitoring Settings,CN=FM,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=Contoso,DC=com

Get Configuration Details

The configuration details of most probes, monitors, and responders are stored in the respective crimson channel event log for each monitoring item identity; examine these first before deciding what to change.
In this example, we will explore the properties of a probe named “OnPremisesInboundProxy”, which is part of the FrontendTransport health set. The following script lists the details of the OnPremisesInboundProxy probe:

(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "OnPremisesInboundProxy"}

You can also use Event Viewer to get the details of the probe definition. The configuration details of most probes are stored in the ProbeDefinition channel:

Open Event Viewer, and then expand Applications and Services Logs\Microsoft\Exchange\ActiveMonitoring\ProbeDefinition.
Click Find, and then enter OnPremisesInboundProxy.

The General tab does not show much detail, so click the Details tab, which has the configuration details specific to this probe. Alternatively, you can copy the event details as text and paste them into Notepad or your favorite editor to see the details.

Override Scenarios

Let’s look at a couple of real-life scenarios and apply what we have learned so far to customize Managed Availability to our liking, starting with local overrides.

Creating a Local Override

In this example, an administrator has customized one of the inbound Receive connectors by removing the binding of the loopback IP address. Later, they discover that the FrontEndTransport health set is unhealthy. On further digging, they determine that the OnPremisesInboundProxy probe is failing. To figure out why the probe is failing, first list the configuration details of the OnPremisesInboundProxy probe.
(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "OnPremisesInboundProxy"}

Name                      : OnPremisesInboundProxy
WorkItemVersion           : [null]
ServiceName               : FrontendTransport
DeploymentId              : 0
ExecutionLocation         : [null]
CreatedTime               : 2013-08-06T12:54:29.7571195Z
Enabled                   : 1
TargetPartition           : [null]
TargetGroup               : [null]
TargetResource            : [null]
TargetExtension           : [null]
TargetVersion             : [null]
RecurrenceIntervalSeconds : 300
TimeoutSeconds            : 60
StartTime                 : 2013-08-06T12:54:36.7571195Z
UpdateTime                : 2013-08-06T12:48:27.1418660Z
MaxRetryAttempts          : 1
ExtensionAttributes       : <ExtensionAttributes><WorkContext><SmtpServer>127.0.0.1</SmtpServer><Port>25</Port><HeloDomain>InboundProxyProbe</HeloDomain><MailFrom Username="inboundproxy@contoso.com"/><MailTo Select="All" Username="HealthMailboxdd618748368a4935b278e884fb41fd8a@FM.com"/><Data AddAttributions="false">X-Exchange-Probe-Drop-Message:FrontEnd-CAT-250 Subject:Inbound proxy probe</Data><ExpectedConnectionLostPoint>None</ExpectedConnectionLostPoint></WorkContext></ExtensionAttributes>

The ExtensionAttributes property above shows that the probe is using 127.0.0.1 to connect to port 25. As that is the loopback address, the administrator needs to change the SMTP server in the ExtensionAttributes property to enable the probe to succeed. Use the following command to create a local override that changes the SMTP server to the hostname instead of the loopback IP address.
Add-ServerMonitoringOverride -Server ServerName -Identity FrontEndTransport\OnPremisesInboundProxy -ItemType Probe -PropertyName ExtensionAttributes -PropertyValue '<ExtensionAttributes><WorkContext><SmtpServer>Exch1.contoso.com</SmtpServer><Port>25</Port><HeloDomain>InboundProxyProbe</HeloDomain><MailFrom Username="inboundproxy@contoso.com" /><MailTo Select="All" Username="HealthMailboxdd618748368a4935b278e884fb41fd8a@FM.com" /><Data AddAttributions="false">X-Exchange-Probe-Drop-Message:FrontEnd-CAT-250 Subject:Inbound proxy probe</Data><ExpectedConnectionLostPoint>None</ExpectedConnectionLostPoint></WorkContext></ExtensionAttributes>' -Duration 45.00:00:00

The override will be created on the specified server under the following registry path:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ExchangeServer\v15\ActiveMonitoring\Overrides\Probe

You can use the following command to verify that the override is effective:

(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "OnPremisesInboundProxy"}
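Besides checking the probe definition, you can also list the overrides themselves to confirm what was created. A minimal sketch (the server name Exch1 is a placeholder for your own server):

```powershell
# List all local overrides defined on this server; the output shows which
# item type and property each override changes, and its value.
Get-ServerMonitoringOverride -Server Exch1 | Format-List

# The Health Manager service re-reads the Overrides registry key every 10 minutes;
# restart it to make the new override take effect immediately.
Restart-Service MSExchangeHM
```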
To find out the details of EWSProxyTestProbe, run the following:

(Get-WinEvent -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ?{$_.Name -like "EWSProxyTestProbe"}

Next, change the timeout interval for EWSProxyTestProbe to 25 seconds on all servers running Exchange Server 2013 RTM CU2.

Use the following command to get version information for Exchange 2013 RTM CU2 servers:

Get-ExchangeServer | ft name,admindisplayversion

Use the following command to create a new global override:

Add-GlobalMonitoringOverride -Identity "EWS.Proxy\EWSProxyTestProbe" -ItemType Probe -PropertyName TimeoutSeconds -PropertyValue 25 –ApplyVersion “15.0.712.24”

Override Durations

Either of the above overrides can be created for a specific duration or for a specific version of Exchange servers. An override created with the Duration parameter is effective only for the period specified, and the maximum duration that can be specified is 60 days. For example, an override created with a Duration of 45.00:00:00 will be effective for 45 days from the time of creation.

A version-specific override is effective as long as the Exchange server version matches the value specified. For example, an override created for Exchange 2013 CU1 with version “15.0.620.29” will be effective until the Exchange server version changes; the override becomes ineffective once the server is upgraded to a different Cumulative Update or Service Pack. Hence, if you need an override to remain in effect for a longer period, create it using the ApplyVersion parameter.

Removing an Override

Finally, this last example shows how to remove the local override that was created for the OnPremisesInboundProxy probe.
Remove-ServerMonitoringOverride -Server ServerName -Identity FrontEndTransport\OnPremisesInboundProxy -ItemType Probe -PropertyName ExtensionAttributes

Conclusion

Managed Availability performs gradual recovery actions to automatically recover from failure scenarios. Overrides help you customize the configuration of Managed Availability components to suit your environment. The steps in this document can be used to customize Monitors and Probes as required.

Special thanks to Abram Jackson, Scott Schnoll, Ben Winzenz, and Nino Bilic for reviewing this post.

Bhalchandra Atre

Managed Availability and Server Health
Every second on every Exchange 2013 server, Managed Availability polls and analyzes hundreds of health metrics. If something is found to be wrong, most of the time it will be fixed automatically. But of course there will always be issues that Managed Availability won’t be able to fix on its own. In those cases, Managed Availability will escalate the issue to an administrator by means of event logging, and perhaps alerting if System Center Operations Manager is used in tandem with Exchange 2013. When an administrator needs to get involved and investigate the issue, they can begin by using the Get-HealthReport and Get-ServerHealth cmdlets.

Server Health Summary

Start with Get-HealthReport to find out the status of every Health Set on the server:

Get-HealthReport –Identity <ServerName>

This will result in the following output (truncated for brevity):

Server  State         HealthSet       AlertValue LastTransitionTime MonitorCount
------  -----         ---------       ---------- ------------------ ------------
Server1 NotApplicable AD              Healthy    5/21/2013 12:23    14
Server1 NotApplicable ECP             Unhealthy  5/26/2013 15:40    2
Server1 NotApplicable EventAssistants Healthy    5/29/2013 17:51    40
Server1 NotApplicable Monitoring      Healthy    5/29/2013 17:21    9
…       …             …               …          …                  …

In the above example, you can see that the ECP (Exchange Control Panel) Health Set is Unhealthy. And based on the value of MonitorCount, you can also see that the ECP Health Set relies on two Monitors. Let's find out if both of those Monitors are Unhealthy.

Monitor Health

The next step is to use Get-ServerHealth to determine which of the ECP Health Set Monitors are in an unhealthy state:

Get-ServerHealth –Identity <ServerName> –HealthSet ECP

This results in the following output:

Server  State         Name               TargetResource HealthSetName AlertValue ServerComponent
------  -----         ----               -------------- ------------- ---------- ---------------
Server1 NotApplicable EacSelfTestMonitor                ECP           Unhealthy  None
Server1 NotApplicable EacDeepTestMonitor                ECP           Unhealthy  None

As you can see above, both Monitors are Unhealthy.
As an aside, if you pipe the above command to Format-List, you can get even more information about these Monitors.

Troubleshooting Monitors

Most Monitors are one of four types. The EacSelfTestMonitor probes along the "1" path, while the EacDeepTestMonitor probes along the "4" path. Since both are unhealthy, this indicates that the problem lies on the Mailbox server, in either the protocol stack or the store. It could also be a problem with a dependency, such as Active Directory, which is common when multiple Health Sets are unhealthy. In this case, the Troubleshooting ECP Health Set topic would be the best resource to help diagnose and resolve this issue.

Abram Jackson
Program Manager, Exchange Server
What Did Managed Availability Just Do To This Service?

We in the Exchange product group get this question from time to time. The first thing we ask in response is always, “What was the customer impact?” In some cases, there is customer impact; these may indicate bugs that we are motivated to fix. However, in most cases there was no customer impact: a service restarted, but no one noticed. We have learned while operating the world’s largest Exchange deployment that it is fantastic when something is fixed before customers even notice. This is so desirable that we are willing to have a few extra service restarts as long as no customers are impacted. You can see this same philosophy at work in our approach to database failovers since Exchange 2007. The mantra we have come to repeat is, “Stuff breaks, but the user experience doesn’t!” User experience is our number one priority at all times. Individual service uptime on a server is a less important goal, as long as the user experience remains satisfactory.

However, there are cases where Managed Availability cannot fix the problem. In cases like these, Exchange provides a huge amount of information about what the problem might be. Hundreds of things are checked and tested every minute. Usually, Get-HealthReport and Get-ServerHealth will be sufficient to find the problem, but this blog post will walk you through getting the full details, from an automatic recovery action down to the results of all the probes, by:

Finding the Managed Availability Recovery Actions that have been executed for a given service.
Determining the Monitor that triggered the Responder.
Retrieving the Probes that the Monitor uses.
Viewing any error messages from the Probes.

Finding Recovery Actions

Every time Managed Availability takes a recovery action, such as restarting a service or failing over a database, it logs an event in the Microsoft.Exchange.ManagedAvailability/RecoveryActions crimson channel. Event 500 indicates that a recovery action has begun.
Event 501 indicates that the action that was taken has completed. These can be collected via the MMC Event Viewer, but we usually find it more useful to use PowerShell. All of these Managed Availability recovery actions can be collected in PowerShell with a simple command:

$RecoveryActionResultsEvents = Get-WinEvent –ComputerName <Server> -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionResults

We can use the events in this format, but it is easier to work with the event properties if we use PowerShell’s native XML format:

$RecoveryActionResultsXML = ($RecoveryActionResultsEvents | Foreach-object -Process {[XML]$_.toXml()}).event.userData.eventXml

Some of the useful properties of these Recovery Action events are:

Id: The action that was taken. Common values are RestartService, RecycleAppPool, ComponentOffline, or ServerFailover.
State: Whether the action has started (event 500) or finished (event 501).
ResourceName: The object that was affected by the action. This will be the name of a service for RestartService actions, or the name of a server for server-level actions.
EndTime: The time the action completed.
Result: Whether the action succeeded or not.
RequestorName: The name of the Responder that took the action.

So, for example, if you wanted to know why MSExchangeRepl was restarted on your server around 9:30 PM, you could run a command like this:

$RecoveryActionResultsXML | Where-Object {$_.State -eq "Finished" -and $_.ResourceName –eq "MSExchangeRepl" -and $_.EndTime -like "2013-06-12T21*"} | ft -AutoSize StartTime,RequestorName

This results in the following output:

StartTime                    RequestorName
---------                    -------------
2013-05-12T21:49:18.2113618Z ServiceHealthMSExchangeReplEndpointRestart

The RequestorName property indicates the name of the Responder that took the action. In this case, it was ServiceHealthMSExchangeReplEndpointRestart. Often, the Responder name will give you an indication of the problem. Other times, you will want more details.
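Before drilling into one specific action, it can also help to see which Responders fire most often on a server. This sketch builds on the same RecoveryActionResults channel used above (Exch1 is a placeholder server name):

```powershell
# Collect completed recovery actions and count them per action type and Responder.
$events = Get-WinEvent -ComputerName Exch1 -LogName Microsoft-Exchange-ManagedAvailability/RecoveryActionResults
$xml = ($events | ForEach-Object { [XML]$_.ToXml() }).event.userData.eventXml

# Only event 501 ("Finished") entries represent actions that actually ran to completion.
$xml | Where-Object { $_.State -eq "Finished" } |
    Group-Object Id, RequestorName |
    Sort-Object Count -Descending |
    Format-Table Count, Name -AutoSize
```

A Responder that shows up repeatedly in this summary is a good starting point for the Monitor and Probe investigation described in the next sections.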
Finding the Monitor that Triggers a Responder

Monitors are the central part of Managed Availability. They are the primary means, through Get-ServerHealth and Get-HealthReport, by which an administrator can learn the health of a server. Recall that a Health Set is a grouping of related Monitors. This is why much of our troubleshooting documentation is focused on these objects. It will often be useful to know which Monitors and Health Sets are repeatedly unhealthy in your environment.

Every time the Health Manager service starts, it logs events to the Microsoft.Exchange.ActiveMonitoring/ResponderDefinition crimson channel, which we can use to get the properties of the Responder we found in the last step via its RequestorName. First, we need to collect the Responders that are defined:

$DefinedResponders = (Get-WinEvent –ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ResponderDefinition | % {[xml]$_.toXml()}).event.userData.eventXml

One of these Responder definitions will match the Recovery Action’s RequestorName. The Monitor that controls the Responder we are interested in is defined by the AlertMask property of that definition. Here are some of the useful Responder definition properties:

TypeName: The full code name of the recovery action that will be taken when this Responder executes.
Name: The name of the Responder.
TargetResource: The object this Responder will act on.
AlertMask: The Monitor for this Responder.
WaitIntervalSeconds: The minimum amount of time to wait before this Responder can be executed again. There are other forms of throttling that will also affect this Responder.

To get the Monitor for the ServiceHealthMSExchangeReplEndpointRestart Responder, you run:

$DefinedResponders | ? {$_.Name –eq "ServiceHealthMSExchangeReplEndpointRestart"} | ft -a Name,AlertMask

This results in the following output:

Name                                       AlertMask
----                                       ---------
ServiceHealthMSExchangeReplEndpointRestart ServiceHealthMSExchangeReplEndpointMonitor

Many Monitor names will give you an idea of what to look for. In this case, the ServiceHealthMSExchangeReplEndpointMonitor Monitor does not tell you much more than the Responder name did. The TechNet article on troubleshooting the DataProtection Health Set lists this Monitor and suggests running Test-ReplicationHealth. However, you can also get the exact error messages of the Probes for this Monitor with a couple more commands.

Finding the Probes for a Monitor

Remember that Monitors have their definitions written to the Microsoft.Exchange.ActiveMonitoring/MonitorDefinition crimson channel, so you can get these in a similar way to the Responder definitions in the last step:

$DefinedMonitors = (Get-WinEvent –ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/MonitorDefinition | % {[xml]$_.toXml()}).event.userData.eventXml

Some useful properties of a Monitor definition are:

Name: The name of this Monitor. This is the same name reported by Get-ServerHealth.
ServiceName: The name of the Health Set for this Monitor.
SampleMask: The substring that all Probes for this Monitor will have in their names.
IsHaImpacting: Whether this Monitor should be included when HaImpactingOnly is specified by Get-ServerHealth or Get-HealthReport.

To get the SampleMask for the identified Monitor, you can run:

($DefinedMonitors | ? {$_.Name -eq ‘ServiceHealthMSExchangeReplEndpointMonitor’}).SampleMask

This results in the following output:

ServiceHealthMSExchangeReplEndpointProbe

Now that we know which Probes to look for, we can search the Probes’ definition channel. Useful properties of Probe definitions are:

Name: The name of the Probe. This will begin with the SampleMask of the Probe’s Monitor.
ServiceName: The Health Set for this Probe.
TargetResource: The object this Probe is validating. This is appended to the name of the Probe when it is executed to become a Probe Result ServiceName.
RecurrenceIntervalSeconds: How often this Probe executes.
TimeoutSeconds: How long this Probe should wait before failing.

To get the definitions of this Monitor’s Probes, you can run:

(Get-WinEvent –ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeDefinition | % {[XML]$_.toXml()}).event.userData.eventXml | ? {$_.Name -like “ServiceHealthMSExchangeReplEndpointProbe*”} | ft -a Name, TargetResource

This results in the following output:

Name                                                   TargetResource
----                                                   --------------
ServiceHealthMSExchangeReplEndpointProbe/ServerLocator MSExchangeRepl
ServiceHealthMSExchangeReplEndpointProbe/RPC           MSExchangeRepl
ServiceHealthMSExchangeReplEndpointProbe/TCP           MSExchangeRepl

Remember, not all Monitors use synthetic transactions via Probes; see this blog post for the other ways Monitors collect their information. This Monitor has three Probes that can cause it to become Unhealthy. You’ll see that each is named with the Monitor’s SampleMask and then differentiated. When getting the Probe Results in the next step, the Probes will also have the TargetResource in their ServiceName. We now know all the Probes that could have failed, but we don’t yet know which did, or why.

Getting Probe Error Messages

There are many Probes and they execute often, so the channel where they are logged (Microsoft.Exchange.ActiveMonitoring/ProbeResult) generates a lot of data. There will often only be a few hours of data, but the Probes we are interested in will probably have a few hundred Result entries. Here are some of the Probe Result properties you may be interested in for troubleshooting:

ServiceName: The Health Set of this Probe.
ResultName: The name of this Probe, including the Monitor’s SampleMask, an identifier of the code this Probe executes, and the resource it verifies. The target resource is appended to the Probe name we found in the previous step; in this example, we append /MSExchangeRepl to get ServiceHealthMSExchangeReplEndpointProbe/RPC/MSExchangeRepl.
Error: The error returned by this Probe, if it failed.
Exception: The call stack of the error, if it failed.
ResultType: An integer that indicates one of these values:
1: Timeout
2: Poisoned
3: Succeeded
4: Failed
5: Quarantined
6: Rejected
ExecutionStartTime: When the Probe started.
ExecutionEndTime: When the Probe completed.
ExecutionContext: Additional information about the Probe’s execution.
FailureContext: Additional information about the Probe’s failure.

Some Probes may use other available fields to provide additional data about failures. We can use XPath to filter the large number of events down to just the ones we are interested in: those with the ResultName we identified in the last step and a ResultType of 4, indicating that they failed:

$replEndpointProbeResults = (Get-WinEvent –ComputerName <Server> -LogName Microsoft-Exchange-ActiveMonitoring/ProbeResult -FilterXPath "*[UserData[EventXML[ResultName='ServiceHealthMSExchangeReplEndpointProbe/RPC/MSExchangeRepl'][ResultType='4']]]" | % {[XML]$_.toXml()}).event.userData.eventXml

To get a nice graphical view of the Probe’s errors, you can run:

$replEndpointProbeResults | select -Property *Time,Result*,Error*,*Context,State* | Out-GridView

In this case, the full error message for both Probe Results suggests making sure the MSExchangeRepl service is running. This actually is the problem, as for this scenario I restarted the service manually.

Summary

This article is a detailed look at how you have access to an incredible amount of information about the health of Exchange servers. Hopefully, you will not often need it!
In most cases, the alerts will be enough notification and the included cmdlets will be sufficient for investigation. Managed Availability is built and hardened at scale, and we continuously analyze these same events so that we can either fix root causes or write Responders to fix more problems before users are impacted. In those cases where you do need to investigate a problem in detail, we hope this post is a good starting point.

Abram Jackson