Modern Service Management Blog Series Part 2: Monitoring and Major Incident Management
Published Mar 29 2017 07:13 AM
Microsoft

[This is the second blog post from our blog series on Modern Service Management for Office 365. These insights and best practices are brought to you by Carroll Moon, Senior Architect for Modern Service Management.]

 

In the initial blog post in this series, we framed the Office 365 Service Management discussion into five categories:

  1. Monitoring and Major Incident Management...knowing if your users are impacted (regardless of root cause) and ensuring that the right things happen without heroics when users are impacted
  2. Evergreen Management...being ready to successfully absorb the changes and to achieve business value from the evergreen service
  3. Service Desk and Normal Incident Management...being ready to support Office 365 end-users leveraging the automation investments from the Office 365 service and being able to measure the call and escalation rates driven by your users on-premises and in the cloud
  4. Administration and Feature Management...managing the workloads and configurations thereof through the Admin Portal as well as programmatic management
  5. Business Consumption and Productivity...a higher order focus on the business to drive transformation using Office 365 capabilities to drive more business, more productivity, and lower costs

 

This blog post will focus on Monitoring and Major Incident Management for Office 365.  For more thoughts on overarching cloud monitoring, read the eleven posts in this blog series that Microsoft wrote for ITIL. 

 

Monitoring in the realm of Major Incident Management

Monitoring is a broad topic.  For now, we will focus on “Availability and Performance Monitoring” for Office 365.  Receiving monitoring alerts without a downstream action and workflow will not accomplish much, so we will focus on Availability and Performance Monitoring within the Major Incident Management workflows that it supports.  We will use the following diagram to help in the discussion:

Service Management.png

In the diagram above, users on the customer premises connect to Office 365 via “A” through ExpressRoute and via “B”+“C” over the internet.  Many customers also have users who connect directly from the internet in addition to connecting from customer premises.

 

Major Incident Management Scenarios and Portal Specificity

From a Major Incident scenario perspective, if we focus on “cloud only” rather than “hybrid” for simplicity, there are only three Major Incident scenarios:

  I. (Customer has help desk calls OR end-to-end alerts) AND (Microsoft posts something for the customer’s tenant)
  II. (Customer has help desk calls OR end-to-end alerts) AND (Microsoft has NOT posted something for the customer’s tenant)
  III. (Customer has NEITHER help desk calls NOR end-to-end alerts) AND (Microsoft posts something for the customer’s tenant)
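The three scenarios reduce to two yes/no questions, which can be written down as a small decision function.  The following is only a sketch; the function and parameter names are hypothetical, and the `None` result for "no signal on either side" is an added assumption rather than a fourth scenario from this post.

```python
from typing import Optional


def classify_major_incident(customer_signal: bool, microsoft_posted: bool) -> Optional[str]:
    """Return "I", "II", or "III", or None when neither side reports anything.

    customer_signal:  help desk calls OR end-to-end alerts are present.
    microsoft_posted: Microsoft has posted something for this tenant.
    """
    if customer_signal and microsoft_posted:
        return "I"
    if customer_signal:
        return "II"
    if microsoft_posted:
        return "III"
    return None  # assumption: nothing to act on


# Example: help desk calls are coming in, but nothing is posted for the tenant yet.
help_desk_calls, end_to_end_alerts = True, False
print(classify_major_incident(help_desk_calls or end_to_end_alerts, microsoft_posted=False))  # prints II
```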

  

Now is a good time to speak to tenant specificity in the Office 365 Service Health Dashboard and Message Center.  Most people do not know that the communications dashboards are tenant-specific.  We do not have humans writing millions of paragraphs to publish uniquely to each tenant.  Rather, we write one paragraph and publish it to all relevant, possibly impacted tenants.  That is why we have an authenticated dashboard experience.  If we have the admin log in, we know who the admin is.  If we know who the admin is, we know the tenant.  And if we know the tenant, we know the capacity that the tenant’s users depend upon.  Thus, we can direct communications to the appropriate tenants as necessary.  Our systems allow us to post to a single tenant, to every tenant on the planet, or more likely, to a subset of tenants.  For example, we may get an alert that tells us “based on statistics, we know there is Outlook-connectivity impact for some North America users.”  In that scenario, we might automatically post that we are investigating Outlook-connectivity issues to all tenants with users in North America so the customers can get in front of any Help Desk volume and so the IT Pros can notify their management quickly.  Moments later, as more internal telemetry fires, we might know that the impact is limited to a particular unit of capacity.  At that point, we would update the post to reflect impact only to the tenants who have one or more users on that particular capacity.  Those tenants would continue to see the Incident, but the other tenants in North America would then see the issue as a “false positive”.
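To make the scoping example concrete, here is a toy model of how a post might first go to every tenant with users in a region and then be narrowed to the tenants with at least one user on the affected capacity.  This is an illustration only, not a description of how the Office 365 systems are built; the tenant names, the `User` type, and the `unit-N` identifiers are all made up.

```python
from typing import Dict, List, NamedTuple, Set


class User(NamedTuple):
    region: str
    capacity_unit: str  # hypothetical identifier for a unit of service capacity


def tenants_with_users_in(tenants: Dict[str, List[User]], region: str) -> Set[str]:
    """Tenants that have at least one user in the given region."""
    return {name for name, users in tenants.items() if any(u.region == region for u in users)}


def tenants_on_capacity(tenants: Dict[str, List[User]], unit: str) -> Set[str]:
    """Tenants that have at least one user on the given unit of capacity."""
    return {name for name, users in tenants.items() if any(u.capacity_unit == unit for u in users)}


# Made-up tenants for illustration.
tenants = {
    "contoso": [User("North America", "unit-1"), User("Europe", "unit-7")],
    "fabrikam": [User("North America", "unit-2")],
    "tailspin": [User("Europe", "unit-7")],
}

initial = tenants_with_users_in(tenants, "North America")  # broad first post
narrowed = tenants_on_capacity(tenants, "unit-2")          # after more telemetry fires
false_positive = initial - narrowed                        # tenants whose post is cleared
```

In this toy data, `contoso` and `fabrikam` both see the first post, `fabrikam` keeps seeing the Incident after the narrowing, and `contoso` sees it resolved as a false positive.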

 

Major Incident scenario “I” is a fairly cut-and-dried scenario.  In that case, the customer knows they have impact end-to-end and Microsoft has published a corresponding incident in the dashboard.  The customer workflow would likely be to give the help desk a talk-track, to stand up automated voice response to deflect the help desk calls, to notify senior management, etc.

 

Major Incident scenario “II” is where the customer is getting help desk calls or end-to-end alerts, but Microsoft has not posted anything for the customer tenant [yet].  In this scenario, it could be a Microsoft issue that has not posted yet (for this case, we will soon let you “tell us about issues” quickly from the admin portal).  It could be a customer-side issue.  Or it could be an issue in between (e.g. an Internet Service Provider issue).  In this scenario, the customer would likely stand up an Incident bridge on their side to begin troubleshooting the scope and root cause of the issue.  The customer would likely give their help desk a heads up, and they would likely engage senior management.  The customer would pull in Microsoft support when their triage process determines that it is appropriate.

 

Major Incident scenario “III” is also fairly simple.  In that case, there are no end-to-end alerts or user calls to the help desk, but Microsoft has posted something for the customer tenant.  In that case, it could be

  1. A false positive (per the scope example above)
  2. A real issue for a feature that the customer does not care about at the moment.  For example, we may post a Service Incident for “the ability to assign licenses” and the customer is not assigning licenses right now, so it is not an issue.  But another customer might be in the middle of massive mailbox migrations, so license assignment is very important to them at that moment.
  3. A real issue for real users but not enough to trigger end-to-end alerts or help desk calls.  Perhaps we post that “1% of emails are delayed up to 2 minutes”.  In that example, the impact is probably not enough to make your end users call the help desk, nor is it severe enough to make your end-to-end monitoring fire, but the impact is real nonetheless.  Or perhaps only one of the customer’s users is on a particular unit of capacity that is actually impacted.  If only that user is on the capacity, the test account used for end-to-end alerts would not be impacted.  And if that user is on vacation, she will not call the help desk to report the impact.  Recent improvements in providing user counts for impact in the Service Health Dashboard are intended to help with this scenario; note the screenshot below:

 Service Management_2.png

 

In Major Incident scenario “III”, the customer workflow is likely to give the help desk a talk-track, to ask the help desk to be on high alert and to page the appropriate team if they start receiving calls about the issue, and to email senior management with a heads up as a safety precaution.

 

Monitoring Scenarios

In support of the Major Incident scenarios, there are six core monitoring scenarios that we need to discuss (we will add more scenarios over time):

 

A)       Does Microsoft think my tenant is impacted (Microsoft-side)?

B)       Does Microsoft think that I need to take action to get healthy or to stay healthy with my tenant (Customer-side)?

C)       Does Microsoft think that I need to be aware of an upcoming release for my tenant?  NOTE: we will discuss this bullet more in the forthcoming Evergreen Management blog post

D)       Does Microsoft think that I need to be aware of general Service Management information for my tenant?

E)       Is AAD Connect and/or ADFS working well on both ends of the service?

F)        Are the Capabilities that my users depend on working well end-to-end?

 

Scenario A’s information is available via the Service Health UI in the Admin Portal.  It is also available via the Office 365 Service Communications API under the “Service Incident” class.  There is an Office 365 Mobile Admin app that allows for Push Notifications, and there is a SCOM Management Pack for Office 365 that pulls the relevant information from the Service Communications API.  Finally, per recent announcements, we will soon let you sign up to “stay informed via your preferred channel” for Service Health information via text or email.

 

Scenarios B through D are all available in the Message Center under the “Prevent or Fix Issues”, “Plan for Change”, and “Stay Informed” categories respectively.  As with Service Incidents, Message Center information is available programmatically through the Office 365 Service Communications API using the “Message” class with filters for each category.

 

For Scenarios A through D, most enterprise customers should pull the information into their existing monitoring toolset.  If that existing toolset is System Center Operations Manager (SCOM), as noted, there is a management pack already published to pull in that information.  If the customer does not have SCOM and does not have plans to bring in SCOM, the customer simply needs to take two steps:

i)  Poll the Service Communications API every N minutes for each scenario (logging relevant information to the event log of the host machine when there are new or updated posts)

ii) Monitor the event log of the host for the pertinent events to create specific alerts mapped to downstream scenarios and workflows
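The two steps above can be sketched as a small polling loop.  This is a minimal sketch under stated assumptions, not one of the published samples: `fetch_posts` is a hypothetical callable that you would implement against the Service Communications API, the `id`, `last_updated`, and `title` field names are placeholders rather than the API's real schema, and Python's standard `logging` module stands in for the host's event log.

```python
import logging
import time
from typing import Callable, Dict, Iterable, List

log = logging.getLogger("o365.service_comms")

# Hypothetical post shape: {"id": ..., "last_updated": ..., "title": ...}
Post = Dict[str, str]


def find_changes(posts: Iterable[Post], seen: Dict[str, str]) -> List[Post]:
    """Return posts that are new or updated since the last poll, recording them in `seen`."""
    changed = []
    for post in posts:
        if seen.get(post["id"]) != post["last_updated"]:
            seen[post["id"]] = post["last_updated"]
            changed.append(post)
    return changed


def poll_forever(fetch_posts: Callable[[], Iterable[Post]], interval_minutes: float) -> None:
    """Step i: poll every N minutes and log one record per new or updated post."""
    seen: Dict[str, str] = {}
    while True:
        try:
            for post in find_changes(fetch_posts(), seen):
                log.warning("Office 365 post new or updated: %s %s", post["id"], post["title"])
        except Exception:
            # A failed poll is itself worth alerting on in step ii.
            log.exception("Polling the Service Communications API failed")
        time.sleep(interval_minutes * 60)
```

Step ii then becomes a rule in the existing monitoring tool that watches for these log records.  On a Windows host, attaching a handler such as `logging.handlers.NTEventLogHandler` is one way to route them to the event log.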

 

To simplify this discussion for customers, we publish code samples for the v1 API (v2 is still in preview) here.  The downloadable zip file includes samples.  We are also creating sample scripts for “i” and guidance to simplify “ii” (i.e. which simple rules should be created in the monitoring toolset?).  Those artifacts will be published via this blog series, so keep your eyes open for those upcoming blog posts.  NOTE that the service account used to access the Office 365 Service Communications API will require “Service Administrator” permissions.

 

For Scenario E, there are many options.  For the sake of this particular blog post, let me simply point you to the Azure Active Directory Connect Health feature within Azure AD Premium.  That solution delivers great monitoring, performance, and insight data end-to-end that can be easily integrated into the customer’s existing workflows.

 

Finally, Scenario F has many options too.  To simplify that discussion, we should start with the most-important capabilities for each workload for each customer.  By capability, I mean “what will the users say is broken when they call the help desk?”.  For Exchange Online, the main capabilities are usually the following:

·         Login via Outlook

·         Mailflow

·         Mobile Sync

·         Line of Business Applications using EWS, authenticated SMTP Send, or another protocol

For SharePoint Online, the main capabilities are usually the following:

·         Login via Browser

·         General SharePoint Features like Lists and Document uploads

·         Custom Line of Business Portals

For Skype for Business, the main capabilities are usually the following:

·         Login via Skype for Business Client

·         Instant Messaging and/or Presence

·         Voice and Video

 

Balancing Investment and Reward

In all examples, there is usually an investment versus reward discussion.  The key for the capabilities is to test end-to-end from the location(s) that the customer’s users will be connecting from.  If 80% of a customer’s users are in the headquarters location, then the customer should test end-to-end from headquarters.  The customer’s end-to-end tests should be aimed at the capabilities most important to their business, but that should be balanced with investment based on experience of issues caught or missed end-to-end.

Most enterprise customers have significant investments already in monitoring tools.  Most of those tools can easily do an http-get synthetic transaction from the customer premises to Office 365 verifying connectivity over port 443.  Many of those customers can also use their existing tools to actually log in using a script or other synthetic transaction.  For example, a simple Exchange Web Services (EWS) login script running every N minutes on-premises would verify end-to-end authentication as well as meaningful service-based activity over port 443.  Some customers invest further to build the more complex synthetics like mailflow, voice and video, etc.  And some customers choose to rent those more complex synthetics from one of the many 3rd parties who focus on building those synthetics as a business.  Regardless of the depth of synthetic for Scenario F, one must remember to run the synthetic from the location(s) where the users reside.  For example, if most users log in from the corporate network rather than the internet, it will not give great coverage for that customer to run synthetics from the internet.  That customer would want to run synthetics from the corporate network where their users work.
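For customers without a tool that already does this, the simplest http-get synthetic can be sketched with the Python standard library alone.  This is a rough illustration, not a Microsoft-provided sample: the `ProbeResult` type and function name are made up, treating any status below 400 as healthy is an assumption, and the check verifies HTTPS connectivity only, not authentication.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    status: int     # HTTP status code, or 0 if no HTTP response was received
    seconds: float  # elapsed wall-clock time for the attempt


def http_get_probe(url: str, timeout: float = 10.0,
                   opener: Callable = urllib.request.urlopen) -> ProbeResult:
    """Issue one HTTP GET and report whether it succeeded and how long it took.

    `opener` is injectable so the probe can be exercised without network access.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return ProbeResult(False, err.code, time.monotonic() - start)
    except Exception:
        # DNS failure, refused connection, timeout, TLS error, and so on.
        return ProbeResult(False, 0, time.monotonic() - start)
    return ProbeResult(200 <= status < 400, status, time.monotonic() - start)
```

Scheduling something like `http_get_probe("https://outlook.office365.com")` every N minutes from a host on the corporate network (the URL is only an example; pick the endpoints your users depend on) and alerting on `ok == False` or on unusually high `seconds` gives basic coverage over port 443.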

 

It is also important to mention that in addition to synthetic transaction monitoring, there is a growing trend in the industry to monitor real users’ experiences with a particular application or service under the “Real User Monitoring (RUM)” umbrella.  For example, what is my CEO’s experience with Outlook right now?  What is the aggregate OneDrive experience for all of my VPN users right now?  What is the Skype for Business experience like for all of my internet users right now?

 

The monitoring discussion can go on forever.  It is such an exciting topic.  Over time, we will cover specific monitoring scenarios in more depth.  The Office 365 service will continue to evolve on the monitoring front, so we will re-visit this topic often. For now, we wanted to start by framing the discussion. The true solution to monitoring Office 365 is in joining the information from the Office 365 Service Communications API with the end-to-end alerts (and help desk data) from the customer premise—the outer-most-ellipse in the diagram above.  And, of course, we want to enable all our enterprise customers to easily wire their existing monitoring toolsets to the Office 365 Service Communications API using the sample scripts and monitoring integration content found here.

 

Finally, many of our enterprise customers ask for an experienced person to come work with them like a personal trainer to shorten their learning curve to the cloud, especially on the Monitoring and Major Incident Management front.  If you want hands-on help to plan or implement, just ask your Technical Account Manager about ITSM help for Office 365, or email me at carrollm@microsoft.com (or reach me on Twitter at @carrollm_itsm).

 

What should you go do?

Here is the short version of recommended go-dos:

  1. Integrate Service Health Dashboard notifications into your existing Major Incident workflows.  Ideally, integrate Service Incident notifications into your existing tooling and workflows per above.  Download the existing samples here, and look to our future posts in this blog series for additional samples and guidance
  2. Assign someone to be accountable to triage Message Center content daily.  Ideally, integrate Message Center notifications into your existing tooling and workflows per above
  3. Join your end-to-end monitoring (and help desk triggers) to the Service Health Dashboard content provided by Office 365 per bullets “I”, “II”, and “III” above
  4. Stay tuned to this blog series for exciting innovations and new features from Office 365 on the monitoring front in the months ahead
Last update: Mar 29 2017 10:22 AM