Demystifying Service Health for Microsoft Intune

Oct 30, 2018

First published on TechNet on Jun 28, 2017
In this post, we share details on service-impacting incident response and service health for Intune. While we always want the service to be 100% available, there are infrequent times when we experience a service incident. The goal of this blog post is to demystify the experience. Below, we walk you through:

Who’s working on an incident
How we detect a service-impacting incident
Where you go to check on the status of service-impacting incidents
Where we post on service changes
By design, bugs, and other service non-incidents
Other ways to get service notices

Who’s working on an incident?

First, it’s important to describe who you may interact with or who behind the scenes is working towards resolution:

Customer support engineers (support) – these are the engineers who are available 24x7 both online and by phone for paid and trial subscriptions. Standard support is provided at no cost to customers. Premier Support customers incur charges for procedural questions (for example, how to configure an Intune feature). You can read more details on contacting support at https://aka.ms/intunesupport.

Our engineering team (software engineers) – the engineering team is assigned 24x7 on-call rotation shifts so that there’s representation across all service areas. A service outage with customer impact has an assigned Incident Manager, Impact Manager, Sr. Incident and Sr. Impact Managers, Service Engineers and Software Engineers. The Incident Manager drives the mitigation efforts while the Impact Manager represents the voice of the customer and drives clear customer impact communications. The Sr. Incident and Impact Managers are part of Intune’s senior leadership team and are there to help, unblock, and advise as needed. The Service and Software Engineers focus on immediate mitigation. While the engineering team working on the incident may not be visible to you, they are working behind the scenes to restore service as quickly as possible.

Microsoft field and partner contacts – While a technical solution specialist (TSP), a premier field engineer, an account manager, etc., or a partner may not contribute to incident resolution, they may help you report it or keep you up-to-date on the incident status.

How we detect a service-impacting incident

There are three main channels by which we detect a service-impacting incident:

Scenario 1: Intune detects a service outage

Intune has hundreds of monitors that track customer-facing components and back-end service health and responsiveness. Every hour of every day we respond to alerts on a timeline based on their severity. Something like a certificate expiring in 90 days will have a ticket with a lower severity than a spike in login failures. If the alert is categorized as Severity 0 through 2, our incident system immediately opens a ticket and auto-calls the 24x7 software engineers to review and respond to the alert.
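
As a rough illustration only, the severity rule described above can be sketched in a few lines of Python. This is not Intune's actual incident system: the names Alert and route_alert, the returned fields, and the example severity numbers are all assumptions made for this sketch. The only detail taken from this post is that Severity 0 through 2 triggers an immediate ticket and an automatic call to the on-call engineers.

```python
from dataclasses import dataclass

# From the post: Severity 0 through 2 opens a ticket immediately and
# auto-calls the on-call engineers. Everything else here is hypothetical.
PAGING_SEVERITIES = range(0, 3)


@dataclass
class Alert:
    name: str
    severity: int  # 0 is the most severe


def route_alert(alert: Alert) -> dict:
    """Return the (hypothetical) actions taken for a monitoring alert."""
    urgent = alert.severity in PAGING_SEVERITIES
    return {
        "alert": alert.name,
        "open_ticket_immediately": urgent,
        "auto_call_on_call": urgent,
        # Lower-severity alerts are handled on a severity-based timeline.
        "handling": "immediate" if urgent else "scheduled by severity",
    }


# Example severity values below are assumptions, not real Intune settings.
print(route_alert(Alert("spike in login failures", 1)))
print(route_alert(Alert("certificate expiring in 90 days", 4)))
```

In this sketch the login-failure spike pages the on-call engineer, while the certificate reminder is simply queued according to its severity.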

Scenario 2: A customer reports an issue

A customer may see an issue and contact support to troubleshoot. The support agent will gather some baseline information. This helps us determine whether it’s something in the configuration of your environment, another service, or something in our environment. The support agent may work directly with you to close the ticket, or they may also escalate up the support tier levels.

Scenario 3: A supporting service has an outage

Intune is not a standalone service. For example, Intune’s company portal app can be downloaded from the Google Play and iTunes stores. If one of those stores has an outage, then our customers' ability to download the company portal app, in this example, could be impacted. The Intune software engineers and Microsoft have external relationships and support paths set up with each of the service-supporting companies, and we work through those contacts and processes when there’s a service-impacting incident. Internally at Microsoft, we have cross-team incident alerting built into the response process. Rarely, you may see a reference to another service’s outage posted on our service health dashboard. In this scenario, our team is working behind the scenes to see if we can make a service change to minimize the service outage. For example, if it’s a regional outage of a supporting service, we may choose to offload to another region.

Where do you go to check service-impacting incidents?

You have two options. First, look at the Tenant Status blade in https://portal.azure.com, where you'll see service health posts from the past 30 days. The other option is to head to the Office 365 Admin Console and select health to look at service health. Both sites draw on the same service data; which one you use depends on where you administer the service.

There you can see your tenant’s health across services you own, which could include Intune, Office 365, and CRM. Note that Intune used to have its own standalone Service Health page, but we merged it with Office’s several years ago because many of you told us that you wanted your service health across Microsoft IT Pro services in one location.

There are multiple roles that will provide access to the Service Health Dashboard – you don’t have to assign everyone a global admin role. The roles that have access are:

Global Admin
Service Admin

At the time of writing, Intune is healthy. If it weren’t healthy, you’d see a different picture than the one below, as described in the article here.

With the Intune service, our goal is to post within one hour of determining which customers are impacted. The impact could be limited to a specific scale unit (where your account resides), a region, or customers using a specific feature set. However, there are a few scenarios where we don’t feel it’s appropriate to post:

We believe the impact is limited to one customer, or a very small subset of customers, that we’ve already been in touch with, typically raised through a support case.
There’s a supporting service outage outside of Microsoft. If iTunes is down and there’s press on the event, we often don’t post (or we’ll direct you to another incident number/service page).
There was a service functionality change announced which changes behavior and is not an incident, but a service improvement (see more on that below).
Planned maintenance (see more on that below).
An incident closes within minutes and before we can even get the communication posted. Sometimes just restarting a service is all it takes to restore immediate service.
We will not post when it’s your environment causing the service issues, such as network throttling.
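
The exceptions listed above amount to a simple checklist. The following Python sketch restates them as one, purely for illustration. The Event class, its field names, and the functions are hypothetical and are not part of any Intune or Office 365 tooling; real posting decisions are made by people during an incident.

```python
from dataclasses import dataclass, asdict


@dataclass
class Event:
    # One flag per "we don't post" scenario from the list above.
    # All field names are made up for this sketch.
    limited_to_customers_already_contacted: bool = False
    external_supporting_service_outage: bool = False
    announced_service_change: bool = False
    planned_maintenance: bool = False
    resolved_before_post_possible: bool = False
    caused_by_customer_environment: bool = False


def reasons_not_to_post(event: Event) -> list:
    """Return the names of every exception that applies to this event."""
    return [name for name, applies in asdict(event).items() if applies]


def should_post(event: Event) -> bool:
    """Post to service health only if no exception applies."""
    return not reasons_not_to_post(event)


regional_outage = Event()
throttled_network = Event(caused_by_customer_environment=True)
print(should_post(regional_outage))
print(reasons_not_to_post(throttled_network))
```

An event with no flags set would be posted, while one caused by the customer's own environment (for example, network throttling) would not.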

NOTE: There are Incidents and Incident Advisories. An Incident is reserved for Sev0 incidents, which are extremely rare. All other incidents are categorized as Incident Advisories and show up on the Advisories tab. From Intune’s standpoint, these are both incidents, and you will see an exclamation point (or another indicator) on the service health dashboard showing that something’s up.
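
Put another way, the dashboard category follows directly from the severity. A minimal sketch, assuming severity is tracked as an integer where 0 is the most severe (the function name is hypothetical):

```python
def dashboard_category(severity: int) -> str:
    """Map an incident severity to the category described in the note above."""
    if severity < 0:
        raise ValueError("severity must be 0 or greater")
    # Only Sev0 is shown as an Incident; everything else is an advisory.
    return "Incident" if severity == 0 else "Incident Advisory"


for sev in (0, 1, 2):
    print(sev, dashboard_category(sev))
```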

Where we post on service changes

To stay informed about Intune service changes, again head to either the Tenant Status page in https://portal.azure.com or the Office 365 Admin Console and log in with your Intune admin credentials. On the Tenant Status blade, you'll see the messages when you land on that blade. For the Office 365 Admin experience, select message center on the landing page, or on the left-hand navigation, click health -> Message Center. There you’ll find messages about new features, planned changes, and planned maintenance with downtime expected.

In the test tenant screenshot below, there are a few things to call out:

If you should take urgent action by a specific date, you’ll see a triangle with an exclamation point and a date to act by.
The category “Stay informed” is used the same way it’s used in Office – for new features, or for example when we publish updates to you.
The category “Plan for change” is used when there’s an upcoming change in feature functionality. This could be as simple as a UI change that we’re notifying you about so you can update your end user guidance. Or it could be 30 days’ notice of a change that’s coming. You may not use the feature that’s changing, but we tend to over-communicate rather than under-communicate since we don’t know your intent.
Planned maintenance (not an incident, but scheduled downtime) is posted in the message center based on customer requests to put all messages in the same spot. We use planned maintenance when we expect there to be downtime. When possible, we schedule planned maintenance outside of your core region’s working hours.

NOTE – You can choose to see, or get emailed, only Microsoft Intune announcements. Use the edit message center preferences option to select which services you follow and whether you would like to receive weekly emailed digest summaries.

By design, bugs, and other service non-incidents

Finally, there are times that we’re not going to post because something has been released in a way that’s not an incident, but rather it’s by design. In addition, sometimes we don’t post an incident because it’s actually a software bug that will be resolved either out of band or with the next build. If you feel you are impacted by an incident but there is not an incident posted to your service health dashboard, please contact support and they will assist you.

Other ways to get service notices

You have two additional options for accessing service health outside of the O365 console:

The Office 365 Admin App is available for mobile devices from the app stores: https://products.office.com/en-US/business/manage-office-365-admin-app. You can set up toast notifications, file support tickets directly from the app, and check messages and service health. You can even forward message center posts via email. Say you’re on vacation and get a toast notice about a plan for change – you can forward that message directly to someone else to handle.