Service and Normal Incident Management for Office 365

Anne Michels · Jul 24 2017

The time has come for us to discuss the Service Desk and Normal Incident Management. I am very excited about this post because it will complete the baseline, core content for the series. As a reminder, there are four questions that I get most often about Office 365 and Service Management:

How do I monitor my Office 365 experience and which pieces of the puzzle will come from Microsoft?

How do I prepare my organization to absorb (and benefit from) the new, evergreen features from Microsoft?

IT Agility to Realize Full Cloud Value - Evergreen Management

What are the IT Pro and IT Organizational impacts of Office 365?

Evolving IT for Cloud Productivity Services

How do I get my Service Desk ready to support Office 365?

This blog post!

What is a Service Desk? What is Normal Incident Management?

Most of us are aware of the term "Help Desk". That term implies a reactive function that a user would call if they have an issue. "Service Desk" is the industry term for the one-stop-shop function for IT engagement - both reactive and proactive. The Service Desk would take the "I need help fixing XYZ" calls along with "will you add another network cable to my office?" and "we need to order a new PC for an employee who is starting next month" requests, and the Service Desk would be accountable for the user experience with those incidents and requests.

In a previous post, we covered Major Incident Management. Major Incident Management deals with incidents that meet certain criteria in terms of scope or criticality. In an enterprise IT shop, Major Incidents are usually coordinated by the "Network Operations Center (NOC)" or by a Crisis Management team. Most enterprises will establish a crisis phone bridge for the duration of a Major Incident, and all-hands will be on-deck until the incident is resolved. On the other end of the spectrum are Normal Incidents. Normal Incidents are about the routine, end-user calls. "I am getting prompted for my password continually" or "I cannot open my spreadsheet." Of course, those routine calls add up in terms of cost-to-IT, cost-of-user-time-to-engage-IT, and lost-user-productivity costs for the business. Yet, the rigor and the response are generally less for Normal Incidents than they are for Major Incidents.

With respect to Office 365, there are three key areas that should be considered for Service Desk and Normal Incident Management:

Leveraging Microsoft's investments in the Normal Incident flow
Ensuring the right accountability model
Understanding and acting upon trends

Leveraging Microsoft's investments in the Normal Incident flow

Within enterprise IT, the Normal Incident Management flow is usually a tiered system. For example, there may be a "Tier 1" team who takes the user call, logs a ticket, and perhaps does a little bit of follow-the-instructions recovery. Then there may be a "Tier 2" team who has slightly elevated permissions so they can take additional follow-the-instructions recovery steps. And then, there is a "Tier 3" or engineering team who is the top point of escalation within the enterprise for a given technology; these folks have top-level permissions for the application or service in question, and they do not follow instructions; rather, they write the instructions for the lower tiers. Of course, some customers put their own spin on those tiers: maybe they have tier 1 only log the ticket, maybe they have a "Tier 1.5" that has slightly more responsibility, maybe they split "Tier 3" into multiple teams and have a higher-permissioned "Tier 4" team, and maybe they have an automated approach to recurring incidents that they call "Tier 0".

Regardless of any business-specific tweaks, most enterprises have some sort of tiered system. The user call goes to a tier (or tiers) where the IT representatives simply follow instructions that have been created for them. If the instructions do not resolve the issue, or if the timer elapses for their tier, then the IT representative simply escalates to the tier (or tiers) above them. Eventually, they reach the top tier where there are no instructions to follow. The top tier will then do their best to fix the issue. Eventually, if the same issue gets escalated to the top tier over and over, the top tier will spend the time to document the recovery steps into instructions (manual or automated) for the lower tier to use in the future. Some enterprises use data to drive the virtuous cycle. Some enterprises simply "wing it" based on the whim of the top tier staff. Either way, the reality is that in most enterprises, this sort of virtuous cycle exists when it comes to Normal Incidents.
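The tiered flow and the virtuous cycle described above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the `Tier` class, the `route_ticket` and `issues_to_document` functions, the timer values, and the sample issues are all hypothetical and do not come from any Microsoft tool or ticketing system.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Tier:
    name: str
    time_limit_minutes: int = 0                  # timer before escalating (unused for the top tier)
    runbook: dict = field(default_factory=dict)  # issue -> minutes the documented fix takes


def route_ticket(issue, tiers):
    """Return the tier names a ticket passes through; the last one resolves it."""
    path = []
    for tier in tiers[:-1]:                      # lower tiers only follow instructions
        path.append(tier.name)
        minutes = tier.runbook.get(issue)
        if minutes is not None and minutes <= tier.time_limit_minutes:
            return path                          # resolved within this tier's timer
    path.append(tiers[-1].name)                  # top tier: no instructions, best effort
    return path


def issues_to_document(resolved, top_tier, threshold):
    """Issues that reached the top tier at least `threshold` times."""
    counts = Counter(issue for issue, path in resolved if path[-1] == top_tier)
    return sorted(issue for issue, n in counts.items() if n >= threshold)


# Hypothetical tiers and runbooks, for illustration only.
tiers = [
    Tier("Tier 1", 10, {"password prompt loop": 5}),
    Tier("Tier 2", 30, {"rebuild Outlook profile": 20}),
    Tier("Tier 3"),
]

tickets = ["password prompt loop", "rebuild Outlook profile", "calendar sync", "calendar sync"]
paths = [(issue, route_ticket(issue, tiers)) for issue in tickets]
recurring = issues_to_document(paths, "Tier 3", threshold=2)
print(recurring)
```

Anything that shows up in `recurring` is a candidate for the top tier to turn into instructions for a lower tier, which is the data-driven version of the cycle rather than the "wing it" version.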

I always joke and ask customers "how many Fortune 500 companies are there?" Of course, the answer is "500". With millions of tenants on Office 365, simple mathematics tells us that many of those customers are small businesses. Logic tells us that many of those small businesses do not have formal IT staff. Microsoft invests heavily in automation and tooling to enable small businesses to successfully manage their tenants and their users even without formal IT skills.

Here's the thing: Microsoft has invested in a lot of tools and automation, but we enterprise IT folks do not think that way. We use our virtuous cycle as described above. The top-tier team is the one who figures things out. But is there an opportunity for the top-tier team to "cheat" a little bit and use Microsoft's investments in tooling and automation to enable their lower tiers? Why not use the tools and automation from Microsoft to help that virtuous cycle go faster and better? Why not consider using the Support and Recovery Assistant (SaRA) tool and the Remote Connectivity Analyzer at the lower tiers immediately? Why not give the lower tiers "Service Administrator" permissions so they can use the admin portal to see if Microsoft has published something on the Service Health Dashboard that is relevant to the user's call? Why not consider letting the lower tiers use the Support Experience within the Admin Portal themselves? Why not use the support wizard from Skype at the lower tiers? Why not implement the Skype Call Quality Dashboard (CQD) experience for all tiers? Why not have the top-tier employees use the Admin Portal's Support experience to search for the most recent content to see if that content can be pushed down to the lower tiers (see the screenshot below)?

[Screenshot: blog post.jpg]

This point is usually the easiest point to make to help customers be successful with Office 365 end-user support. Customers simply need to take the investments that Microsoft has made (and continues to make) in the admin portal, in tooling, and in automation, and use them to help their existing tiered virtuous cycle (using their business policy lens as a guide) in order to be more successful with end-user support for Office 365 more quickly. Many customers have told me that they actually have higher (better) resolution rates at their lower tiers with Office 365 than they had on-premise once they integrate Microsoft's investments in tooling and automation into their tiered workflows.

Some customers then take the virtuous cycle one step further and push some of the automation all the way to the end users. For example, some business users are more technical than others, so perhaps consider having some of your end users run the Support and Recovery Assistant (SaRA) tool before calling the IT Service Desk. Or perhaps IT works with their "super users" within each department already, and IT can enable those super users with the same tools that they give their lower tiers. Once that thought process starts, the game changes, because not only are the issues being resolved at lower Service Desk tiers (lower cost), but the issues start getting resolved by the end users (no cost to the Service Desk and quicker resolution for the user). The goal for IT should be to use Microsoft's investments in tooling and automation to drive resolution to lower tiers and ultimately to the end user.

To recap, I encourage everyone to consider the following recommendations specific to leveraging Microsoft's investments in your Normal Incident flow:

Have your top-tier resources use the admin portal and the tooling to solve issues so they stay up-to-date on new innovations and so they can decide what tooling they want to push down to lower tiers
Ensure that lower-tier resources have access to the admin portal and tooling (likely using least-privilege permissions of "Service Administrator")
Consider pushing resolution to the end-users where it makes sense (e.g. SaRA tool prior to calling the IT Service Desk for Outlook issues?)

Ensuring the right accountability model

In most enterprises, the Service Desk carries metrics. Whether the Service Desk is in-sourced, out-sourced, or a combination, there are usually metrics involved. Typically, those metrics hinge on a) number of calls resolved in the period and b) how long the calls took to resolve (i.e. cost of the call). Those metrics often are then broken down based on the tier in question. For example, Tier 1 may need to solve X calls per month, and they may only be allowed to follow their instructions for 10 minutes per call. I am making generalizations here, but most enterprise Service Desks carry these fundamental metrics.

Here's the issue: If I am the Service Desk Director, and my bonus is tied to the metrics of "how many calls did I resolve" and "how fast did I resolve them", then I am not necessarily incented to eliminate easy-to-solve calls, because eliminating calls works against my call volume metrics. Nor am I necessarily incented to drive down the duration of the call as long as I am hitting my target duration metrics. If I get 100 calls a month that I solve by simply following instructions to rebuild an Outlook profile, which may only take a few minutes, then I may be happy to keep solving those 100 calls every month because they help me achieve my metrics. And if that number goes to 150 calls per month, I may be happy about that fact because my numbers look even better. Do you see the issue? Most enterprise Service Desks are set up to perpetuate the status quo (or a worsening volume) of user calls rather than being incented to do the work to eliminate the calls.

In a perfect world, the Service Desk should get rewarded more for working with the top-tier escalation teams to eliminate classes of user calls altogether (driving the count of calls down) than for resolving that class of call over and over. The Service Desk should also get rewarded for working with the top-tier teams to reduce the duration-to-resolve (cost to IT, impact duration to user) for classes of user calls rather than always being satisfied with the current resolution times. We should all want to have metrics that drive the desired outcomes: fewer calls and shorter duration, which mean lower IT costs and happier, more productive users. But, in most IT shops, we are measuring the exact opposite: how many calls did we resolve (an incentive to stay the same or increase) and how quickly did we solve them (no incentive to continually drive duration down).

In a purist world, I should target metrics to have zero calls (because we've eliminated all known issues). And every call that does come in should be a brand new issue that the Service Desk has never seen before, and therefore, resolution time may actually be longer because we have to work with the top-tier teams to find the root cause. Then, we should eliminate that issue too so we never get the call again.

Metrics drive behavior. Are your Service Desk metrics driving the outcomes that you want? Or do you need to tweak your metrics, at least for the cloud workloads, in order to incent the right outcomes? (especially in light of the "virtuous cycle" discussion above)

Understanding and acting upon trends

In my ~15 years working with customers on what is now called Office 365, there are only two issues that pop up with respect to Office 365 and Service Desk/Normal Incident Management: I) is raw call volume going up? II) are escalations to the higher, expensive tiered resources going up? Having the right accountability/metrics and using Microsoft's investments will completely change the game with those two points. However, there is still one possible "gotcha", and that is when perception does not meet reality. But, of course, perception is everything. Therefore, we all need to prepare ourselves to have an objective discussion rather than a subjective, perception discussion.

Often, enterprise Service Desks do not have good call breakdown data per workload on-premise. The Service Desk may not know how many calls per month they get for Outlook on-premise, for example. That may be because their ticket taxonomy does not lend itself to good reporting, and/or it could be because the humans entering the data are making mistakes with data entry. Either way, if there is not good data, we are destined for subjective discussions. When customers without good data move to the cloud, for example when they move to Exchange Online, they likely still would not have good data about their call volumes for Outlook. In that case, if the Service Desk Director tells the CIO that call volume has gone up, who is the Office 365 Project Lead to disagree with that point without objective data? If there is no data (historical and current), then who knows what the volume actually is? And more importantly, how does the Project Lead know when to dig in to the call volumes and the root causes thereof in order to eliminate classes of issues or to drive down resolution durations? And of course, the same is the case with escalation rate. If there are no historical trends for "escalations to top tiers for Outlook issues", then how is one to know if the volume of escalations has gone up so the project team can intervene? If the Tier 3 team manager tells the CIO that escalation rates from lower tiers have increased, is the objective data there to enable that discussion?

I encourage everyone to think through the following recommendations specific to understanding and acting upon trends:

It is advisable to have a view of historic and current metrics trends on call volume, duration and escalation rates per workload.
One goal should be to have better count and duration metrics in the cloud than on-premise. As I've noted, achieving better results with the cloud is possible very quickly with the approaches described above.
Another goal should be to immediately catch any spikes in volume or escalation rate so the project team can intervene. (Note, these recommendations could be applied to any IT Service and not just to Office 365 or other cloud workloads)
It is advisable to pivot the metrics on "newly migrated users" versus "run state users" so the project team can dig into migration steps (e.g. "let's add additional end-user-training for department X for scenario") where appropriate.
It is advisable to include a normalized view for each metric rather than just the total. For example, if we took 100 calls last week and we took 1000 calls this week, did we really get worse? If we had 100 calls for 10 users last week (10 calls/user) and we had 1000 calls for 1000 users this week (1 call/user), we actually got 10x better week over week, but the aggregate number would imply that we got 10x worse. It is advisable to show both the total and normalized-per-user metrics.
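The normalization arithmetic in the last recommendation is easy to check with a short script. The sketch below uses the numbers from the example above; the `calls_per_user` helper is a hypothetical name, not part of any reporting tool.

```python
def calls_per_user(total_calls: int, active_users: int) -> float:
    """Normalize raw call volume by the number of users on the service."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    return total_calls / active_users


# Numbers taken from the example in the recommendation above.
last_week = {"calls": 100, "users": 10}
this_week = {"calls": 1000, "users": 1000}

total_growth = this_week["calls"] / last_week["calls"]                  # 10.0: looks 10x worse
rate_last = calls_per_user(last_week["calls"], last_week["users"])      # 10.0 calls/user
rate_this = calls_per_user(this_week["calls"], this_week["users"])      # 1.0 calls/user
improvement = rate_last / rate_this                                     # 10.0: actually 10x better

print(f"total calls grew {total_growth:.0f}x")
print(f"calls per user went from {rate_last:.0f} to {rate_this:.0f} ({improvement:.0f}x better)")
```

Showing both figures side by side keeps the discussion objective: the total explains the Service Desk's workload, while the per-user rate shows whether the user experience improved.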

A trend report may resemble the following example:

[Image: blog post 2.png]
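As a rough textual stand-in for such a report, the sketch below pivots weekly calls by "newly migrated" versus "run state" users, shows both total and per-user numbers, and flags a spike when the per-user rate jumps. The figures are made up for illustration, and the `build_report` function and the 1.5x spike threshold are assumptions of mine rather than anything prescribed for Office 365.

```python
# (week, cohort, calls, users): made-up numbers for illustration only
weekly = [
    ("W1", "newly migrated", 40, 200),
    ("W1", "run state", 60, 1800),
    ("W2", "newly migrated", 120, 300),
    ("W2", "run state", 58, 1900),
]


def build_report(rows, spike_factor=1.5):
    """Add calls-per-user and a spike flag (rate above spike_factor x prior week) per cohort."""
    report = []
    previous_rate = {}
    for week, cohort, calls, users in rows:
        rate = calls / users
        prior = previous_rate.get(cohort)
        spike = prior is not None and rate > prior * spike_factor
        report.append({
            "week": week,
            "cohort": cohort,
            "calls": calls,
            "calls_per_user": round(rate, 3),
            "spike": spike,
        })
        previous_rate[cohort] = rate
    return report


report = build_report(weekly)
for row in report:
    flag = "  <-- investigate" if row["spike"] else ""
    print(f'{row["week"]}  {row["cohort"]:<15} {row["calls"]:>5} {row["calls_per_user"]:>7}{flag}')
```

In this made-up data, the newly migrated cohort is the one that gets flagged in the second week, which is the cue for the project team to look at the migration steps for that group.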

Wrapping it up

I hope that all of the points and ideas above make sense. My hope with this blog post is that everyone understands and considers the points that I am making. I do not anticipate that everyone will agree with each point. I understand that everyone's processes exist in their current form for very good reasons. Some people may ask "why would I change my metrics for Office 365?" Others may ask "why would I use Microsoft's tools in my Service Desk flow?" Others may say "my current Service Desk scorecard is perfect and has been in place for N years." To all of those questions and points, my response is "you know your business better than I will ever know your business, but I do encourage you to spend the time pondering my points above to see if they can help you with Office 365 and with your approach to Service Desk and Normal Incident Management in general. The points and ideas that I am making above are applicable far beyond Office 365 and far beyond Microsoft. My points are really about the modernization of IT." Thanks for reading the post, and happy thinking! Let's chat on Twitter (@carrollm_itsm).

Next up...I'll write a post for a topic under the "Business Consumption and Productivity" umbrella (see this post for context).
