Tenant health transparency and observability
Published May 08 2024 09:00 AM
Microsoft

In previous resilience blog posts, we’ve shared updates about the continuous improvements we’re making to resilience and reliability, including our most recent update on regionally isolated authentication endpoints and last year’s announcement of our industry-leading, first-of-its-kind backup authentication service. These and other behind-the-scenes innovations enable us to deliver consistently high rates of availability globally each month.

 

In this post, we’ll outline what we’re doing to help customers see how available and resilient Microsoft Entra really is for them, to not only hold us accountable when issues arise, but also better understand what actions to take within their tenant to improve its health. At the global level, you see it in the form of retrospective SLA reporting, which shows authentication availability exceeding our 4 9s promise (launched in spring 2021) by a wide margin and reaching 5 9s in most months. But it becomes more compelling and actionable at the tenant level: what is the uptime experience of my users on my organization’s apps and devices? Is my tenant handling surges in sign-in demand?   
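To put the "4 9s" and "5 9s" figures in perspective, here is a minimal Python sketch that converts a number of nines into an unavailability budget. It assumes a purely time-based reading of availability over a 30-day month, which is a simplification chosen for illustration; the actual SLA calculation may be defined differently, and the function name is hypothetical.

```python
# Time-based reading of "N nines" of availability (an illustrative simplification;
# the real SLA may be measured differently).
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_budget_minutes(nines: int, period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the maximum unavailable minutes allowed at `nines` nines of availability."""
    unavailability = 10.0 ** (-nines)  # 4 nines -> 0.0001, 5 nines -> 0.00001
    return period_minutes * unavailability


for n in (4, 5):
    print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes per 30-day month")
```

Under these assumptions, 4 9s allows roughly 4.3 minutes of unavailability in a 30-day month, and 5 9s allows well under one minute.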

 

We often hear from customers about the effect on resilience insights when they move to the cloud. In the on-prem world, identity health monitoring occurred onsite and with tight control; operational awareness happened entirely within a company’s first-party IT department. Now, we need to achieve that same transparency or better in an outsourced, cloud-based identity service and with a federated set of dependencies.  

 

IT departments and developers are working hard to ensure each of their users maintains seamless, uninterrupted access that doesn’t compromise security. Enabling access for the right users with minimal friction while stopping intrusions and risk is critical to keep the world running. When an organization outsources their identity service to Microsoft, they expect us to acknowledge degradations when they happen, then take accountability to learn and continuously improve from those events. We also recognize that human-driven communication can only take us so far.   

 

To meet these challenges, we’re increasingly embracing granular monitoring and automation. We start from the assumption that the unexpected will find a way of happening in any complex system, no matter how resilient it is. Beyond resilience, we must detect incidents, respond to them effectively, and improve as we go—and help our customers do the same. You see examples of this approach both in our rollout of in-tenant health monitoring and in our investments behind the scenes aimed at fast incident detection and communication.  

 

Let’s start with out-of-the-box automated health monitoring in premium tenants. Tenant-level health monitoring empowers customers to independently understand the quality of their users’ experiences with authentication and access. It also sets the stage to prompt tenant administrators with actions they can take to investigate and reduce disruptions, all from the Microsoft Entra admin center or through Microsoft Graph API calls.

 

We’ve taken a step in this direction by introducing a group of precomputed health metric streams that enable our premium customers to watch key authentication scenarios. This is an early milestone in our investments to enhance visibility into tenant health and service resilience. These new health metrics isolate relevant signals from activity logs and provide precomputed, low-latency aggregates every 15 minutes for specific high-value observability scenarios.
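To make the idea of a 15-minute aggregate concrete, here is a minimal Python sketch that floors event timestamps to 15-minute windows and counts successes per window. It is not how the service computes its metrics; the `(timestamp, succeeded)` event shape, the function names, and the sample data are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_MINUTES = 15


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its 15-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % WINDOW_MINUTES, second=0, microsecond=0)


def aggregate_successes(events):
    """Count successful events per 15-minute window.

    `events` is an iterable of (timestamp, succeeded) pairs (hypothetical shape).
    """
    counts = Counter()
    for ts, succeeded in events:
        if succeeded:
            counts[window_start(ts)] += 1
    return dict(sorted(counts.items()))


# Made-up sample events, for illustration only.
sample_events = [
    (datetime(2024, 5, 8, 9, 1, tzinfo=timezone.utc), True),
    (datetime(2024, 5, 8, 9, 14, 59, tzinfo=timezone.utc), True),
    (datetime(2024, 5, 8, 9, 15, tzinfo=timezone.utc), False),
    (datetime(2024, 5, 8, 9, 29, tzinfo=timezone.utc), True),
]

for start, count in aggregate_successes(sample_events).items():
    print(start.isoformat(), count)
```

The precomputed streams remove the need to run this kind of aggregation yourself for the supported scenarios.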

 

With their granularity and scenario-specific focus, health metrics go a step beyond the monthly tenant-level SLA reporting we released in 2023. Precomputed health metrics also supplement the activity log data that we’ve been providing and continue to improve on. With sign-in logs, customers can build their own computed metrics to monitor, like isolating a specific sign-in method to watch for increases in success and failure. With our new precomputed streams, customers can snap to Microsoft-defined indicators of health, take advantage of features we’re developing at scale, and dive into activity logs for deeper investigations. We encourage customers to make use of both options to get a full picture.  
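As an illustration of a customer-built metric, the following Python sketch isolates one sign-in method and checks whether its failure rate rose between a baseline period and the current period. The record fields (`method`, `succeeded`), the 10-point threshold, and the sample data are assumptions made for this example; real sign-in log records have a richer schema and would need to be mapped into this shape first.

```python
def failure_rate(records, method):
    """Return the failure rate for one sign-in method, or None if it has no records."""
    relevant = [r for r in records if r["method"] == method]
    if not relevant:
        return None
    failures = sum(1 for r in relevant if not r["succeeded"])
    return failures / len(relevant)


def failure_rate_increased(baseline, current, method, min_increase=0.10):
    """Flag a rise of at least `min_increase` (absolute) in the method's failure rate."""
    base = failure_rate(baseline, method)
    cur = failure_rate(current, method)
    if base is None or cur is None:
        return False  # not enough data to compare
    return cur - base >= min_increase


# Made-up records: 1 of 10 SMS sign-ins fails in the baseline, 4 of 10 in the current period.
baseline = [{"method": "sms", "succeeded": i != 0} for i in range(10)]
current = [{"method": "sms", "succeeded": i >= 4} for i in range(10)]
current.append({"method": "fido2", "succeeded": True})  # other methods are ignored

print(failure_rate(baseline, "sms"), failure_rate(current, "sms"))
print(failure_rate_increased(baseline, current, "sms"))
```

A fixed threshold is the simplest possible check; a production monitor would more likely account for volume and normal variation.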

 

During the initial public preview offering, we’re releasing health metric streams related to maintaining highly available:   

 

  • Multifactor authentication (MFA)  
  • Sign-ins for devices that are managed under Conditional Access policies  
  • Sign-ins for devices that are compliant with Conditional Access policies  
  • Security Assertion Markup Language (SAML) sign-ins   

 

We’re starting with authentication-related scenarios because they are mission critical to all our customers, but other scenarios in areas like entitlement management, directory configuration, and app health will be added in time along with intelligent alerting capabilities in response to anomalous patterns in the data. We’re publishing the health metrics in Microsoft Entra admin center, Azure Portal, and M365 admin center, as well as in Microsoft Graph for programmatic access and integration into other monitoring pipelines.  
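For programmatic access, a request might be assembled along the lines of the Python sketch below. The endpoint path, the function name `getMetricsForMfaSignInSuccess`, the three parameters, and the response fields are assumptions based on the beta `serviceActivity` reports API and are not stated in this post; verify them against the Microsoft Learn documentation before relying on them. The sample response is made up, and the sketch deliberately makes no network call and omits authentication.

```python
import json
from datetime import datetime, timezone

GRAPH_BETA = "https://graph.microsoft.com/beta"  # assumed: metrics exposed on the beta endpoint


def build_metrics_url(function_name, start, end, interval_minutes=15):
    """Build a metrics request URL (path and parameter names are assumptions)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start_utc = start.astimezone(timezone.utc).strftime(fmt)
    end_utc = end.astimezone(timezone.utc).strftime(fmt)
    return (
        f"{GRAPH_BETA}/reports/serviceActivity/{function_name}("
        f"inclusiveIntervalStartDateTime={start_utc},"
        f"exclusiveIntervalEndDateTime={end_utc},"
        f"aggregationIntervalInMinutes={interval_minutes})"
    )


def parse_metrics(body: str):
    """Extract (interval start, value) pairs from a response body (assumed shape)."""
    return [(item["intervalStartDateTime"], item["value"]) for item in json.loads(body)["value"]]


url = build_metrics_url(
    "getMetricsForMfaSignInSuccess",  # assumed function name
    datetime(2024, 5, 8, 9, 0, tzinfo=timezone.utc),
    datetime(2024, 5, 8, 9, 30, tzinfo=timezone.utc),
)
print(url)

# Made-up response body, for illustration only.
sample_body = '{"value": [{"intervalStartDateTime": "2024-05-08T09:00:00Z", "value": 120}]}'
print(parse_metrics(sample_body))
```

An actual call would send this URL with a bearer token that carries the appropriate reporting permission, then feed the parsed values into an existing monitoring pipeline.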

 

For more information about how to access the health monitoring metrics, visit the Microsoft Learn documentation.  

 

 

Figure shows the Scenario monitoring landing page and the Sign in with MFA scenario details

 


 

Even as in-tenant observability improves, customers will still rely on traditional incident communications when Microsoft-side issues happen. Like all service providers, we push messages about incidents to affected customers and post service health announcements to a website and communications feed in Azure. However, when this approach relies solely on hand-crafted service monitors and human-driven communications, it has limitations. Customers are right to have concerns about the timeliness of communication and the monitoring coverage itself.   

 

To address this challenge, we’re building increasingly sophisticated default monitoring packages attached to automated communications. The early results are promising. We’ve been able to bring times to notify customers about incidents down significantly, with service degradations and downtime being communicated within about 10 minutes of auto-detection. We’re also catching service degradations increasingly early by investing in monitoring, the results of which we track by watching customer-reported incident volumes.   

 

The best incidents are the ones that never happen. Our goal is to find and mitigate problems before they impact our customers. So, in addition to these advances in detection and communication, we continue to prioritize building systematic resilience measures to prevent service degradations and outages or auto-mitigate them before they affect a customer environment. We will share more on this in a future blog.

 

To continuously improve our services in partnership with our customers, we’re combining improvements in our service-level safety net with tenant-level monitoring. We’re also expanding our monitored scenarios, boosting our out-of-the-box monitoring intelligence, and speeding up our communication. Plus, integration with Azure, M365, and Microsoft Graph ensures that Microsoft Entra observability can happen wherever it’s needed. Together, we’re making sure everyone can work securely and seamlessly. 

 

With our already strong foundation of availability and resilience, security-enhancing recommendations, and mature service monitoring and incident communications, we’re excited to see these new capabilities take Entra health transparency to the next level.    

 

Igor Sakhnov  

CVP, Microsoft Identity & Network Access Engineering   

 

 


 

Learn more about Microsoft Entra  

Prevent identity attacks, ensure least privilege access, unify access controls, and improve the experience for users with comprehensive identity and network access solutions across on-premises and clouds. 

Last update: May 15 2024 09:09 AM