Advances in Azure AD resilience
Published Oct 28 2022 12:00 PM
Microsoft

In today’s world, resilience and security are fundamental requirements of an enterprise-grade solution such as Azure Active Directory (Azure AD), now part of Microsoft Entra. We place the highest priority on the resilience and enterprise-grade security of Azure AD and continually invest to ensure we meet our customers’ expectations.
 
We’ve been regularly sharing updates on our resilience investments. Earlier in this series, we discussed our 99.99% SLA, introduced the core principles and architectural basis for our resilience roadmap, and shared plans for highly differentiated resilience capabilities such as a built-in, fully automatic backup authentication service. Today, we would like to share our latest progress, key results, and plans for further improvement.

 

Progress 

Our approach to resilience is multi-layered. At the core, the Azure AD service is built to the enterprise-grade security standards and best practices adopted across the Microsoft Cloud. This means:

 

  • The service is designed with multiple layers of internal redundancy to withstand failures of everything from VMs and network switches all the way to entire data centers. We regularly validate this in production, including failing out of entire data centers. 
  • We continually work to identify and remove single points of failure anywhere in the service or in its dependencies. As an example of an external dependency, Azure AD uses telecommunication partners to deliver SMS and voice notifications for MFA. To ensure resilience of message delivery we use several (up to five in each country) telco providers in a fully active/active way.  
  • All changes are applied using a fully automated and very strict Safe Deployment Process (SDP). This is designed to catch and stop the impact of any change-related issue as early as possible, ideally before any customer impact. The Safe Deployment Process includes multiple layers of internal automated testing, a slow, gradual rollout in production, and layers of telemetry health signals that can automatically stop and roll back any change that is causing unexpected impact.
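The gradual rollout with automatic stop and rollback described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the actual Safe Deployment tooling: the stage fractions, the `RolloutResult` type, and the `is_healthy` callback are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical rollout stages: the fraction of production receiving the change.
STAGES = (0.01, 0.05, 0.25, 1.0)


@dataclass
class RolloutResult:
    completed: bool        # True if the change reached every stage
    reached: List[float]   # stages that were deployed, in order
    rolled_back: bool      # True if a health signal stopped the rollout


def staged_rollout(is_healthy: Callable[[float], bool],
                   stages: Sequence[float] = STAGES) -> RolloutResult:
    """Deploy stage by stage; stop and roll back on the first unhealthy signal."""
    reached: List[float] = []
    for fraction in stages:
        reached.append(fraction)       # deploy to this slice of production
        if not is_healthy(fraction):   # check health telemetry before widening
            return RolloutResult(False, reached, True)
    return RolloutResult(True, reached, False)
```

Calling `staged_rollout(lambda f: f < 0.05)` simulates a change whose health signal degrades once it reaches 5% of production: the rollout stops at that stage and is marked as rolled back, so the remaining 95% never receives the change.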

 
To add layers of resilience on top of these core practices, in the past year we have:

 

  • Advanced efforts to reduce the likelihood and impact of any potential disruptions (Backup Auth System and Regional Isolated Auth Endpoints). 
  • Completed migration to a Cell-Based Architecture with over 100 cells, which is designed to structurally reduce the scope of impact of any fault or service degradation. 
  • Innovated and delivered technologies like Continuous Access Evaluation (CAE) that enable applications to build in resilience. 

 

The combination of these investments is designed to provide a highly resilient core service, coupled with layers of resiliency built all the way into our SDKs and apps that allow them to continue to operate unaffected, even if all the other layers of resilience fail.
 
These resilience capabilities are available to all applications that integrate with Azure AD, not just those from Microsoft. Your applications can make the best use of these resilience capabilities by leveraging the Microsoft Authentication Library (MSAL).

 
Specifically, the key resilience innovations that we delivered in the last year include:  

 

Backup Auth System:

The concept of multiple redundant systems has existed for years in mission-critical and life-critical systems such as flight computers and spacecraft. Over a year ago we introduced an equivalent concept, the Backup Auth System, to protect Azure AD.
 
We have now completed the full production rollout of this system, and it is fully operational, protecting both Microsoft and third-party applications. To ensure the backup system is there if it’s ever needed, it is constantly exercised with a portion of all traffic, and inside Microsoft we regularly switch over to the backup to measure its effectiveness. We are constantly raising the amount of protection the system offers and the number of auth scenarios it covers.

 

Continuous Access Evaluation – ensuring apps can continue to operate:  

Modern authentication depends on the ability of clients, such as Outlook, to refresh user access to resources such as Exchange Online frequently, typically every hour. Earlier this year we introduced Continuous Access Evaluation (CAE), which allows Azure AD to proactively and instantly revoke access when security events are detected. With this innovation we’ve been able to improve security while also raising the refresh interval from 1 hour to 24 hours for supported resources. The extended refresh interval means existing sessions can continue to operate unaffected by any issues in the core Azure AD service for significantly longer (on average 12 hours, or half of the 24-hour access lifetime).
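The "on average 12 hours" figure follows from a simple assumption: at the moment a disruption starts, the age of each session's token is spread uniformly across its lifetime. The sketch below works through that arithmetic. The function names and the uniform-age model are illustrative assumptions, not a description of how Azure AD measures this.

```python
def expected_remaining_hours(lifetime_hours: float) -> float:
    """Mean remaining lifetime when token age is uniform on [0, lifetime]."""
    return lifetime_hours / 2


def fraction_surviving(outage_hours: float, lifetime_hours: float) -> float:
    """Share of sessions whose token outlives an outage of the given length."""
    if outage_hours <= 0:
        return 1.0
    return max(0.0, 1.0 - outage_hours / lifetime_hours)


# 24-hour lifetime: 12 hours remaining on average.
print(expected_remaining_hours(24))
# Share of sessions riding out a 90-minute disruption, 24-hour vs 1-hour tokens.
print(fraction_surviving(1.5, 24), fraction_surviving(1.5, 1))
```

Under this model, a 90-minute disruption would leave about 94% of 24-hour sessions with a still-valid token, while every 1-hour session would need a refresh before it ended.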

 

Thirty percent of our user sessions are presently benefiting from CAE’s enhanced security and extended resilience. During 2023, we plan to integrate more resources to increase this coverage. 

 

Cell-Based Architecture: 

Cell-based architectures create strongly isolated, independent partitions of the service that are designed to contain the impact of any failure to that cell. An analogy is a ship with compartments or bulkheads that are able to flood without impacting any other compartment or affecting the integrity of the overall vessel. In the last year, Azure AD has moved to 117 cells, with no cell containing more than two percent of the traffic in the service. Cells themselves have multiple levels of resiliency built in: for example, they serve traffic from multiple independent Azure regions and fully respect the resilience principles outlined above.

 

This level of isolation at such a fine granularity means any given tenant (which lives in a Cell) is very strongly isolated from the vast majority of failures in other tenants.
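To make the isolation idea concrete, here is a minimal sketch of how tenants might be pinned to cells and how small the share affected by a single cell failure then becomes. The hash-based placement rule, the function names, and the use of tenant count as a stand-in for traffic are assumptions made for illustration. The post does not describe how Azure AD actually assigns tenants to cells.

```python
import hashlib
from collections import Counter
from typing import Iterable

NUM_CELLS = 117  # cell count quoted in the text above


def cell_for_tenant(tenant_id: str, num_cells: int = NUM_CELLS) -> int:
    """Map a tenant to a cell with a stable hash (an assumed placement rule)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def blast_radius(tenants: Iterable[str], failed_cell: int) -> float:
    """Fraction of tenants that live in the failed cell."""
    tenant_list = list(tenants)
    placement = Counter(cell_for_tenant(t) for t in tenant_list)
    return placement[failed_cell] / len(tenant_list)


# Synthetic tenant IDs, purely for demonstration.
tenants = [f"tenant-{i}" for i in range(5000)]
print(f"share of tenants in cell 0: {blast_radius(tenants, 0):.2%}")
```

With an even spread across 117 cells, each cell holds a little under 1% of tenants, which is consistent with the stated ceiling of two percent of traffic per cell.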

 

Regionally Isolated Authentication Endpoints 

Azure AD is not just the identity system for users but is also the identity and access management (IAM) system for apps and services built on Azure. For example, Azure AD is used for all infrastructure authentication such as a VM authenticating itself to Storage. To ensure we have multiple layers of resilience there as well, we’ve introduced regionally isolated authentication endpoints in most Azure regions. Regional endpoints add a third layer of resilience even beyond the backup auth system and can serve authentications entirely in a region.  

 

At present, 95% of Managed identities for Azure Resources are served and protected by regional endpoints. This ensures that if there is degradation of the primary Azure AD authentication service, customer workloads built on Azure are not impacted.

 

Putting it all together - key results

We are pleased to report that in the last 12 months we met or exceeded our 99.99% Azure AD uptime SLA in every month except December 2021. In that month, availability dropped to the three-nines level (99.978%) because of a service degradation that impacted some back-end systems. The issue affected fewer than 1% of our users and was resolved in less than 90 minutes.
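For a sense of scale, the percentages above can be translated into minutes per month. The sketch below uses a simple time-based reading, in which the whole service is either up or down. That reading is an assumption for illustration only: the actual SLA calculation may weight by affected users, which this sketch ignores, and the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_pct: float, days: int) -> float:
    """Minutes of full unavailability implied by an availability percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# December has 31 days, i.e. 44,640 minutes.
print(round(allowed_downtime_minutes(99.99, 31), 2))   # budget at the SLA target
print(round(allowed_downtime_minutes(99.978, 31), 2))  # the reported December figure
```

On this reading, 99.99% allows roughly 4.5 minutes of full-outage-equivalent time in a 31-day month, and 99.978% corresponds to roughly 9.8 minutes.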

   

This service degradation was a real-world demonstration of the value of our layered resilience measures: the multiple layers of resilience worked together to ensure that most of our users saw no impact at all.

 

What’s next

Our goal is to continually raise the bar on resilience, putting even more tools and capabilities in the hands of our customers. We will continue this series with regular updates, sharing both roadmap progress like the above and deep dives on how our architecture and systems support service resilience. In the meantime, we also recommend you check out our SLA details and monthly uptime update.

 

Thank you.

 

Nadim Abdo 

CVP, Engineering

 

 
