Microsoft Entra Blog

99.99% uptime for Azure Active Directory

nadimabdo
Microsoft
Dec 18, 2020

Today, I’m pleased to announce that we are taking the next step in our commitment to the resilience and availability of Azure AD. On April 1, 2021, we will update our public service level agreement (SLA) to promise 99.99% uptime for Azure AD user authentication, an improvement over our previous 99.9% SLA.  This change is the result of a significant and ongoing program of investment in continually raising the bar for resilience of the Azure AD service. We will also share our roadmap for the next generation of resilience investments for Azure AD and Azure AD B2C in early 2021.
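
To put the two targets in perspective, the short sketch below converts an availability percentage into a monthly downtime budget. It is only an illustration: the 30-day month and the function name `downtime_budget_minutes` are assumptions made here, and the actual SLA defines its own measurement and calculation rules.

```python
from decimal import Decimal

MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_percent: str, days_in_month: int = 30) -> Decimal:
    """Minutes of downtime per month that still meet the availability target.

    Assumes a simple calendar-minute model; the published SLA terms may
    measure availability differently.
    """
    unavailable_fraction = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return unavailable_fraction * days_in_month * MINUTES_PER_DAY


if __name__ == "__main__":
    for target in ("99.9", "99.99"):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability allows {budget.normalize()} minutes per 30-day month")
```

Under these assumptions, moving from 99.9% to 99.99% shrinks the monthly budget by a factor of ten, from roughly 43 minutes to a little over 4 minutes.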

 

Because our identity services are vital to keeping our customers' businesses running, resilience and security are and always will be our top priorities. In the last year, we've seen a surge in demand as organizations moved workforces online and schools enabled study from home. In fact, some national education systems moved entire student populations online with Azure AD. Azure AD now serves more than 400 million Monthly Active Users (MAU) and processes tens of billions of authentications per day. We treat every one of those authentication requests as a mission-critical operation.

 

In conversations with our customers, we learned that the most critical promise of our service is ensuring that every user can sign in to the apps and services they need without interruption. To deliver on this promise, we are updating the definition of Azure AD SLA availability to include only user authentication and federation (and removing administrative features). This focus on critical user authentication scenarios aligns our engineering investments with the vital functions that must stay healthy for our customers' businesses to run.

 

Of course, we will continue to improve reliability in all areas of Microsoft identity services. Last year, we shared our approach and architectural investments to drive availability of Azure AD. I’m pleased to share significant progress completed since then.

 

  1. We’ve made strong progress on moving the authentication services to a fine-grained fault domain isolation model -- also called “cellularized architecture”. This architecture is designed to scope and isolate the impact of many classes of failures to a small percentage of total users in the system. In the last year, we’ve increased the number of fault domains by over 5x and will continue to evolve this further over the next year.

  2. We have begun rolling out an Azure AD Backup Authentication service that runs with decorrelated failure modes from the primary Azure AD system. This backup service transparently and automatically handles authentications for participating workloads as an additional layer of resilience on top of the multiple levels of redundancy in Azure AD. You can think of this as a backup generator or uninterruptible power supply (UPS) designed to provide additional fault tolerance while staying completely transparent and automatic to you. At present, Outlook Web Access and SharePoint Online are integrated with this system. We will roll out the protections across critical Microsoft apps and services over the next few quarters.

 

  3. For Azure infrastructure authentication, our managed identities for Azure resources capabilities are now transparently integrated with regional authentication endpoints. These regional endpoints provide significant additional layers of resilience and protection, even in the event of an outage in the primary Azure AD authentication system.

  4. We’ve continued to make investments in the scalability and elasticity of the service. These investments were proven out during the early days of the COVID crisis, when we saw surging growth in demand. We were able to seamlessly scale what is already the world’s largest enterprise authentication system without impact. This included not just aggregate growth but very rapid onboarding, including entire nations moving their school systems (millions of users) online overnight.

  5. We are rolling out innovations to the authentication system, such as Continuous Access Evaluation (CAE) for critical Microsoft 365 services. CAE improves security by providing instant enforcement of policy changes, and it improves resilience by securely allowing longer token lifetimes.

The above are just some examples of the key resilience investments we have made that have enabled us to raise the public SLA to 99.99%. We will have more to share in 2021 on the next generation of resilience investments for Azure AD and Azure AD B2C.
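
To illustrate why more fault domains shrink the impact of a single failure, here is a minimal sketch of the cell idea from the first item in the list. It is not a description of how Azure AD actually places users: the hash-based placement rule, the function names `cell_for` and `impacted_share`, and the cell counts are all assumptions chosen for illustration.

```python
import hashlib


def cell_for(user_id: str, cell_count: int) -> int:
    """Map a user to a cell with a stable hash (hypothetical placement rule)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def impacted_share(user_ids: list[str], cell_count: int, failed_cell: int) -> float:
    """Fraction of users whose cell is the one that failed."""
    impacted = sum(1 for uid in user_ids if cell_for(uid, cell_count) == failed_cell)
    return impacted / len(user_ids)


if __name__ == "__main__":
    users = [f"user{i}@example.com" for i in range(10_000)]  # synthetic IDs
    for cells in (10, 50):  # 50 is 5x the hypothetical starting count
        share = impacted_share(users, cells, failed_cell=0)
        print(f"{cells} cells: about {share:.1%} of users affected by one cell failure")
```

With an even spread of users, a failure confined to one cell touches roughly 1/N of them, so a 5x increase in the number of cells cuts the expected share of affected users to about a fifth.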

Planning for resilience in your identity estate

Many customers have also asked how to configure and use Azure AD in the most resilient way. To help you build resilience into your identity and access management estate, we’ve published technical guidance with best practices for building resilience into the policies you create.

 

Thank you for your ongoing trust and partnership.

 

Nadim Abdo

VP Engineering (Identity)

 

Updated Dec 18, 2020
Version 2.0
  • xdansmith, the five improvements mentioned in the post were all complete when Nadim posted in December. Since publishing the post, we’ve continued work on architectural, operational, and infrastructure reliability initiatives aligned with the principles described here: Advancing Azure Active Directory availability | Azure Blog and Updates | Microsoft Azure. We know our customers rely on Azure AD for vital apps, so resilience and security are our top priorities. We’ll follow up later this year with another post to let you know what improvements we’ve completed and what we’re working on.

     

    For a detailed description of the incident this week, you can see the RCA on the Azure Status History page (posted on 3/15 with tracking ID LN01-P8Z).

  • xdansmith
    Copper Contributor

    Everything in this post sounds great. It'd be very much appreciated to have a follow-up post to this one to explain where things stand now and where they're going, specifically in respect to what took Azure AD down globally this week.

     

    Are the points in this post not yet live so the benefits haven't been realized?  

  • pgierveld The new SLA commitment is also per month and applies to licensed Azure AD Premium customers. The specific terms of the SLA, including additional details on the scenarios covered and the calculations, will be included in the public SLA update on April 1.

  • takatano The new SLA (99.99%) applies to Azure AD user authentication scenarios for all licensed Azure AD Premium customers. We are working through the details of updating the Azure AD B2C SLA and will announce changes there in the near future.

  • pgierveld
    Copper Contributor

    The SLA was 99.9%, per month! Is the new 99.99% also per month? Meaning you will get a discount on the subscription for the month when there was an outage on Azure AD Services for your tenant, lasting longer than 4 minutes.

  • takatano
    Copper Contributor

    Will the new SLA (99.99%) be offered to both Azure AD and Azure AD B2C?

  • NadeemAkh
    Copper Contributor

    Very impressive, nadimabdo. Pleased with the update. The identity team should be very proud of these developments.