Postmortem of the Az 5.0 release
Published Nov 18 2020 08:54 AM
Microsoft

Earlier this year, I explained the principles behind Azure PowerShell releases. On October 27, we released a major version update (Az 5.0) whose most significant change was in how authentication to Azure is performed.

Unfortunately, this major release caused several issues for many of you. We held a postmortem on this release last week and want to share its findings.


What happened?

With this release, we changed the library used to authenticate to Azure, moving from ADAL (Azure Active Directory Authentication Library) to MSAL (Microsoft Authentication Library).

Customers encountered errors in the following categories:

  1. When the user's browser cannot be launched, ‘Connect-AzAccount’ fails with a cryptic error message.
  2. Long-running operations do not complete successfully.
  3. PowerShell scripts cannot authenticate to Azure using user-assigned managed identities.
  4. Customers can no longer access the token cache through an undocumented API.


Root causes

We identified a distinct root cause for each category above:

  1. The default behavior of ‘Connect-AzAccount’ changed in this version: it now opens the user's browser for interactive sign-in. When we could not complete authentication this way, the error message was unclear and did not provide recovery guidance to customers.
  2. The access token used for Azure authentication was not refreshed before it expired.
  3. Our tests did not catch that the wrong parameter was passed to Azure Identity when authenticating with a user-assigned managed identity.
  4. With the switch to MSAL, the token cache objects that some modules accessed directly are no longer available.


Mitigation and resolution

We implemented the following fixes in Az.Accounts 2.1.2:

  1. When we cannot launch the user's browser, the error message now provides recovery guidance. We are carefully evaluating how fallback behaviors would affect usability and scriptability in PowerShell.
  2. We fixed the logic to refresh the access token.
  3. We now pass the correct parameter to Azure Identity when authenticating with a user-assigned managed identity.


Impact

Each issue had a different impact:

  1. In environments such as Docker containers or remote sessions to a machine (SSH or PowerShell remoting), users were not given proper guidance on how to connect to Azure. Cloud Shell was also affected in certain scenarios.
  2. Azure PowerShell cmdlets running longer than one hour failed with an “Unauthorized” error. The most common scenario was an ARM template deployment.
  3. Authentication to Azure failed for PowerShell scripts running in Azure Functions with a user-assigned managed identity.
  4. First-party modules that accessed the token cache directly could not complete authentication.
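Item 2 reflects what happens when an access token expires in the middle of an operation: a client that never re-acquires its token starts receiving “Unauthorized” responses once the token's lifetime has elapsed. The Python sketch below illustrates the general refresh-before-expiry pattern that this kind of fix relies on. It is not the Az.Accounts implementation; the AccessToken and RefreshingTokenProvider types, the acquire callback, and the five-minute refresh margin are hypothetical, chosen only for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AccessToken:
    value: str
    expires_on: float  # absolute expiry, in seconds on the provider's clock


class RefreshingTokenProvider:
    """Hypothetical helper: caches a token and re-acquires it shortly before expiry."""

    def __init__(
        self,
        acquire: Callable[[], AccessToken],
        clock: Callable[[], float] = time.time,
        refresh_margin: float = 300.0,  # assumed 5-minute safety window
    ) -> None:
        self._acquire = acquire
        self._clock = clock
        self._refresh_margin = refresh_margin
        self._token: Optional[AccessToken] = None

    def get_token(self) -> str:
        now = self._clock()
        # Re-acquire when there is no cached token or it is about to expire.
        # Skipping this check lets a long-running operation keep using an expired token.
        if self._token is None or now >= self._token.expires_on - self._refresh_margin:
            self._token = self._acquire()
        return self._token.value


if __name__ == "__main__":
    fake_now = [0.0]  # simulated clock, in seconds
    issued = []

    def acquire() -> AccessToken:
        issued.append(len(issued) + 1)
        # Pretend every token is valid for one hour from the time it is issued.
        return AccessToken(f"token-{issued[-1]}", fake_now[0] + 3600)

    provider = RefreshingTokenProvider(acquire, clock=lambda: fake_now[0])
    print(provider.get_token())  # token-1
    fake_now[0] = 1800
    print(provider.get_token())  # still token-1
    fake_now[0] = 3400           # inside the refresh margin
    print(provider.get_token())  # token-2
```

Refreshing slightly before the recorded expiry, rather than exactly at it, leaves headroom for clock skew and for requests that are already in flight when the token would otherwise lapse.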


How did it go?

We identified the root causes quickly; however, the PowerShell Gallery suffered an unexpected outage on October 30, which delayed our ability to publish a fix for our customers.


Lessons learned

Even though a preview of the Az.Accounts module was available for 205 days, none of these issues were identified during that time. This incident showed us that our current approach to previews is not providing the feedback we expected.

Our release monitoring systems did not surface these issues; had they done so, we could have addressed them earlier.

We need to establish a recovery plan so that we can share mitigations with impacted customers and partners.

Our mock-based tests miss some scenarios, and our end-to-end test coverage has the largest gaps.


Corrective actions

Based on this analysis, we are considering the following actions:

  • We will provide a new cmdlet, as well as developer guidance, for accessing the token cache.
  • We are adding extra checks on pull requests regarding how the token cache is accessed.
  • We will share our test matrix with the community.
  • We will perform extra validation of interactive authentication for every release that updates the authentication library.
  • We will improve our testing procedures and increase the scope of our CI pipelines.
  • We will add tests for long-running operations.
  • We will investigate how we can apply some of Azure's Safe Deployment Practices (SDP) to Azure PowerShell releases.
  • We will evaluate test scenarios in environments that run PowerShell scripts (for example, Azure Functions and Azure Automation).
  • We will improve our monitoring procedures for new releases.


We plan to share our progress toward these goals by summer 2021.

The Azure PowerShell team


Last update: Nov 18 2020 08:58 AM