Written by Mark Russinovich, Chief Technology Officer and Technical Fellow, Microsoft Azure
“Changes to Azure services and the Azure platform itself are both inevitable and beneficial, to ensure continuous delivery of updates, new features, and security enhancements. However, change is also a primary cause of service regressions that can contribute towards reliability issues, for hyperscale cloud providers and indeed for any IT service provider. As such, it is critical to catch any such problems as early as possible during the development and deployment rollout, to minimize any impact on the customer experience. As part of our ongoing Advancing Reliability blog series, today I’ve asked Principal Program Manager Jian Zhang from our AIOps team to introduce how we’re increasingly leveraging machine learning to de-risk these changes, ultimately to improve the reliability of Azure.” - Mark Russinovich, CTO, Azure
This post includes contributions from Principal Data Scientists Ken Hsieh and Ze Li, Principal Data Scientist Manager Yingnong Dang, and Partner Group Software Engineering Manager Murali Chintalapati.
In our earlier blog post, “Advancing safe deployment practices,” Cristina del Amo Casado described how we release changes to production, for both code and configuration changes, across the Azure platform. The process delivers changes progressively, in phases that incorporate enough bake time to detect, at a small scale, most regressions missed during testing.
Continuous monitoring of health metrics is a fundamental part of this process, and this is where AIOps plays a critical role: it detects anomalies to trigger alerts and automates corrective actions such as stopping the deployment or initiating rollbacks.
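To make the idea concrete, here is a minimal sketch of a health-gated phased rollout. It is an illustration only, not how Azure implements it: the phase names, the `read_metric` callback, the failure-rate numbers, and the simple z-score check are all assumptions chosen for brevity, standing in for the machine learning models discussed in this post.

```python
from statistics import mean, stdev


def is_anomalous(baseline, observed, z_threshold=3.0):
    """Return True if `observed` is more than `z_threshold` standard
    deviations above the mean of the pre-deployment baseline samples."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Flat baseline: any increase counts as a regression.
        return observed > mu
    return (observed - mu) / sigma > z_threshold


def run_rollout(phases, baseline, read_metric, z_threshold=3.0):
    """Deploy phase by phase and stop at the first unhealthy phase.

    `read_metric(phase)` is a hypothetical callback that returns the
    health metric (here, a failure rate) observed after the bake time.
    Returns (completed_phases, action).
    """
    completed = []
    for phase in phases:
        observed = read_metric(phase)
        if is_anomalous(baseline, observed, z_threshold):
            # Halt here so later, larger phases never receive the change.
            return completed, f"rollback:{phase}"
        completed.append(phase)
    return completed, "complete"


if __name__ == "__main__":
    # Made-up failure rates, for illustration only.
    baseline = [0.010, 0.012, 0.011, 0.009, 0.010]
    observed = {"canary": 0.011, "pilot": 0.050, "broad": 0.010}
    print(run_rollout(["canary", "pilot", "broad"], baseline, observed.get))
```

In this sketch the regression surfaces in the second phase, so the rollout stops before the change reaches the broadest phase. That is the point of progressive delivery with bake time: the blast radius of a bad change stays small.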
In the post that follows, we introduce how AI and machine learning are used to empower DevOps engineers, monitor the Azure deployment process at scale, detect issues early, and make rollout or rollback decisions based on impact scope and severity.