Written byMark Russinovich, Chief Technology Officer and Technical Fellow, Microsoft Azure
This blog post has been co-authored by Jeffrey He, Product Manager, AIOps Platform and Experiences Team.
InMicrosoft Azure, we invest tremendous efforts in ensuring our services are reliable by predicting and mitigating failures as quickly as we can. In large-scale cloud systems, however, we may still experience unexpected issues simply due to the massive scale of the system. Given this, using AIOps to continuously monitor health metrics is fundamental to running a cloud system successfully, as we have shared in our earlier posts. First, we shared more about this inAdvancing Azure service quality with artificial intelligence: AIOps. We also shared an example deep dive of how we use AIOps to help Azure in the safe deployment space inAdvancing safe deployment with AIOps. Today, we share another example, this time about how AI is used in the field of anomaly detection. Specifically, we introduce AiDice, a novel anomaly detection algorithm developed jointly by Microsoft Research and Microsoft Azure that identifies anomalies in large-scale, multi-dimensional time series data. AiDice not only captures incidents quickly, it also provides engineers with important context that helps them diagnose issues more effectively, providing the best experience possible for end customers.
Why are AIOps needed for anomaly detection?
We need AIOps for anomaly detection because the data volume is simply too large to analyze without AI. In large-scale cloud environments, we monitor an innumerable number of cloud components, and each component logs countless rows of data. In addition, each row of data for any given cloud component might contain dozens of columns such as the timestamp, the hardware type of the virtual machine, the generation number, the OS version, the datacenter where the nodes hosting the virtual machine stay in, or the country. The structure of the data we have is essentially multi-dimensional time series data, which contains an exponential number of individual time series due to the various combinations of dimensions. This means that iterating through and monitoring every single time series is simply not practical—applying AIOps is necessary.