Introducing Metrics Advisor - A new Cognitive Service

Microsoft

Sep 22, 2020

For companies and organizations embracing digital transformation, it is key to stay on top of the status of physical assets, products, services, and the business through data intelligence. They do this by extracting the key metrics that serve as proxies for those assets and monitoring them 24x7. If anything goes wrong, they want to know immediately and act on it to prevent small issues from becoming customer-impacting incidents. This becomes difficult when the data volume is huge, so they need a way to identify, at scale, the objects, groups of objects, events, or event patterns that deviate from the expected norm.

We are pleased to announce the preview of Metrics Advisor, part of Azure Cognitive Services, which addresses the need for metrics intelligence. The service ingests data from various sources, uses machine learning to automatically find anomalies in sensor, product, and business metrics, and provides diagnostic insights. Metrics Advisor goes beyond simple anomaly detection by giving developers an out-of-the-box platform for multi-dimensional metric data ingestion, anomaly detection, and automatic model customization through user feedback powered by reinforcement learning. Developers can easily use the capabilities of the pipeline to build predictive maintenance, AIOps (artificial intelligence for IT operations), and business metric monitoring solutions.

Overview

Metrics Advisor leverages sophisticated mathematical techniques (machine learning and other advanced analytics) to precisely detect more subtle anomalies, provide earlier notice of likely future anomalies, and streamline the design and development of systems that detect (and even act on) anomalies. Built on the Anomaly Detector, it adds the capabilities to ingest data from various standard data sources and, behind the scenes, to build models from that data, tune them, and customize them based on feedback. Last but not least, it provides root cause analysis with advanced insights and recommended actions.

What Can You Do with Metrics Advisor

Let me show you what kind of problems you can solve with Metrics Advisor. Imagine you are responsible for company Contoso’s e-commerce website. To ensure both the business and the services are in good health, many important business and infrastructure metrics are generated and onboarded to Metrics Advisor, e.g., DAU (daily active users), CPU usage, web page latency, database throughput…
On a given date, a DAU anomaly was detected on the all-up metric, aggregated across all regions and all channels, and you got a notification.

At this moment, automated diagnostics info was available on the portal or via APIs. It turned out that the leading contributors to this anomaly were from the region of the United States and the channel of Direct. So the investigation and mitigation should focus on those areas to start with.

Were there any other underlying issues you should look into? By checking the Metrics graph, which depicts the dependencies across the infrastructure metrics, it became obvious that a MySQL problem caused the latency issue in the web app, which in turn propagated to the DAU of a specific region and channel.

You, as the engineering lead of the Contoso e-commerce website, can easily get automated insights within a few minutes and identify the potential root cause. All the operations can be done via the portal or APIs if you would like to embed this capability into your org’s own experience. After Ignite 2020, we will launch SDKs to ease your coding with the APIs.

Magic Behind the Scenes

Model selection framework & tuning

Firstly, the time-series anomaly detection task is challenging because of the complex characteristics of time series, which are messy, stochastic, and often without proper labels. The lack of labels rules out training supervised models, and a single model can hardly fit different time series.
We present an automated model selection framework to automatically find the most suitable detection model with proper parameters for the incoming data. The model selection layer is extensible as it can be updated without too much effort when a new detector is available to the service. Finally, we incorporate a customized tuning algorithm to flexibly filter anomalies to meet customers’ criteria. Experiments on real-world datasets show the effectiveness of our solution.
As shown in the pipeline below, the incoming series is first processed by a set of transformations and feature extractors. Then in the automated model selection phase, Model Selector takes the extracted features as input and outputs the anomaly detection model that best fits the input data. Each anomaly detection model is associated with a Parameter Estimator, which is used to compute related parameters. Next, our service uses the selected model and its corresponding parameters to detect anomalies of the input data and obtains a preliminary anomaly detection result. Lastly, tuned parameters are applied to obtain a customized anomaly detection result.
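To make the flow of the stages concrete, here is a minimal Python sketch of the same shape: feature extraction, model selection, parameter estimation, detection, and a tuning threshold. It is only an illustration under stated assumptions, not the service's implementation. The two toy detectors, the lag autocorrelation feature, the 0.5 selection cutoff, and every function name here are hypothetical.

```python
import statistics
from typing import Any, Dict, List, Tuple

Params = Dict[str, Any]


def extract_features(series: List[float], period: int) -> Dict[str, float]:
    """Feature extractor: lag-`period` autocorrelation as a crude seasonality score."""
    mean = statistics.fmean(series)
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return {"seasonality": 0.0}
    cov = sum((series[i] - mean) * (series[i - period] - mean)
              for i in range(period, len(series)))
    return {"seasonality": cov / var}


def _robust_scale(residuals: List[float]) -> float:
    """Median absolute residual, floored so that it is never zero."""
    return max(statistics.median([abs(r) for r in residuals]), 1e-9)


# --- Toy detector 1: distance from the global median -------------------
def estimate_global(series: List[float], period: int) -> Params:
    center = statistics.median(series)
    return {"center": center, "scale": _robust_scale([x - center for x in series])}


def score_global(series: List[float], params: Params) -> List[float]:
    return [abs(x - params["center"]) / params["scale"] for x in series]


# --- Toy detector 2: distance from a per-phase seasonal profile --------
def estimate_seasonal(series: List[float], period: int) -> Params:
    profile = [statistics.median(series[p::period]) for p in range(period)]
    residuals = [x - profile[i % period] for i, x in enumerate(series)]
    return {"profile": profile, "scale": _robust_scale(residuals)}


def score_seasonal(series: List[float], params: Params) -> List[float]:
    profile = params["profile"]
    return [abs(x - profile[i % len(profile)]) / params["scale"]
            for i, x in enumerate(series)]


# Each model is paired with its parameter estimator (hypothetical registry).
MODELS = {
    "global": (estimate_global, score_global),
    "seasonal": (estimate_seasonal, score_seasonal),
}


def select_model(features: Dict[str, float]) -> str:
    """Model selector: an assumed rule, strong seasonality picks the seasonal model."""
    return "seasonal" if features["seasonality"] > 0.5 else "global"


def run_pipeline(series: List[float], period: int,
                 threshold: float = 6.0) -> Tuple[str, List[int]]:
    """Run all stages; `threshold` plays the role of the customer-tuned parameter."""
    features = extract_features(series, period)
    name = select_model(features)
    estimate, score = MODELS[name]
    params = estimate(series, period)
    scores = score(series, params)  # preliminary detection result
    anomalies = [i for i, s in enumerate(scores) if s > threshold]  # tuned result
    return name, anomalies
```

Calling `run_pipeline(series, period)` returns the name of the selected model and the indices of the flagged points. Adding a new detector to this sketch means adding one entry to `MODELS` and one branch to `select_model`, which mirrors the extensibility point made above.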

Model customization through user feedback

Tuning is one way to customize the model to users’ business and dataset. We are also leveraging user feedback as human knowledge to adapt the model/parameters to better serve and fit customers’ data and business.

The motivation is that different customers have different service patterns and anomaly definitions. It is complicated for customers to tune the model directly to adapt it to their scenarios. To solve this problem, we provide a feedback mechanism in Metrics Advisor. Customers can obtain more accurate detection results by providing confirmation on the detection results; this feedback is fed into an end-to-end framework based on Reinforcement Learning (RL) that learns from it.
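The details of the RL framework are not described in this post, so the following is only a simplified stand-in for the idea of learning from feedback: an epsilon-greedy bandit, one of the simplest reinforcement learning setups, that picks among a few candidate detection thresholds and is rewarded when its flags agree with the customer's confirmations. The class name, the reward definition, and the candidate thresholds are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Set


class SensitivityBandit:
    """Epsilon-greedy bandit over candidate thresholds (hypothetical stand-in)."""

    def __init__(self, thresholds: Sequence[float],
                 epsilon: float = 0.1, seed: int = 0) -> None:
        self.thresholds = list(thresholds)
        self.epsilon = epsilon
        self.counts = [0] * len(self.thresholds)
        self.values = [0.0] * len(self.thresholds)  # running mean reward per arm
        self._rng = random.Random(seed)

    def _best_arm(self) -> int:
        return max(range(len(self.values)), key=self.values.__getitem__)

    def choose(self) -> int:
        """Try every arm once, then explore with probability epsilon."""
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        if self._rng.random() < self.epsilon:
            return self._rng.randrange(len(self.thresholds))
        return self._best_arm()

    def update(self, arm: int, reward: float) -> None:
        """Incrementally update the running mean reward of the chosen arm."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    def best_threshold(self) -> float:
        return self.thresholds[self._best_arm()]


def feedback_reward(scores: List[float], threshold: float,
                    confirmed: Set[int]) -> float:
    """Assumed reward: fraction of points where the flag matches the feedback."""
    agree = sum((s > threshold) == (i in confirmed)
                for i, s in enumerate(scores))
    return agree / len(scores)


# Example loop: anomaly scores for six points; the customer confirmed
# points 3 and 5 as real anomalies.
scores = [0.5, 1.0, 3.5, 8.0, 2.0, 9.5]
confirmed = {3, 5}
bandit = SensitivityBandit([1.5, 3.0, 6.0, 9.0], epsilon=0.2, seed=1)
for _ in range(20):
    arm = bandit.choose()
    bandit.update(arm, feedback_reward(scores, bandit.thresholds[arm], confirmed))
```

In this toy setup the customer never touches a threshold directly. They only confirm results, and the preferred sensitivity drifts toward the setting that best matches those confirmations.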

Adaptive root cause analysis

Root cause analysis consists of two major processes:

Automation part: Based on the nature of the metrics, such as hierarchy and distribution, machine learning is used to generate an analysis report that identifies the most likely root causes of an incident.
Online learning part: Based on the data topology, feedback, and interactions of the incident owner, the service infers the severity of an incident to make sure the alerts are actionable and to reduce ignorable information in analysis reports.

The learning part feeds strategies into the automation part to improve the quality of the analysis reports and alerts.
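As a rough illustration of how the hierarchy of a metric can point to likely root causes, the sketch below ranks the values of each dimension by their share of the total deviation between expected and actual values. It reuses the region and channel dimensions from the Contoso scenario, but the numbers are made up, and the function name and the simple share-of-delta ranking are assumptions rather than the algorithm the service uses.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def rank_contributors(rows: Sequence[Row],
                      dimensions: Sequence[str]) -> Dict[str, List[Tuple[str, float]]]:
    """For each dimension, rank its values by share of the total deviation.

    Each row holds the dimension values plus 'expected' and 'actual' numbers.
    """
    total_delta = sum(r["actual"] - r["expected"] for r in rows)
    if total_delta == 0:
        raise ValueError("no overall deviation to attribute")

    report: Dict[str, List[Tuple[str, float]]] = {}
    for dim in dimensions:
        deltas: Dict[str, float] = defaultdict(float)
        for r in rows:
            deltas[r[dim]] += r["actual"] - r["expected"]
        # A share near 1.0 means this value explains most of the anomaly.
        shares = [(value, delta / total_delta) for value, delta in deltas.items()]
        report[dim] = sorted(shares, key=lambda item: item[1], reverse=True)
    return report


# Made-up DAU numbers for illustration only.
rows = [
    {"region": "US", "channel": "Direct", "expected": 1000, "actual": 600},
    {"region": "US", "channel": "Search", "expected": 800, "actual": 780},
    {"region": "EU", "channel": "Direct", "expected": 500, "actual": 470},
    {"region": "EU", "channel": "Search", "expected": 400, "actual": 400},
]
report = rank_contributors(rows, ["region", "channel"])
top_region = report["region"][0][0]
top_channel = report["channel"][0][0]
```

With these sample numbers the top contributors are the US region and the Direct channel, matching the kind of diagnostic result described in the scenario earlier in this post.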

While a number of automation technologies are available for root cause analysis, Metrics Advisor takes customization into account. Because Metrics Advisor is a general cloud service but customers’ scenarios are diverse, similar incidents might mean different things for different customers, or for the same service at different stages. The ability to learn from customers’ feedback and be implicitly tuned to be more actionable is the biggest advantage of Metrics Advisor. To implement this, Metrics Advisor creatively combines online learning technologies and incident representation to achieve self-evolving root cause analysis.