Experiencing Alerting failure for Metric Alerts - 12/15 - Resolved
Published Dec 15 2020 02:19 PM 1,723 Views
Final Update: Wednesday, 16 December 2020 07:53 UTC

We've confirmed that all systems are back to normal with no customer impact as of 12/16, 07:35 UTC. Our logs show the incident started on 12/01, 23:00 UTC and that during the 14 days 8 Hours 35 Minutes that it took to resolve the issue some of customers may have experienced duplicate notifications for unhealthy metric alerts that only contain non-ascii characters in the name of the alert.
  • Root Cause: We identified the issue was caused by a recent deployment that introduced a configuration incompatibility between dependent services.
  • Incident Timeline: 14 days 8 Hours & 35 minutes - 12/01, 23:00 UTC through 12/16, 07:35 UTC
We understand that customers rely on Metric Alerts as a critical service and apologize for any impact this incident caused.

-Vamshi

Update: Wednesday, 16 December 2020 06:27 UTC

Root cause has been isolated to new feature rollout which was causing duplicate notification when resource was unhealthy for some customer in the azure portal. Specifically, any metric alert names with non ascii characters see this impact. To address the issue, we've started mitigation to disable the feature. Some customers may still continue to see notifications and we estimate 4 hours before all is addressed.
  • Mitigation Steps: Redeploying with new feature to disable the alert notification is in progress
  • Next Update: Before 12/16 10:30 UTC
-Vamshi

Update: Wednesday, 16 December 2020 02:45 UTC

Root cause has been isolated to new feature rollout which was causing duplicate notification when resource was unhealthy for some customer in the azure portal. Specifically, any metric alert names with non ascii characters see this impact. To address the issue, we've started mitigation to disable the feature. Some customers may still continue to see notifications and we estimate 4 hours before all is addressed.
  • Mitigation Steps: Redeploying with new feature disabled is in progress.
  • Next Update: Before 12/16 07:00 UTC
-Vincent

Update: Tuesday, 15 December 2020 22:17 UTC

Root cause has been isolated to new feature rollout which was causing duplicate notification when resource was unhealthy for some customer in the azure portal. Specifically, any metric alert names with non ascii characters see this impact. To address the issue, we've started mitigation to disable the feature. Some customers may continue to see notifications and we estimate 4 hours before all is addressed.
    Root Cause: New feature rollout which was causing duplicate notification.
    Incident Timeline:  14 day 23 hours - 12/01, 23:00 UTC through on going 12/15, 22:00 UTC
    Next Update: Before 12/16 01:30 UTC

-Vincent

Version history
Last update:
‎Dec 16 2020 12:05 AM
Updated by: