Experiencing Alerting failure for Metric Alerts - 05/28 - Resolved
Published May 28 2020 01:51 PM
Final Update: Thursday, 28 May 2020 22:08 UTC

We've confirmed that all systems are back to normal, with no customer impact as of 5/28, 21:56 UTC. Our logs show that the incident started on 5/28, 19:24 UTC, and that during the roughly 2 1/2 hours it took to resolve the issue, fewer than 1% of customers experienced delayed or missing alerts.
  • Root Cause: The failure was caused by an outage in a back-end data ingestion system and was complicated by a watchdog system that was reporting errors incorrectly.
  • Incident Timeline: 2 hours and 32 minutes - 5/28, 19:24 UTC through 5/28, 21:56 UTC
We understand that customers rely on Metric Alerts as a critical service, and we apologize for any impact this incident caused.

-Jack

Update: Thursday, 28 May 2020 21:53 UTC

Root cause has been isolated to a failure in a back-end system caused by a watchdog falsely reporting a problem and blocking legitimate traffic. To address this issue, we are repairing the faulty watchdog. Some customers may continue to experience delayed or missing metric alerts, and we estimate it will take about an hour before delayed and missing alerts cease.
  • Work Around: none
  • Next Update: Before 05/28 23:00 UTC
-Jack

Initial Update: Thursday, 28 May 2020 20:50 UTC

We are aware of issues within Metric Alerts and are actively investigating. Some customers may experience alerts that do not fire or that fire late. This problem began at 19:24 UTC on 5/28.
  • Work Around: none at this time
  • Next Update: Before 05/28 22:00 UTC
We are working hard to resolve this issue, and we apologize for any inconvenience.
-Jack



Version history
Last update: May 28 2020 03:12 PM