Final Update: Thursday, 28 May 2020 22:08 UTC
We've confirmed that all systems are back to normal with no customer impact as of 5/28, 21:56 UTC. Our logs show the incident started on 5/28, 19:24 UTC and that during the 2 1/2 hours that it took to resolve the issue less than 1% of customers experienced delayed or missing alerts.
-
Root Cause: The failure was due to an outage in a back end data ingestion system and complicated by a watchdog system that was reporting errors incorrectly.
- Incident Timeline: 2 Hours & 32 minutes - 5/28, 19:24 UTC through 5/28, 21:56 UTC
We understand that customers rely on Metric Alerts as a critical service and apologize for any impact this incident caused.
-Jack