Final Update: Thursday, 16 April 2020 19:42 UTC

We've confirmed that all systems are back to normal with no customer impact as of 4/16, 17:40 UTC. Our logs show the incident started on 4/16, 11:14 UTC, and that during the ~6 hours 20 minutes it took to resolve the issue, some customers experienced alert rule management failures for metric alerts on Redis Cache metrics.
  • Root Cause: The failure was due to one of the backend services using an incorrect configuration.
  • Incident Timeline: 6 hours & 20 minutes - 4/16, 11:14 UTC through 4/16, 17:40 UTC
We understand that customers rely on Metric Alerts as a critical service and apologize for any impact this incident caused.

-Anupama