We've confirmed that all systems are back to normal, with no customer impact, as of 03/17, 01:45 UTC. Our logs show the incident started on 03/17, 00:42 UTC, and that during the 1 hour and 3 minutes it took to resolve the issue, approximately 50% of customers in the East US region experienced data ingestion latency and possibly delayed or misfired alerts.
Root Cause: The failure was caused by a back-end system reaching its operational threshold. Once the system was restarted and scaled out, the backlog of ingestion data began to drain.
Incident Timeline: 1 hour and 3 minutes - 03/17, 00:42 UTC through 03/17, 01:45 UTC
We understand that customers rely on Azure Log Analytics as a critical service and apologize for any impact this incident caused.
Update: Wednesday, 17 March 2021 01:39 UTC
Root cause has been isolated to a back-end system reaching its limits, which caused ingestion to back up, resulting in latency and possible missed or late-firing alerts. To address this issue we scaled out the back-end service. The ingestion backlog is mostly drained, and ingestion will shortly be working as expected. Some customers may continue to experience data latency and improperly fired alerts. We estimate less than one hour before all data latency and alerting issues are addressed.