Final Update: Monday, 29 July 2019 15:06 UTC
We've confirmed that all systems are back to normal with no customer impact as of 7/29, 13:58 UTC. Our logs show the incident started on 7/29, 07:10 UTC and that during the 7 hours that it took to resolve the issue very small percentage of customers having their data hosted on impacted backend cluster would have experienced data access issues and log alerts might not have worked as expected.
- Root Cause: Initial RCA suggests that there was a combination of recent deployment and very expensive queries hitting the performance of the cluster.
- Lessons Learned: We have collected required telemetry with perf dumps and are currently looking through the same to look for final RCA and avoid recurrence.
- Incident Timeline: 6 Hours & 48 minutes - 7/29, 07:10 UTC through 7/29, 13:58 UTC
We understand that customers rely on Azure Log Analytics as a critical service and apologize for any impact this incident caused.
-Durga