We've confirmed that all systems are back to normal with no customer impact as of 3/12, 17:10 UTC. Our logs show the incident started on 3/10, 21:00 UTC and that during the 1 day, 20 hours, 10 minutes that it took to resolve the issue we have revised our conclusion that customers were not impacted, instead we have determined that a small percentage of customers experienced intermittently latent, duplicated, or missing availability test results.
Root Cause: The failure was due to human error during a manual configuration of the service to facilitate service scale-out.
Lessons Learned: In addition to SDK updates, we are implementing several items. Ingestion code changes to better handle errors in this situation, updating the processes for this procedure in the future, as well as adding new logging for better customer identification and faster detection.
Incident Timeline: 1 Day, 20 Hours & 10 Minutes - 3/10, 21:00 UTC through 3/12, 17:10 UTC
We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.
Final Update: Tuesday, 10 March 2020 22:48 UTC
We've confirmed that there was no customer impact from this incident. Thank you for your patience.
Initial Update: Tuesday, 10 March 2020 21:38 UTC
We are aware of issues within Application Insights availability tests and are actively investigating. Some customers may experience availability test failures.
Next Update: Before 03/11 00:00 UTC
We are working hard to resolve this issue and apologize for any inconvenience. -Jeff