Final Update: Tuesday, 07 April 2020 21:53 UTC
We've confirmed that all systems are back to normal with no customer impact as of 3/12, 17:10 UTC. Our logs show the incident started on 3/10, 21:00 UTC and that during the 1 day, 20 hours, 10 minutes that it took to resolve the issue we have revised our conclusion that customers were not impacted, instead we have determined that a small percentage of customers experienced intermittently latent, duplicated, or missing availability test results.
- Root Cause: The failure was due to human error during a manual configuration of the service to facilitate service scale-out.
- Lessons Learned: In addition to SDK updates, we are implementing several items. Ingestion code changes to better handle errors in this situation, updating the processes for this procedure in the future, as well as adding new logging for better customer identification and faster detection.
- Incident Timeline: 1 Day, 20 Hours & 10 Minutes - 3/10, 21:00 UTC through 3/12, 17:10 UTC
We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.
-Jeff