Experiencing Data Gaps issue in Azure Portal for Application Insights - 06/17 - Resolved
Published Jun 17 2019 12:06 PM
Final Update: Monday, 24 June 2019 20:48 UTC

We've confirmed that all systems are back to normal with no customer impact as of 6/18, 03:57 UTC. Our logs show the incident started on 6/17, 03:45 UTC and that during the 24 hours 12 minutes that it took to resolve the issue, 3.5% of customers experienced data gaps, latency, and data access issues.
  • Root Cause: The failure was due to a misconfigured timeout value.
  • Full Mitigation: We removed the impacted instances from rotation in order to mitigate immediate customer impact. We then performed a configuration change to increase timeouts between backend services and monitored service health before returning the instances to rotation. As of 18:57 UTC on 24 Jun 2019, engineers have confirmed full mitigation of this issue.
  • Incident Timeline:
  •     Customer Impact Mitigation: 24 hours 12 minutes - 6/17, 03:45 UTC through 6/18, 03:57 UTC
  •     Full Mitigation: 7 days 15 hours 12 minutes - 6/17, 03:45 UTC through 6/24, 18:57 UTC
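The two durations in the timeline above can be checked from their start and end timestamps. The following is a minimal sketch; the helper name `elapsed` and its output format are illustrative assumptions, not part of any Application Insights tooling.

```python
from datetime import datetime, timezone


def elapsed(start: datetime, end: datetime) -> str:
    """Format the gap between two timestamps as days, hours, and minutes."""
    total_minutes = int((end - start).total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days}d {hours}h {minutes}m"


# Timestamps taken from the incident timeline (all UTC).
start = datetime(2019, 6, 17, 3, 45, tzinfo=timezone.utc)
impact_end = datetime(2019, 6, 18, 3, 57, tzinfo=timezone.utc)
full_end = datetime(2019, 6, 24, 18, 57, tzinfo=timezone.utc)

print(elapsed(start, impact_end))  # 1d 0h 12m, i.e. 24 hours 12 minutes
print(elapsed(start, full_end))    # 7d 15h 12m
```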
We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.

-Jeff

Update: Monday, 24 June 2019 03:48 UTC
Teams are monitoring the performance of the configuration change to ensure no further customer impact will occur.
  • Next Update: Before 06/25 04:00 UTC
-Jeff

Update: Friday, 21 June 2019 18:26 UTC

A configuration change has been applied to all regions and teams are monitoring the result to ensure no customer impact. There is still no customer impact as of this time.
  • Next Update: Before 06/22 18:30 UTC
-Jeff

Update: Thursday, 20 June 2019 21:24 UTC

Teams are investigating another root cause and are actively working to validate that a configuration change will completely mitigate the problem.
  • Next Update: Before 06/21 15:30 UTC
-Jeff

Update: Thursday, 20 June 2019 03:50 UTC

Teams for the dependent services are continuing to look for the root cause.

-Jeff

Update: Wednesday, 19 June 2019 01:57 UTC

We have concluded that the links between services are not the issue, and teams for the dependent services are investigating the root cause further.
  • Next Update: Before 06/19 20:00 UTC
-Jeff

Update: Tuesday, 18 June 2019 19:36 UTC

The root cause is being investigated further as a communication issue with the dependent service, which is impacting Application Insights. We are investigating the links between services.

  • Next Update: Before 06/19 00:00 UTC

-Jeff

Update: Tuesday, 18 June 2019 12:30 UTC

Root cause has been isolated to one of our dependent services, which was impacting Application Insights. Some customers may experience data gaps, and we estimate 6 hours before the issue is completely resolved.
  • Work Around: None
  • Next Update: Before 06/18 18:30 UTC
-Monish

Update: Tuesday, 18 June 2019 00:39 UTC

We continue to investigate issues within Application Insights. Root cause is not fully understood at this time. Some customers in Central US and East US continue to experience data gaps. Initial findings indicate that the problem began at 06/17, 18:40 UTC. We currently have no estimate for resolution.
  • Work Around: None
  • Next Update: Before 06/18 13:00 UTC
-Jeff Miller

Final Update: Monday, 17 June 2019 22:17 UTC

We've confirmed that all systems are back to normal with no customer impact as of 6/17, 21:30 UTC. Our logs show the incident started on 06/17, 18:40 UTC and that during the ~ 3 hours that it took to resolve the issue, customers in East US and Central US experienced data gaps while accessing application data.
  • Root Cause: The failure occurred because some instances of a backend service experienced increased latency, causing a subset of requests to time out.
  • Lessons Learned: Engineers will continue to investigate to establish the cause of the increased latency and prevent future occurrences.
  • Incident Timeline: 2 hours 50 min - 06/17, 18:40 UTC through 06/17, 21:30 UTC
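The root cause above describes slow backend instances pushing only a subset of requests past a timeout. The toy sketch below illustrates that effect. The latency values, the 5-second timeout, and the function name `timed_out_fraction` are all hypothetical and are not taken from the incident.

```python
def timed_out_fraction(latencies_s, timeout_s):
    """Return the share of requests whose latency exceeds the timeout."""
    if not latencies_s:
        return 0.0
    timed_out = sum(1 for latency in latencies_s if latency > timeout_s)
    return timed_out / len(latencies_s)


# Hypothetical request latencies in seconds: most instances respond
# quickly, while a few degraded instances respond slowly.
healthy = [0.2, 0.3, 0.25, 0.4, 0.35, 0.3, 0.2, 0.45]
degraded = [6.0, 7.5]
timeout_s = 5.0  # hypothetical timeout between services

print(timed_out_fraction(healthy + degraded, timeout_s))  # 0.2
```

In this toy model, only the requests served by the degraded instances fail. Taking those instances out of the pool, or raising `timeout_s` above their latency, brings the fraction back to 0.0.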
We understand that customers rely on Application Insights as a critical service and apologize for any impact this incident caused.

-Anupama

Update: Monday, 17 June 2019 21:00 UTC

We continue to investigate issues within Application Insights. Engineers are working on determining the root cause. Some customers in Central US and East US continue to experience data gaps. Initial investigation indicates that the issue may have been caused by high resource utilization in one of the backend services. Initial findings indicate that the problem began at 06/17, 18:40 UTC.
  • Work Around: None
  • Next Update: Before 06/17 23:00 UTC
-Anupama

Initial Update: Monday, 17 June 2019 18:52 UTC

We are aware of issues within Application Insights and are actively investigating. Some customers in East US may experience data gaps while accessing data in the Azure portal.
  • Work Around: None
  • Next Update: Before 06/17 21:30 UTC
We are working hard to resolve this issue and apologize for any inconvenience.
-Anupama

Last update: Jun 24 2019 02:28 PM