External monitoring shows outage in multiple regions & service types. Azure shows no outage.

I'm using a service called Monitis to monitor the uptime of some of my web-based resources. It pings each service from three geographic locations (West US, East US, and Mid US) and raises an alert if two or more of them see ping times of more than 10 seconds for an extended period of time.
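
To make the alert rule concrete, here is roughly how I understand it. This is only a sketch of my reading of the rule, not Monitis code: the function names are hypothetical, and the number of consecutive slow checks that counts as "an extended period" is my assumption.

```python
# Sketch of the alert rule as I understand it (not Monitis's implementation).
THRESHOLD_SECONDS = 10.0    # a ping slower than this counts as a failure
MIN_FAILING_LOCATIONS = 2   # "two or more" of the three locations
SUSTAINED_CHECKS = 3        # ASSUMPTION: consecutive slow checks that count as "extended"


def location_failing(ping_times, threshold=THRESHOLD_SECONDS, sustained=SUSTAINED_CHECKS):
    """Return True if the last `sustained` pings from one location all failed.

    `ping_times` is a list of response times in seconds, oldest first.
    None stands for a ping that never came back.
    """
    recent = ping_times[-sustained:]
    if len(recent) < sustained:
        return False  # not enough history yet to call it sustained
    return all(t is None or t > threshold for t in recent)


def should_alert(samples):
    """`samples` maps a location name to its list of ping times."""
    failing = [loc for loc, times in samples.items() if location_failing(times)]
    return len(failing) >= MIN_FAILING_LOCATIONS


if __name__ == "__main__":
    samples = {
        "West US": [0.3, 12.0, None, 15.0],
        "East US": [0.2, 11.0, 11.5, None],
        "Mid US": [0.2, 0.3, 0.2, 0.4],
    }
    print(should_alert(samples))
```

In my case all three locations reported failures at once, so the rule would have fired regardless of the exact thresholds.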

 

On Saturday, three of my resources, all based in Azure, registered an 18-minute outage from all three ping locations at the same time:

[Image: Outages.png, Monitis outage chart for the three resources]

(The times above are in the Japan time zone. This equates to 4:10-4:28 a.m. Pacific time on Oct. 21.)
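
For anyone lining this up against Azure logs, which are usually shown in UTC, the conversion works out as in the sketch below. It uses fixed offsets (Pacific Daylight Time is UTC-7 on Oct. 21, Japan is UTC+9). The year is a placeholder I filled in only so the dates can be constructed; it does not affect the offsets.

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific Daylight Time, in effect on Oct. 21
JST = timezone(timedelta(hours=9), "JST")   # Japan Standard Time, no daylight saving
YEAR = 2017  # ASSUMPTION: placeholder year, only needed to build the datetime objects

start_pdt = datetime(YEAR, 10, 21, 4, 10, tzinfo=PDT)
end_pdt = datetime(YEAR, 10, 21, 4, 28, tzinfo=PDT)

start_utc = start_pdt.astimezone(timezone.utc)
end_utc = end_pdt.astimezone(timezone.utc)
start_jst = start_pdt.astimezone(JST)
end_jst = end_pdt.astimezone(JST)
duration_minutes = (end_pdt - start_pdt) // timedelta(minutes=1)

fmt = "%b %d %H:%M %Z"
print("Pacific:", start_pdt.strftime(fmt), "to", end_pdt.strftime(fmt))
print("UTC:    ", start_utc.strftime(fmt), "to", end_utc.strftime(fmt))
print("Japan:  ", start_jst.strftime(fmt), "to", end_jst.strftime(fmt))
print("Duration:", duration_minutes, "minutes")
```

So the window to look for is 11:10-11:28 UTC, or 20:10-20:28 in Japan, on Oct. 21.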

 

Of these,

[green] is the hostname for two identical web apps, one in West US and one in East US, balanced using Traffic Manager. The error in Monitis includes the IP address of the East US app, so it seems that the hostname was resolving to the East US app when Monitis tried to ping it.

[purple] is a web app in North Central US scaled out to two S1 instances.

[blue] is a VM in East US.

 

I've checked the monitoring charts within Azure for the two web apps, and neither shows any downtime during that period. Both show requests coming in and responses going out, with no instance restarts. [green] shows a slight rise in activity, but nothing out of the ordinary.

 

The VM reports that it has been up since September, and its System event log shows nothing unusual during this period.

 

All three of these resources are unrelated to each other and have no interdependencies.

 

My questions:

1. Is there any way to find out what happened here? As stated above, Azure indicates no interruption in activity, but it very much seems that there was an interruption.

2. Why would Monitis show an 18-minute outage on multiple types of services in multiple Azure regions? If there was an interruption in Azure's network infrastructure during that time, there's no sign of it in the Azure status history. It's also strange that the web apps both seem to report receiving and serving requests during the supposed outage.

3. The service marked in [green] is set up in Traffic Manager with an identical service in West US, so presumably Monitis should have been redirected to the West US service when the East US service became inaccessible, but it seems like this didn't happen. Can you think of why this didn't work? It would make sense if Azure thought that the service was healthy the whole time, but how can I handle a situation where one region becomes inaccessible if Traffic Manager doesn't redirect the traffic?
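
In case it helps with diagnosis, this is the kind of check I could run from a machine outside Azure if it happens again. It is only a sketch: the hostname, URL, and IP addresses are placeholders rather than my real values, and it simply records which addresses the hostname resolves to and how long an HTTP request takes, using the same 10-second limit as the Monitis rule.

```python
import socket
import time
import urllib.error
import urllib.request


def resolve_ips(hostname, port=443):
    """Return the sorted set of IP addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def which_endpoint(ips, known_endpoints):
    """Map resolved IPs back to region names; `known_endpoints` is {region: ip}."""
    return sorted(region for region, ip in known_endpoints.items() if ip in ips)


def timed_get(url, timeout=10.0):
    """Return (status, elapsed_seconds, error). Status is None if no HTTP response arrived."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, time.monotonic() - start, None
    except urllib.error.HTTPError as exc:
        # The server answered, just with a 4xx/5xx status.
        return exc.code, time.monotonic() - start, None
    except Exception as exc:
        # DNS failure, refused connection, timeout, and so on.
        return None, time.monotonic() - start, repr(exc)


if __name__ == "__main__":
    HOSTNAME = "www.example.com"  # PLACEHOLDER for the [green] hostname
    KNOWN = {"East US": "203.0.113.10", "West US": "198.51.100.20"}  # PLACEHOLDER IPs
    ips = resolve_ips(HOSTNAME)
    print("resolved to:", ips, "regions:", which_endpoint(ips, KNOWN))
    print("request result:", timed_get(f"https://{HOSTNAME}/"))
```

Logging the resolved address next to each timing would at least show whether the hostname kept pointing at East US for the whole window.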

 

Thank you for any insight or help you can give.
