DNS resolution issues when attempting to connect to login.microsoftonline.com

I work for an ISP, and recently our users have been having trouble reaching certain Microsoft subdomains. The most recent examples are passwordreset.microsoftonline.com, login.microsoftonline.com, and portal.office.com.

The issues happen fairly regularly but intermittently. We have narrowed it down to an issue at the home router level.

Our DNS servers have both UDP and TCP port 53 open, and all DNS requests sent to them directly are answered properly.

 

However, if the device uses its DHCP-assigned DNS value, which is the home router's default gateway address (usually a 192.168.0.1 type address) acting as a DNS forwarder, the DNS request times out.

 

After running a few packet captures under a few different conditions, here is what I found:

 

In all scenarios, the lookup starts with a DNS query over UDP port 53, which goes through fine, and the computer receives a truncated answer. Instead of displaying the results from the truncated answer, the client sends a second DNS query, this time over TCP port 53.
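To make the "truncated answer" in the captures concrete, here is a minimal Python sketch that builds a plain DNS query and checks the TC (truncation) flag in a response header. The function names `build_query` and `is_truncated` are hypothetical helpers invented for this illustration, not part of any library; the bit positions follow the standard DNS header layout.

```python
import struct

def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query in wire format (hypothetical helper)."""
    # Header: ID, flags (only RD set = 0x0100), QDCOUNT=1, other counts = 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE (1 = A record by default) and QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)

def is_truncated(response: bytes) -> bool:
    """Return True if the TC bit is set in a DNS message header."""
    if len(response) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & 0x0200)  # TC is bit 0x0200 of the flags word
```

Sending `build_query("login.microsoftonline.com")` over UDP to the resolver and passing the reply to `is_truncated` should show whether the client is being told to retry over TCP.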

 

When querying the DNS server directly (DNS server settings set statically), the lookup results are displayed once the DNS server responds to the TCP request.

 

However, when using the router as a forwarder with the default, out-of-the-box DHCP-assigned DNS values (the router acts as the DHCP server and hands out its own gateway address as the DNS server in the DHCP lease options), the router seems to reject the TCP port 53 DNS query. It responds with RST-flagged packets without forwarding the TCP DNS request, leading to a timeout and the inability to load the pages in question.

I understand that the switch from UDP to TCP happens when a DNS response is larger than 512 bytes, and that is why the request is sent again that way, but I do not understand why the truncated answers are not used when the full answer cannot be obtained. The issue occurs on Windows, macOS, and iOS with all sorts of browsers (Firefox, Chrome, Safari, IE, Edge, etc.), as well as when simply running nslookup for those domains from the OS terminal.
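A rough sketch of the client behavior seen in the captures, assuming the stub resolver follows the common rule of discarding a truncated UDP reply and retrying over TCP (which is the guidance in RFC 2181, section 9, as far as I can tell). The `send_udp` and `send_tcp` callables are hypothetical stand-ins for the real transports, so this only models the decision logic, not actual networking.

```python
import struct
from typing import Callable

# A transport takes a DNS query in wire format and returns the raw response.
Transport = Callable[[bytes], bytes]

def resolve_with_fallback(query: bytes, send_udp: Transport, send_tcp: Transport) -> bytes:
    """Try UDP first; if the reply has the TC bit set, retry over TCP."""
    response = send_udp(query)
    (flags,) = struct.unpack("!H", response[2:4])
    if not flags & 0x0200:
        # Complete answer over UDP: use it as-is.
        return response
    # Truncated reply: it is discarded, not used as a partial result.
    # If the TCP attempt fails (for example with a reset), the error
    # propagates and the lookup as a whole fails.
    return send_tcp(query)
```

Under that assumption, a forwarder that resets TCP port 53 turns every oversized answer into a failed lookup, because the truncated UDP reply is never treated as usable on its own.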

I also tested on Android using the Termux app, pointing nslookup at the gateway address. The browsers showed no issues, as it looks like Android uses its own DNS settings and seems to ignore the DHCP-provided DNS configuration (at least on the test device).

 

What can we do to ensure that our customers can access those pages without encountering this issue and without having to fully reconfigure their routers to provide custom DNS settings, which would be pretty backwards and would cause multiple issues whenever a router is reset to factory default?

1 Reply

Just as an FYI for anyone that may be looking into any similar issue, we have figured out exactly what is happening.

As suspected, the routers or residential gateways in question were not listening on TCP port 53 on their default gateway address, which is as good as blocking it.
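For anyone wanting to check their own gateway, a quick probe along these lines should show whether anything is listening on TCP port 53. This is only a sketch: `probe_tcp_port` is a hypothetical helper name, and it tests the TCP handshake only, not whether a DNS query would actually be forwarded.

```python
import socket

def probe_tcp_port(host: str, port: int = 53, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as open, refused, timeout, or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"      # handshake completed: something is listening
    except ConnectionRefusedError:
        return "refused"       # peer answered with RST: nothing is listening
    except socket.timeout:
        return "timeout"       # no reply at all within the timeout
    except OSError:
        return "error"         # other failure, e.g. host unreachable
```

For example, `probe_tcp_port("192.168.0.1")` returning `"refused"` would match the RST behavior described above, where 192.168.0.1 is just the typical gateway address mentioned in the original post.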

It appears that several brands have flawed firmware.

The only options are to either configure custom DNS settings on the router to be served to the clients, or look for a firmware update that fixes this issue. We opened a few tickets with some of our hardware vendors to have them investigate and fix the issue. One of those vendors is working on updating their firmware and will likely push a firmware upgrade globally in the months to come. Luckily, their type of hardware allows this upgrade to be pushed easily to all devices at once.

 

If you are experiencing this issue, look for a firmware upgrade if one exists, open a ticket with the vendor's customer service if you can, customize your router settings to serve a few custom DNS servers, or get a new router if all else fails.