Issues with Network Name Resolution


Following a request to disable RDP for NNR, MS Support states that telemetry data for our MDI deployment shows failure rates of 45% for RDP and 77% for NetBIOS. I do not have any health alerts for low name resolution success at this point, and support did not indicate there was an issue, but they recommend NOT disabling RDP as an option.

 

This does not make sense, as RDP is certainly more restricted in our environment and thus should show more failures. Failures could occur because devices are off/asleep, or not on the network/VPN where rules don't allow specific communication (RDP, for example). In these WFH times, this condition has to be more widespread/commonplace.

 

A few clarification questions would be helpful for troubleshooting this condition, if they aren't off limits due to secret sauce concerns:

  • What is the timing (business hours vs. all hours) and frequency of these NNR requests?
  • Is there an order of preference for the 3 methods from a system-use perspective (not a degree of certainty), or are they all used all the time? If not, are they chosen randomly?
  • How impactful is the DNS service update configuration (a cycle of 1 hour vs. 2, etc.)?
  • Are there any undocumented plans for changes or alternatives/provisions to NetBIOS, RDP, or RPC? Use of MDE or Intune, etc.?
  • Does this process resolve IP to name, name to IP, or both?

Thanks in advance

5 Replies

I should have included that NNR failures could also occur due to DNS inaccuracy (for whatever reason).

@MarshMadness 


Support has a good point, and the data is not lying.


Note, though, that those stats are not saying 45% of observed IPs fail to resolve.
They mean 45% of resolution attempts.
So if you have a small number of IPs doing most of the traffic over time, and they resolve fine, that will increase your success rate in this metric.
The other way around also holds: a few IPs that fail to resolve but do most of the traffic will lower your rate in this metric.
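To make the weighting concrete, here is a toy calculation (made-up numbers, not real telemetry) showing how an attempt-based success rate can diverge sharply from a per-IP view:

```python
# Attempt-weighted vs. per-IP NNR success rate (illustrative numbers only):
# two "chatty" IPs that always resolve can mask many quiet IPs that never do.
attempts = {
    # ip: (resolution_attempts, successes)
    "10.0.0.1": (900, 900),   # chatty server, always resolves
    "10.0.0.2": (800, 800),   # chatty server, always resolves
    **{f"10.0.1.{i}": (10, 0) for i in range(20)},  # 20 quiet IPs, never resolve
}

total_attempts = sum(a for a, _ in attempts.values())
total_success = sum(s for _, s in attempts.values())
attempt_rate = total_success / total_attempts  # what the telemetry reports
per_ip_rate = sum(s > 0 for _, s in attempts.values()) / len(attempts)

print(f"attempt-weighted success: {attempt_rate:.0%}")  # 89%
print(f"per-IP success:           {per_ip_rate:.0%}")   # 9%
```

The same telemetry looks healthy by attempts (89%) and terrible by IPs (9%), which is exactly why per-IP aggregated stats would tell a different story.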

I know, having per-IP aggregated stats would be nice, but it comes with a price we currently try to avoid. Support can also turn on tracing for you for a few hours to give you a significant number of examples that succeed/fail for each method, which might help you understand what is going on in the network. We usually turn those on only for a few hours, as this is extremely verbose and heavy...


What about NTLM RPC? What was the success rate there?
Are NTLM RPC and NetBIOS blocked most of the time? If yes, it could be that RDP is doing better...

- Timing: NNR will take place within seconds of the endpoint initiating contact with the DC, so chances are the endpoint is not yet asleep or disconnected.

- All high-certainty active methods run in parallel.
- DNS is low certainty, so we try to rely on it as little as possible; we do not support relying on DNS-based NNR alone.
- As for alternatives: yes, there are plans. We have methods via other protocols as well; some of them are already actively used by private-preview customers, and once we tune them enough, they will be used for everyone and offer more flexibility.


- We are considering using MDE as well, but this is currently only a theoretical consideration and far from a decision on whether it would eventually be effective; if yes, it will take significant time to implement.
- The process resolves IP to name.
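For what it's worth, a DNS-based IP-to-name resolution is essentially a reverse (PTR) lookup. A minimal Python sketch (my illustration, not MDI's actual code) shows why it is low certainty: a stale PTR record returns a wrong name with no error, and a missing one returns nothing at all:

```python
import socket

def reverse_dns(ip: str):
    """IP-to-name resolution via a reverse (PTR) DNS lookup.

    Illustrative only: stale PTR records silently give wrong answers,
    and missing ones give none, which is why DNS alone is treated as
    low certainty for NNR.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:  # covers socket.herror / socket.gaierror
        return None  # no PTR record, or DNS unreachable
```

The active methods (NetBIOS, RDP, NTLM RPC) instead ask the endpoint itself for its name, which is why they carry higher certainty.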

Thank you so much @Eli Ofek for your response, it is very helpful and greatly appreciated.

I did not mean to come across as doubting the recommendation or the data itself; in fact, I agree with it. I just have a hard time understanding how RDP could have a lower failure rate than NetBIOS.
TCP 135 and UDP 137 are allowed to all but a few of the clients that the sensors/DCs are attempting to resolve. RDP is blocked to anything on VPN or behind another firewall, and is therefore much more restricted. Where this is the case (no NetBIOS, NTLM RPC, or RDP), is there any alternative for attaining higher certainty?


Support did not provide NTLM RPC rates, so I am assuming those are OK, but I have asked for the numbers. I have also asked support to look at turning up trace-level logging. I will update this thread once I know more.

So if I understand you correctly, NNR is a result of a client contacting a DC. That is good to know. Also glad to hear that alternate protocols are being worked on and tuned. Hopefully there is something there that can be used for those types of devices/appliances that cannot or will not return data for NNR.

@Eli Ofek 

Still working with support on getting trace logging to troubleshoot the high NetBIOS failure rate, but I think we have a good portion of it nailed down to VPN clients.

We allow NetBIOS and NTLM RPC outbound through the enterprise firewall and VPN to VPN clients, but we are blocking inbound UDP 137 on the local firewall. I see many drops in those logs, and from a timing perspective they "loosely" correlate with UDP outbound from corporate. Source and destination port in the log are both UDP 137.

Admittedly I am not a protocol expert, but I find a few things odd:

  • the source and destination IP of these drops are both the client IP
  • I see about 25 drops in the client firewall for every inbound packet from the corp DC
  • the timing of the drops varies from 20 to 60 seconds off the corp firewall timestamp (not consistent enough to state affirmatively that it is the time difference between them)

Any thoughts, or is this expected behavior?


Here is a sample event from the local firewall.

date       time     action  protocol  src-ip   dst-ip   src-port  dst-port  size  tcpflags  tcpsyn  tcpack  tcpwin  icmptype  icmpcode  info  path
2/10/2021  1:38:46  DROP    UDP       x.x.x.x  x.x.x.x  137       137       0     -         -       -       -       -         -         -     SEND
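That sample looks like the standard Windows Firewall (pfirewall.log) format: space-separated values in the field order given by the log's header row. A quick way to pull these drops apart for correlation, assuming that field order:

```python
# Field order taken from the pfirewall.log header row.
FIELDS = ["date", "time", "action", "protocol", "src-ip", "dst-ip",
          "src-port", "dst-port", "size", "tcpflags", "tcpsyn", "tcpack",
          "tcpwin", "icmptype", "icmpcode", "info", "path"]

def parse_fw_line(line: str) -> dict:
    """Split one firewall log line into a field-name -> value dict."""
    return dict(zip(FIELDS, line.split()))

entry = parse_fw_line(
    "2/10/2021 1:38:46 DROP UDP x.x.x.x x.x.x.x 137 137 0 - - - - - - - SEND"
)
print(entry["action"], entry["src-port"], entry["dst-port"], entry["path"])
# DROP 137 137 SEND
```

Bucketing the parsed drops by date/time makes it easier to line them up against the corp firewall's outbound entries.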

@MarshMadness 

Not sure what clues this is giving us.

Assuming your DC/sensor machine is on the corp network and the target endpoint is a VPN client, the flow should be:

The VPN client endpoint authenticates to the DC.

In response, the sensor senses this connection and responds back to the VPN client endpoint's IP with a NetBIOS request (this should happen within seconds).

It might try that twice if it doesn't get a response.
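For reference, the NetBIOS request in that flow is a node-status (NBSTAT) query over UDP 137, and such queries are conventionally sent from local port 137 as well, which would explain the log showing 137 as both source and destination port. A rough sketch of the wire format (my illustration, not MDI's code):

```python
import struct

def build_nbstat_query(txn_id: int = 0x1234) -> bytes:
    """Build a NetBIOS node-status (NBSTAT) query datagram.

    Sent over UDP to port 137 at the endpoint's IP; the reply carries
    the machine's NetBIOS name table. Illustrative sketch only.
    """
    # Header: transaction id, flags=0 (standard query), 1 question,
    # 0 answer/authority/additional records.
    header = struct.pack(">HHHHHH", txn_id, 0x0000, 1, 0, 0, 0)
    # Question name: wildcard "*" padded to 16 bytes, then first-level
    # encoded (each nibble + ord('A')), prefixed with length 0x20 and
    # terminated with a zero byte.
    raw_name = b"*" + b"\x00" * 15
    encoded = bytes(
        half for c in raw_name for half in ((c >> 4) + 0x41, (c & 0x0F) + 0x41)
    )
    question = bytes([0x20]) + encoded + b"\x00"
    # Question type NBSTAT (0x0021), class IN (0x0001).
    question += struct.pack(">HH", 0x0021, 0x0001)
    return header + question
```

A query like this that gets dropped by the client's local firewall (as in your log) never produces a reply, and the attempt is counted as a NetBIOS failure.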

I don't see how this correlates with the numbers you mentioned...