Howdy everyone! I'm back to talk about one of my favorite causes of heartache, the domain name system (DNS). This will be our first foray into an application layer protocol. The concept of DNS is simple enough, but it can lead to some confusing situations if you don't keep its function in mind. No time to waste, let's get going!
What's the deal with DNS?
The core of DNS is that it works like a key-value store. You ask a DNS server for information with a key, and it provides the value. The lookups can be things like:
- Computer names to IPv4 addresses (A records)
- Computer names to IPv6 addresses (AAAA, often called Quad A records)
- Who is authoritative for a DNS zone (Nameserver or NS records)
- IP addresses to computer names (PTR or pointer records)
- Or even just to some text! (TXT records)
This is by no means an exhaustive list of record types. In this post, we will be focusing on Host A records, as those are the bread and butter of DNS issues.
The general idea is that a DNS server will act as a phone book. I have so-and-so’s name, but I want their address or their phone number, and through DNS you can get that information. And the best part is that a DNS server must return a response. That response may be "Hey I don't know" but it is still a response.
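To make the phone book idea concrete, here is a minimal sketch of a lookup keyed on name and record type. Everything in it is made up for illustration (the `RECORDS` table, the `lookup` function, and the addresses); a real DNS server is far more involved. The point is only that every question gets a response, even when the answer is "I don't know".

```python
# Toy "phone book": keys are (name, record type), values are lists of record data.
# All names and addresses here are hypothetical lab values.
RECORDS = {
    ("member.child.contoso.com.", "A"): ["192.168.1.50"],
    ("member.child.contoso.com.", "AAAA"): ["fd00::50"],
    ("contoso.com.", "NS"): ["ns1.contoso.com."],
    ("contoso.com.", "TXT"): ["hello from the lab"],
}


def lookup(name, rtype):
    """Always return a response, even when there is nothing to return."""
    known_names = {record_name for record_name, _ in RECORDS}
    if name not in known_names:
        # The name does not exist at all.
        return {"status": "NXDOMAIN", "answers": []}
    # The name exists; it may or may not have a record of the requested type.
    return {"status": "NOERROR", "answers": RECORDS.get((name, rtype), [])}


print(lookup("member.child.contoso.com.", "A"))
print(lookup("nope.contoso.com.", "A"))
```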
I want to take a second to dig (see what I did there?) a bit deeper into the Host A record as that is the primary record type we will be talking about in this post.
A Host A record maps a key, the Fully Qualified Domain Name (FQDN) of a resource, to an IPv4 address. The FQDN follows the format <computer name>.<child domain>.<parent domain>, finally terminating with a ‘.’ (the DNS root), which is usually left off when typing the name. So, for example, a machine in my lab may be: member.child.contoso.com
Where:
- The computer name is member
- The child domain is child
- The parent domain is contoso.com
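The breakdown above can be sketched in a few lines. This assumes the three-part layout used in this example (computer name, one child domain, then the parent domain); DNS itself only sees a sequence of labels, so the position-based naming below is my own convention, not something the protocol defines.

```python
def split_fqdn(fqdn):
    """Split a fully qualified name into labels, dropping the trailing root dot."""
    return fqdn.rstrip(".").split(".")


labels = split_fqdn("member.child.contoso.com.")

# Assumption: first label is the computer, second is the child domain,
# and everything that remains is the parent domain.
computer_name = labels[0]
child_domain = labels[1]
parent_domain = ".".join(labels[2:])

print(computer_name, child_domain, parent_domain)
```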
You could use something like Windows Internet Name Service (WINS) (please don't use WINS) or NetBIOS name resolution to resolve the name using just a single label (for example, member), but let's stick with DNS for now.
How to use DNS
DNS operates in a query-response format. A client sends out a DNS query for a specific record and the DNS server responds to that query. One of the large benefits of DNS is recursion: when a DNS server receives a query for a name outside the zones it is authoritative for, it can query other DNS servers on the client's behalf and return the final answer. It is important to note that the DNS server MUST always respond to the DNS query, even if it just responds with "I dunno, go ask someone else". This traffic typically takes place over UDP port 53, with TCP port 53 used when a response is too large for UDP or for zone transfers.
Quick UDP side bar:
We have talked about TCP previously, but we haven't talked about the User Datagram Protocol (UDP). Like TCP, it uses a source and destination port pair, but UDP is an unreliable protocol: if a UDP packet is dropped, UDP itself makes no attempt to retransmit it. This lack of protocol overhead makes UDP fast, but it leaves all reliability to the upper-layer protocols.
Here is an example of an outbound DNS query for bing.com.
Internet Protocol Version 4, Src: 192.168.7.95, Dst: 192.168.1.254
User Datagram Protocol, Src Port: 36187, Dst Port: 53
Domain Name System (query)
    Transaction ID: 0x7523
    Flags: 0x0100 Standard query
    Questions: 1
    Answer RRs: 0
    Authority RRs: 0
    Additional RRs: 0
    Queries
        bing.com: type A, class IN
            Name: bing.com
            [Name Length: 8]
            [Label Count: 2]
            Type: A (Host Address) (1)
            Class: IN (0x0001)
    [Response In: 31]
Starting from top to bottom within the DNS section:
- Transaction ID: A semi-unique identifier for the DNS query
    - The numbers are often reused but not in quick succession.
- Flags: Additional options for the DNS query
    - This includes things like do we want recursion, what the operation is, etc...
- Questions: The number of queries we are performing in this request
- Queries: This contains the specifics of the Host A record we are looking for:
    - Name: The resource we are looking for
    - Type: The record type we are looking for
    - Class: This will nearly always be IN for internet
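If it helps to see those fields as bytes, here is a rough sketch that builds the same query by hand. The function names (`encode_name`, `build_query`) are my own, and this only covers the simple single-question case shown in the capture. The header is six 16-bit fields, followed by the name as length-prefixed labels and then the type and class.

```python
import struct


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending with a zero byte."""
    encoded = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        encoded += bytes([len(raw)]) + raw
    return encoded + b"\x00"


def build_query(transaction_id, name, qtype=1, qclass=1, flags=0x0100):
    """Build a DNS query: a 12-byte header followed by a single question.

    qtype=1 is A, qclass=1 is IN, flags=0x0100 sets "recursion desired".
    """
    # Header fields: ID, flags, questions, answer RRs, authority RRs, additional RRs.
    header = struct.pack("!HHHHHH", transaction_id, flags, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", qtype, qclass)
    return header + question


packet = build_query(0x7523, "bing.com")
print(len(packet), packet.hex())
```

This only builds the DNS payload; sending it would still need a UDP socket pointed at port 53 of a DNS server.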
And the response follows much of the same format:
Internet Protocol Version 4, Src: 192.168.1.254, Dst: 192.168.7.95
User Datagram Protocol, Src Port: 53, Dst Port: 36187
Domain Name System (response)
    Transaction ID: 0x7523
    Flags: 0x8180 Standard query response, No error
    Questions: 1
    Answer RRs: 2
    Authority RRs: 8
    Additional RRs: 17
    Queries
        bing.com: type A, class IN
    Answers
        bing.com: type A, class IN, addr 13.107.21.200
            Name: bing.com
            Type: A (Host Address) (1)
            Class: IN (0x0001)
            Time to live: 429 (7 minutes, 9 seconds)
            Data length: 4
            Address: 13.107.21.200
        bing.com: type A, class IN, addr 204.79.197.200
    Authoritative nameservers
    Additional records
    [Request In: 30]
    [Time: 0.058221000 seconds]
Notice how the transaction ID is the same between the query and the response. This means that if we wanted to easily follow a DNS query in a packet capture, we could filter using that as our identifier (for example, the Wireshark display filter dns.id == 0x7523).
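The Flags value is worth a quick look too, since it is where the query and the response differ. Below is a minimal sketch that decodes it using the bit layout from the DNS specification (RFC 1035). The function name and the dictionary keys are my own choices.

```python
def decode_flags(flags):
    """Break the 16-bit DNS flags field into its individual parts."""
    return {
        "is_response": bool(flags & 0x8000),        # QR: 0 = query, 1 = response
        "opcode": (flags >> 11) & 0xF,              # 0 = standard query
        "authoritative": bool(flags & 0x0400),      # AA: server owns the zone
        "truncated": bool(flags & 0x0200),          # TC: answer did not fit
        "recursion_desired": bool(flags & 0x0100),  # RD: set by the client
        "recursion_available": bool(flags & 0x0080),  # RA: set by the server
        "rcode": flags & 0xF,                       # 0 = No error
    }


query_flags = decode_flags(0x0100)     # from the query above
response_flags = decode_flags(0x8180)  # from the response above

print(query_flags)
print(response_flags)
```

So the query asks for recursion, and the response confirms it is a response, that recursion is available, and that there was no error.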
One of the key things to call out within the Answers section is the Time to Live (TTL). Not to be confused with the IP TTL, which limits the number of hops a packet can take before it is discarded, the DNS TTL is the maximum amount of time that a DNS record should be cached for. This means that if another query for the same resource occurs within that TTL, the client OS should use its DNS cache instead of sending the query out onto the network. After the TTL has elapsed or the DNS cache is cleared out, the client will need to query the network for that record.
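Here is a small sketch of that caching behavior. It is not how the Windows DNS client is implemented; the `DnsCache` class and its methods are hypothetical, and the clock is injectable purely so the example can fast-forward time.

```python
import time


class DnsCache:
    """Toy TTL cache: entries are served until their TTL elapses or the cache is cleared."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # (name, record type) -> (expiry time, addresses)

    def put(self, name, rtype, addresses, ttl):
        self._entries[(name, rtype)] = (self._clock() + ttl, list(addresses))

    def get(self, name, rtype):
        entry = self._entries.get((name, rtype))
        if entry is None:
            return None  # cache miss: the client must query the network
        expires_at, addresses = entry
        if self._clock() >= expires_at:
            del self._entries[(name, rtype)]
            return None  # TTL elapsed: treat as a miss
        return addresses

    def clear(self):
        self._entries.clear()


# Fake clock so we can step through time without waiting.
now = [0.0]
cache = DnsCache(clock=lambda: now[0])
cache.put("bing.com", "A", ["13.107.21.200", "204.79.197.200"], ttl=429)

now[0] = 428.0
within_ttl = cache.get("bing.com", "A")  # still cached
now[0] = 429.0
after_ttl = cache.get("bing.com", "A")   # expired

print(within_ttl, after_ttl)
```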
Don't use nslookup
If you've done any DNS work in the past, you may have leveraged the tool nslookup. While this tool does perform DNS queries, it is not representative of how Windows resolves DNS queries.
nslookup is a self-contained executable that does not leverage the Windows DNS client resolver. Its behavior doesn't match the OS.
Don't use it.
If you would like to perform DNS queries from the command line, I recommend using the PowerShell cmdlet Resolve-DnsName, which does use the native Windows DNS Client resolver.
Make sense? Let's jump into some captures.
Whose server is it anyway?
One of your colleagues comes to you in a panic and says "The website is down! I can't access it! Help me!!!!!". Being a network analysis pro, you jump into action and start by asking the following questions:
- Q: When did this first start?
- A: 10 minutes ago!
- Q: What happened?
- A: We applied Windows Updates. The Windows updates must have broken the server!
- Q: Is it just one server or all servers?
- A: It's just the one web server that we can't access
As the rockstar you are, you collect a two-sided network trace during a reproduction of the issue.
Starting with the client, you can clearly see the following:
1024 21:30:10.279319 172.10.1.10 192.168.1.16 TCP 66 59468 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
1140 21:30:11.279540 172.10.1.10 192.168.1.16 TCP 66 [TCP Retransmission] [TCP Port numbers reused] 59468 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
1391 21:30:13.285009 172.10.1.10 192.168.1.16 TCP 66 [TCP Retransmission] [TCP Port numbers reused] 59468 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
1798 21:30:17.285191 172.10.1.10 192.168.1.16 TCP 66 [TCP Retransmission] [TCP Port numbers reused] 59468 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
2898 21:30:25.296353 172.10.1.10 192.168.1.16 TCP 66 [TCP Retransmission] [TCP Port numbers reused] 59468 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
Alright, we don't complete the TCP handshake, so let's check the server side... But the server-side trace doesn't show any of these SYN packets arriving.
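As a quick aside, the timestamps in the client trace show the classic pattern of an unanswered SYN: each retry waits about twice as long as the last. A small sketch that pulls the gaps out of the timestamps from the capture (the variable names are mine):

```python
from datetime import datetime

# Timestamps of the five SYN packets in the client-side trace.
syn_times = [
    "21:30:10.279319",
    "21:30:11.279540",
    "21:30:13.285009",
    "21:30:17.285191",
    "21:30:25.296353",
]

parsed = [datetime.strptime(stamp, "%H:%M:%S.%f") for stamp in syn_times]

# Gap between each SYN and the next, rounded to whole seconds.
gaps = [round((later - earlier).total_seconds()) for earlier, later in zip(parsed, parsed[1:])]
print(gaps)
```

The gaps come out as 1, 2, 4, and 8 seconds, which is what you would expect when nothing on the other end ever answers.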
That's weird. Packets must have been dropped. But as you are getting ready to do more testing, your colleague says:
"Oh, we restarted the client and it's fixed now. We are good. But the boss is gonna want root cause analysis (RCA). Can you figure it out?"
So, something happened and now it is working? Let's look at a working capture to see what we can see. You restart the client and get a new data set during a successful scenario. And you see the following:
82 18:29:41.806767 172.10.1.10 192.168.1.10 DNS 75 Standard query 0xd776 A iis.contoso.com
83 18:29:41.807743 192.168.1.10 172.10.1.10 DNS 91 Standard query response 0xd776 A iis.contoso.com A 192.168.1.17
94 18:29:42.108268 172.10.1.10 192.168.1.17 TCP 66 49348 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
95 18:29:42.110052 192.168.1.17 172.10.1.10 TCP 66 80 → 49348 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=1460 WS=256 SACK_PERM
96 18:29:42.110098 172.10.1.10 192.168.1.17 TCP 54 49348 → 80 [ACK] Seq=1 Ack=1 Win=262656 Len=0
97 18:29:42.116019 172.10.1.10 192.168.1.17 HTTP 605 GET / HTTP/1.1
98 18:29:42.127368 192.168.1.17 172.10.1.10 TCP 54 80 → 49348 [ACK] Seq=1 Ack=552 Win=2097408 Len=0
100 18:29:42.159884 172.10.1.10 192.168.1.17 TCP 66 64385 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
101 18:29:42.160830 192.168.1.17 172.10.1.10 TCP 66 80 → 64385 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=1460 WS=256 SACK_PERM
102 18:29:42.160869 172.10.1.10 192.168.1.17 TCP 54 64385 → 80 [ACK] Seq=1 Ack=1 Win=262656 Len=0
Wait a minute, what is the IP address in this capture? 192.168.1.17. And what is the IP address in the previous capture? 192.168.1.16. Why are they different? We can clearly see DNS returning 192.168.1.17 in frame 83. And why would a restart matter? What gives!
Let's take a closer look at the response in frame 83.
Domain Name System (response)
    Transaction ID: 0xd776
    Flags: 0x8580 Standard query response, No error
    Questions: 1
    Answer RRs: 1
    Authority RRs: 0
    Additional RRs: 0
    Queries
        iis.contoso.com: type A, class IN
            Name: iis.contoso.com
            [Name Length: 15]
            [Label Count: 3]
            Type: A (Host Address) (1)
            Class: IN (0x0001)
    Answers
        iis.contoso.com: type A, class IN, addr 192.168.1.17
            Name: iis.contoso.com
            Type: A (Host Address) (1)
            Class: IN (0x0001)
            Time to live: 1200 (20 minutes)
            Data length: 4
            Address: 192.168.1.17
    [Request In: 82]
    [Time: 0.000976000 seconds]
Okay, so we can see the request is for iis.contoso.com and we got a response to that Host A query with 192.168.1.17. But let's think a bit more about what the answer is telling us.
iis.contoso.com: type A, class IN, addr 192.168.1.17
    Name: iis.contoso.com            <----- We are answering the question for this name
    Type: A (Host Address) (1)       <----- The response type is for a Host A (IPv4) record
    Class: IN (0x0001)               <----- It is of class internet
    Time to live: 1200 (20 minutes)  <----- The Time to live (TTL) is 20 minutes
    Data length: 4
    Address: 192.168.1.17
TTL, didn't we talk about that earlier? We keep this DNS response in the DNS cache for 20 minutes or until it is cleared out. And it just so happens that a restart clears out the DNS cache. With that in mind you reach out to the owner of the iis.contoso.com server and ask:
- Q: "Did this server's IP address change recently?"
- A: "Yep, but it updated in DNS so it's all OK. What does that have to do with the outage earlier today?"
For even further proof you come up with the following plan:
- Change the server IP address
- Try and connect to the server
- View the DNS cache
- Clear the DNS cache with the PowerShell cmdlet Clear-DnsClientCache
- Try and connect to the server
- View the DNS cache
This should definitively tell us whether this is just an issue of the record existing in the DNS cache.
And here we go:
38 19:02:24.189559 172.10.1.10 192.168.1.10 DNS 75 Standard query 0xe883 A iis.contoso.com
39 19:02:24.191659 192.168.1.10 172.10.1.10 DNS 91 Standard query response 0xe883 A iis.contoso.com A 192.168.1.17
40 19:02:24.199492 172.10.1.10 192.168.1.17 TCP 66 64505 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
41 19:02:24.200611 192.168.1.17 172.10.1.10 TCP 66 80 → 64505 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=1460 WS=256 SACK_PERM
42 19:02:24.200646 172.10.1.10 192.168.1.17 TCP 54 64505 → 80 [ACK] Seq=1 Ack=1 Win=262656 Len=0
43 19:02:24.202234 172.10.1.10 192.168.1.17 TCP 54 64505 → 80 [FIN, ACK] Seq=1 Ack=1 Win=262656 Len=0
44 19:02:24.203221 192.168.1.17 172.10.1.10 TCP 54 80 → 64505 [RST, ACK] Seq=1 Ack=2 Win=0 Len=0
And we can see the following in the DNS cache:
PS C:\> Get-DnsClientCache -Name iis.contoso.com
Entry                     RecordName                Record Status    Section TimeTo Data   Data
                                                    Type                     Live   Length
-----                     ----------                ------ ------    ------- ------ ------ ----
iis.contoso.com           iis.contoso.com           A      Success   Answer    1176      4 192.168.1.17
And at this point the IP address of iis.contoso.com has been changed to 192.168.1.20. And testing again without clearing the DNS cache:
449 19:02:50.305599 172.10.1.10 192.168.1.17 TCP 66 64507 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
462 19:02:51.315651 172.10.1.10 192.168.1.17 TCP 66 [TCP Retransmission] [TCP Port numbers reused] 64507 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
494 19:02:53.331099 172.10.1.10 192.168.1.17 TCP 66 [TCP Retransmission] [TCP Port numbers reused] 64507 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
565 19:02:57.331237 172.10.1.10 192.168.1.17 TCP 66 [TCP Retransmission] [TCP Port numbers reused] 64507 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
675 19:03:05.346759 172.10.1.10 192.168.1.17 TCP 66 [TCP Retransmission] [TCP Port numbers reused] 64507 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
We re-use 192.168.1.17 which makes sense since it is still in our DNS cache. We then clear the DNS cache by running Clear-DnsClientCache.
PS C:\Users\Administrator> Clear-DnsClientCache
And test again:
1017 19:03:26.376168 172.10.1.10 192.168.1.10 DNS 75 Standard query 0xcb93 A iis.contoso.com
1018 19:03:26.377355 192.168.1.10 172.10.1.10 DNS 91 Standard query response 0xcb93 A iis.contoso.com A 192.168.1.20
1024 19:03:26.472606 172.10.1.10 192.168.1.20 TCP 66 64508 → 80 [SYN] Seq=0 Win=64240 Len=0 MSS=1460 WS=256 SACK_PERM
1025 19:03:26.473927 192.168.1.20 172.10.1.10 TCP 66 80 → 64508 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=1460 WS=256 SACK_PERM
1026 19:03:26.473981 172.10.1.10 192.168.1.20 TCP 54 64508 → 80 [ACK] Seq=1 Ack=1 Win=262656 Len=0
And this is successful. Looking in the DNS cache, we see the correct IP address.
PS C:\> Get-DnsClientCache -Name iis.contoso.com

Entry                     RecordName                Record Status    Section TimeTo Data   Data
                                                    Type                     Live   Length
-----                     ----------                ------ ------    ------- ------ ------ ----
iis.contoso.com           iis.contoso.com           A      Success   Answer    1196      4 192.168.1.20
Nice! We have our RCA: the IP address and DNS record changed, but the client's DNS cache still contained the old IP address. The restart resolved it by clearing the DNS cache, but the cache could just as easily have been cleared by using Clear-DnsClientCache.
Caveat:
The command above will clear the Windows DNS cache, but many web browsers implement their own DNS cache.
Examples:
- Chromium browsers: edge://net-internals/#dns (Edge) or chrome://net-internals/#dns (Chrome)
- Firefox: about:networking#dns
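To tie the timeline of the test together, here is the arithmetic behind "it is still in our DNS cache". This is a sketch that assumes the cache entry was created the moment the DNS response in frame 39 arrived; the timestamps and the 1200-second TTL come from the captures above.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S.%f"

cached_at = datetime.strptime("19:02:24.191659", FMT)  # frame 39: DNS response arrives
ttl = timedelta(seconds=1200)                          # TTL from the answer (20 minutes)
retry_at = datetime.strptime("19:02:50.305599", FMT)   # frame 449: SYN to the old address

expires_at = cached_at + ttl
remaining = (expires_at - retry_at).total_seconds()
still_cached = retry_at < expires_at

print(expires_at.strftime("%H:%M:%S"), round(remaining), still_cached)
```

The entry would not have expired until roughly 19:22:24, so the retry 26 seconds later had no reason to ask DNS again. Without the restart or Clear-DnsClientCache, the client would have kept using the old address for close to 20 minutes.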
Round and round we go
It's a beautiful day and one of the data analysts you work with comes up to you and says "Help! The analytics site is broken sometimes!"
You start with some brilliant questions:
- Q: What do you mean by broken?
- A: I can't access the site
- Q: What do you mean by sometimes?
- A: Well, some of the time we can't access the website but after a while it works. And once it is working it tends to stay working.
- Q: When did this start happening?
- A: Dunno. Sometime after we updated our web servers
Now, intermittent issues aren't really something we have talked about before, so we want to adjust our data collection a bit. To make sure that we really understand an intermittent scenario, I always recommend collecting data in the working and non-working scenario.
Starting from a simple troubleshooting step, let's try to ping the machine.
PS C:\> ping iis.contoso.com

Pinging iis.contoso.com [192.168.1.125] with 32 bytes of data:
Reply from 192.168.1.125: bytes=32 time=3ms TTL=128
Reply from 192.168.1.125: bytes=32 time=4ms TTL=128
Reply from 192.168.1.125: bytes=32 time=4ms TTL=128
Reply from 192.168.1.125: bytes=32 time=4ms TTL=128

Ping statistics for 192.168.1.125:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 3ms, Maximum = 4ms, Average = 3ms
Pretty straightforward. How about the non-working?
Pinging iis.contoso.com [192.168.1.5] with 32 bytes of data:
Reply from 192.168.1.112: Destination host unreachable.
Reply from 192.168.1.112: Destination host unreachable.
Reply from 192.168.1.112: Destination host unreachable.
Reply from 192.168.1.112: Destination host unreachable.
This is interesting. Let's break down what we can see here.
- We are talking to 192.168.1.5 instead of 192.168.1.125
- We have destination host unreachable
The "Destination host unreachable" replies come from 192.168.1.112, which is our own address, so it is the client itself reporting that it could not reach 192.168.1.5. Since we are in the same subnet as the endpoint, no router is involved: the client tries to resolve the IP address to a MAC address using the Address Resolution Protocol (ARP), and nothing answered for 192.168.1.5.
Now, there could be a problem with ARP itself, but the most obvious difference is that the IP address is different between the working and non-working scenarios.
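The "same subnet" reasoning is easy to check for yourself. The sketch below assumes the client uses a /24 mask, which the captures do not actually show, and the 10.0.0.1 address is a made-up off-link example. An on-link destination is resolved with ARP directly, while anything else is handed to the default gateway.

```python
import ipaddress

# Assumption: the client (192.168.1.112) has a /24 subnet mask.
client = ipaddress.ip_interface("192.168.1.112/24")


def is_on_link(target):
    """True when the target sits in the client's own subnet (ARP directly, no router)."""
    return ipaddress.ip_address(target) in client.network


# 10.0.0.1 is a hypothetical off-link address, included only for contrast.
for target in ("192.168.1.125", "192.168.1.5", "10.0.0.1"):
    path = "ARP for the host itself" if is_on_link(target) else "send via the default gateway"
    print(target, "->", path)
```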
Being the clever engineer you are, you collected a packet capture in each scenario.
Let's look at the non-working capture and see if we can figure out what has gone wrong.
3 14:34:43.034958 192.168.1.112 192.168.1.10 DNS 75 Standard query 0x7ee0 A iis.contoso.com
4 14:34:43.037763 192.168.1.10 192.168.1.112 DNS 107 Standard query response 0x7ee0 A iis.contoso.com A 192.168.1.125 A 192.168.1.5

Frame 4: 107 bytes on wire (856 bits), 107 bytes captured (856 bits) on interface \Device\NPF_{9928A5B2-320A-4B39-81AF-59C400210FED}, id 0
Ethernet II, Src: Microsoft_08:04:13 (00:15:5d:08:04:13), Dst: Microsoft_08:04:15 (00:15:5d:08:04:15)
Internet Protocol Version 4, Src: 192.168.1.10, Dst: 192.168.1.112
User Datagram Protocol, Src Port: 53, Dst Port: 59014
Domain Name System (response)
    Transaction ID: 0x7ee0
    Flags: 0x8580 Standard query response, No error
    Questions: 1
    Answer RRs: 2
    Authority RRs: 0
    Additional RRs: 0
    Queries
        iis.contoso.com: type A, class IN
    Answers
        iis.contoso.com: type A, class IN, addr 192.168.1.125
        iis.contoso.com: type A, class IN, addr 192.168.1.5
    [Request In: 3]
    [Time: 0.002805000 seconds]
In the non-working capture, the Host A query for iis.contoso.com has returned two answers. How about in the working scenario?
8 14:32:32.180926 192.168.1.112 192.168.1.10 DNS 75 Standard query 0x2d86 A iis.contoso.com
9 14:32:32.184721 192.168.1.10 192.168.1.112 DNS 107 Standard query response 0x2d86 A iis.contoso.com A 192.168.1.5 A 192.168.1.125

Frame 9: 107 bytes on wire (856 bits), 107 bytes captured (856 bits) on interface \Device\NPF_{9928A5B2-320A-4B39-81AF-59C400210FED}, id 0
Ethernet II, Src: Microsoft_08:04:13 (00:15:5d:08:04:13), Dst: Microsoft_08:04:15 (00:15:5d:08:04:15)
Internet Protocol Version 4, Src: 192.168.1.10, Dst: 192.168.1.112
User Datagram Protocol, Src Port: 53, Dst Port: 62841
Domain Name System (response)
    Transaction ID: 0x2d86
    Flags: 0x8580 Standard query response, No error
    Questions: 1
    Answer RRs: 2
    Authority RRs: 0
    Additional RRs: 0
    Queries
    Answers
        iis.contoso.com: type A, class IN, addr 192.168.1.5
        iis.contoso.com: type A, class IN, addr 192.168.1.125
    [Request In: 8]
    [Time: 0.003795000 seconds]
We are talking to the same DNS server, but the answers are returned in reverse order. Why is that?
Well, this is due to a wonderful feature that is available on (most) DNS Servers called DNS Round Robin.
The gist is that when a DNS server has multiple records of the same type for the same name, it rotates the order in which it returns them from one response to the next.
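A minimal sketch of that rotation, using the two addresses from the captures. The `RoundRobinRecordSet` class is hypothetical and much simpler than what a real DNS server does; it only shows why two identical queries can come back with the answers in a different order.

```python
from collections import deque


class RoundRobinRecordSet:
    """Hand out the same set of addresses, rotated by one on every response."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        current = list(self._addresses)
        self._addresses.rotate(-1)  # the first address moves to the back for next time
        return current


rrset = RoundRobinRecordSet(["192.168.1.125", "192.168.1.5"])
first_response = rrset.answer()
second_response = rrset.answer()

print(first_response)
print(second_response)
```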
This helps us understand the comment from the analyst that once it is working it tends to stay working. When we get the valid IP address back, we cache the DNS entry and continue to use it, and there is an existing connection to the server.
When we get the bogus IP address, Windows still caches the lookup and will continue to attempt to use it until the entry expires or the cache is cleared.
With that in mind the biggest question is, what is that other DNS record?
Reaching out to the web server admin, we get the following answer:
- Q: Why is there an IP address 192.168.1.5 associated with iis.contoso.com?
- A: Oh, that is the old server IP. We meant to deprecate that.
Oops. Once the bogus DNS record was removed and had expired out of the DNS cache on the client machines, we were all hunky dory.
Wrapping things up
DNS can be a tricky beast to troubleshoot. With how much is cached and how heavily it is leveraged there can be (no pun intended) network effects to DNS queries returning something other than expected.
However, if we keep the basics in mind, we should be able to find the cause of these issues.
And with that, we have the communication basics in place! We can start digging into more involved protocols that will leverage all you have learned so far. Out of the frying pan and into the fire, our next post will be covering the basics of file sharing with SMB (CIFS for the Linux folks).
See you then!