AKS (Azure Kubernetes Service) is a fully managed Kubernetes container orchestration service that enables you to deploy, scale, and manage containerized applications easily. However, even with the most robust systems, issues can arise that require troubleshooting.
This blog post marks the beginning of a three-part series that originated from an intensive one-day bootcamp focused on advanced AKS networking triage and troubleshooting scenarios. It offers a practical approach to diagnosing and resolving common AKS networking issues, aiming to equip readers with quick troubleshooting skills for their AKS environments.
Each post walks through a set of scenarios that simulate typical issues. Detailed setup instructions are provided to build a functional environment. Faults are then introduced that cause the setup to malfunction. Hints are provided on how to triage and troubleshoot these issues using common tools such as kubectl, nslookup, and tcpdump. Each scenario concludes with fixes for the issues faced and an explanation of the steps taken to resolve the problem.
Before setting up AKS, ensure that you have an Azure account and subscription with permissions that allow you to create resource groups and deploy AKS clusters. PowerShell needs to be available, as PS scripts will be used. Follow the instructions provided in this GitHub link to set up AKS and run the scenarios. It is also recommended that you read up on troubleshooting inbound and outbound networking scenarios that may arise in your AKS environment.
For inbound scenarios, troubleshooting covers connectivity issues with applications hosted on the AKS cluster. The linked guide describes issues related to firewall rules, network security groups, and load balancers, and provides guidance on verifying network connectivity, checking application logs, and examining network traffic to identify potential bottlenecks.
For outbound access, troubleshooting scenarios cover traffic leaving the AKS cluster, such as connectivity issues to external resources like databases, APIs, or other services hosted outside the cluster.
The figure below shows the AKS environment, which uses a custom VNet with its own NSG attached to the custom subnet. The AKS cluster uses this custom subnet and has its own NSG created and attached to the network interface of the node pool. Any changes to AKS networking are automatically added to the AKS NSG; however, to apply those changes to the custom subnet NSG, they must be added explicitly.
Objective: The goal of this exercise is to troubleshoot and resolve connectivity between pods and services within the same Kubernetes cluster.
Layout: AKS cluster layout with two Pods created by their respective Deployments and exposed using ClusterIP services.
kubectl create ns student
kubectl config set-context --current --namespace=student
# Verify current namespace
kubectl config view --minify --output 'jsonpath={..namespace}'
kubectl create deployment nginx-1 --image=nginx
kubectl expose deployment nginx-1 --name nginx-1-svc --port=80 --target-port=80 --type=ClusterIP
kubectl create deployment nginx-2 --image=nginx
kubectl expose deployment nginx-2 --name nginx-2-svc --port=80 --target-port=80 --type=ClusterIP
Confirm that the deployments and services are functional: pods should be Running and services listening on port 80.
kubectl get all
# Services returned: nginx-1-svc for pod/nginx-1, nginx-2-svc for pod/nginx-2
kubectl get svc
# Get the values of <nginx-1-pod> and <nginx-2-pod>
kubectl get pods
# below should present HTML page from nginx-2
kubectl exec -it <nginx-1-pod> -- curl nginx-2-svc:80
# below should present HTML page from nginx-1
kubectl exec -it <nginx-2-pod> -- curl nginx-1-svc:80
# check endpoints for the services
kubectl get ep
kubectl get deployment.apps/nginx-2 -o yaml > nginx-2-dep.yaml
kubectl get service/nginx-2-svc -o yaml > nginx-2-svc.yaml
kubectl delete -f nginx-2-dep.yaml
kubectl apply -f broken.yaml
kubectl get all
Below is the inbound traffic flow. Confirm every step from the top down.
kubectl get nodes
kubectl exec -it <nginx-1-pod> -- curl nginx-2-svc:80
# Expected error: Failed to connect to nginx-2-svc port 80: Connection refused
kubectl exec -it <nginx-1-pod> -- curl nginx-1-svc:80
# displays HTML page
kubectl exec -it <nginx-2-pod> -- curl localhost:80
# displays HTML page
kubectl get ep
kubectl describe service <service-name>
Ensure that the service's label selector matches the labels used by its corresponding Deployment, using the describe command:
kubectl describe deployment <deployment_name>
Use ‘k get svc’ and ‘k get deployment’ to get service and deployment names.
Do you notice any discrepancies?
kubectl get pods --selector=<selector_used_by_service>
If no results are returned, then there is a label selector mismatch.
As the figure below shows, the selector used by the deployment returns pods, but the selector used by the corresponding service does not.
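The comparison above can be sketched as a simple check, assuming the selector and label values below were read with `kubectl get svc nginx-2-svc -o jsonpath='{.spec.selector}'` and `kubectl get deployment nginx-2 -o jsonpath='{.spec.template.metadata.labels}'` (values are the ones from this scenario):

```shell
# Selector used by the service vs. labels on the pods created by the deployment.
SVC_SELECTOR='app=nginx-2'   # from nginx-2-svc
POD_LABELS='app=nginx-02'    # from the broken deployment's pod template
if [ "$SVC_SELECTOR" = "$POD_LABELS" ]; then
  RESULT="match: service will have endpoints"
else
  RESULT="mismatch: service endpoints will be empty"
fi
echo "$RESULT"   # mismatch: service endpoints will be empty
```

An empty endpoints list is exactly what `kubectl get ep` reports for the broken service.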
k logs pod/<nginx-2> # no incoming traffic
k logs pod/<nginx-1> # HTTP traffic as seen below
k logs svc/<nginx-2>
k logs svc/<nginx-1>
# Get label
kubectl describe service nginx-2-svc
# Attempting to list pods using the service's label selector returns no resources
kubectl describe pods -l app=nginx-2
kubectl delete -f nginx-2-dep.yaml
In broken.yaml, update the labels 'app: nginx-02' to 'app: nginx-2', as shown below.
kubectl apply -f broken.yaml # or apply nginx-2-dep.yaml
k describe pod <nginx-2>
k get ep # nginx-2 svc should have pods unlike before
# Should return HTML page from nginx-2-svc
kubectl exec -it <nginx-1 pod> -- curl nginx-2-svc:80
# Confirm above from logs
k logs pod/<nginx-2>
Currently, Services in your namespace 'student' resolve using <service name>.<namespace>.svc.cluster.local.
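The way that FQDN is assembled can be shown in plain shell (the default cluster domain cluster.local is assumed):

```shell
# Build the cluster-internal FQDN for a service from its name and namespace.
SERVICE="nginx-2-svc"
NAMESPACE="student"
CLUSTER_DOMAIN="cluster.local"
FQDN="${SERVICE}.${NAMESPACE}.svc.${CLUSTER_DOMAIN}"
echo "$FQDN"   # nginx-2-svc.student.svc.cluster.local
```

Within the same namespace, the short name (`nginx-2-svc`) works too, because the pod's resolv.conf search list appends these suffixes automatically.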
Below command should return web page.
k exec -it <nginx-1 pod> -- curl nginx-2-svc.student.svc.cluster.local
kubectl apply -f broken2.yaml
kubectl delete pods -l=k8s-app=kube-dns -n kube-system
# Monitor to ensure pods are running
kubectl get pods -l=k8s-app=kube-dns -n kube-system
k exec -it <nginx-1 pod> -- curl nginx-2-svc.student.svc.cluster.local
k exec -it <nginx-1 pod> -- curl nginx-2-svc
k get cm -n kube-system | grep dns
k describe cm coredns -n kube-system
k describe cm coredns-autoscaler -n kube-system
k describe cm coredns-custom -n kube-system
kubectl delete cm coredns-custom -n kube-system
kubectl delete pods -l=k8s-app=kube-dns -n kube-system
# Monitor to ensure pods are running
kubectl get pods -l=k8s-app=kube-dns -n kube-system
kubectl exec -it <nginx-1 pod> -- curl nginx-2-svc.student.svc.cluster.local
# Challenge lab: Resolve using FQDN aks.com #
# Run below command to get successful DNS resolution
k exec -it <nginx-1 pod> -- curl nginx-2-svc.aks.com
# Solution #
k apply -f working2.yaml
kubectl delete pods -l=k8s-app=kube-dns -n kube-system
# Monitor to ensure pods are running
kubectl get pods -l=k8s-app=kube-dns -n kube-system
# Confirm working using below cmd
k exec -it <nginx-1 pod> -- curl nginx-2-svc.aks.com
# Bring back to default
k delete cm coredns-custom -n kube-system
kubectl delete pods -l=k8s-app=kube-dns -n kube-system
# Monitor to ensure pods are running
kubectl get pods -l=k8s-app=kube-dns -n kube-system
In broken.yaml, the deployment's pod labels (app: nginx-02) did not match the service's selector (app: nginx-2), so the service had no endpoints.
In broken2.yaml, breaking changes rewrote names under '.svc.cluster.local' to '.bad.cluster.local', which broke DNS resolution.
$kubectl_apply=@"
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  internal-custom.override: | # any key name ending in .override
    rewrite stop {
      name regex (.*)\.svc\.cluster\.local {1}.bad.cluster.local.
      answer name (.*)\.bad\.cluster\.local {1}.svc.cluster.local.
    }
"@
$kubectl_apply | kubectl apply -f -
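To see what that rewrite rule does to a query name, sed can stand in for CoreDNS's rewrite plugin (the regex is the one from the coredns-custom ConfigMap; this is only an illustration of the pattern match, not how CoreDNS executes it):

```shell
# Apply the rewrite regex to a cluster-internal service name.
QUERY="nginx-2-svc.student.svc.cluster.local"
REWRITTEN=$(echo "$QUERY" | sed -E 's/(.*)\.svc\.cluster\.local/\1.bad.cluster.local/')
echo "$REWRITTEN"   # nginx-2-svc.student.bad.cluster.local
```

The rewritten name lands in a zone CoreDNS cannot answer, which is why lookups of the service FQDN fail until the ConfigMap is deleted.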
k delete deployment/nginx-1 deployment/nginx-2 service/nginx-1-svc service/nginx-2-svc
Or just delete the namespace: k delete ns student
Objective: The goal of this exercise is to troubleshoot and resolve Pod DNS lookups and DNS resolution failures.
Layout: Cluster layout as shown below has NSG applied to AKS subnet, with Network Policies in effect.
kubectl create ns student
kubectl config set-context --current --namespace=student
# Verify current namespace
kubectl config view --minify --output 'jsonpath={..namespace}'
kubectl run dns-pod --image=nginx --port=80 --restart=Never
kubectl exec -it dns-pod -- bash
# Run these commands at the bash prompt
apt-get update -y
apt-get install dnsutils -y
exit
kubectl exec -it dns-pod -- nslookup kubernetes.default.svc.cluster.local
kubectl apply -f broken1.yaml
kubectl exec -it dns-pod -- nslookup kubernetes.default.svc.cluster.local
kubectl exec -it dns-pod -- nslookup kubernetes.default.svc.cluster.local
# If the response is 'connection timed out; no servers could be reached', proceed with the troubleshooting below
kubectl get svc kube-dns -n kube-system
$coredns_pod=$(kubectl get pods -n kube-system -l k8s-app=kube-dns -o=jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system $coredns_pod
kubectl describe cm coredns-custom -n kube-system
kubectl get networkpolicy -A
NAMESPACE NAME POD-SELECTOR
kube-system block-dns-ingress k8s-app=kube-dns
kubectl describe networkpolicy block-dns-ingress -n kube-system
# Should show an Ingress rule that does not allow DNS traffic on UDP port 53
kubectl delete networkpolicy block-dns-ingress -n kube-system
kubectl run -it --rm --restart=Never test-dns --image=busybox --command -- nslookup kubernetes.default.svc.cluster.local
# If the DNS resolution is working correctly, you should see the correct IP address associated with the domain name
# The CLI steps below can also be performed in the Azure portal under the NSG
kubectl expose pod dns-pod --name=dns-svc --port=80 --target-port=80 --type LoadBalancer
kubectl get svc
kubectl exec -it dns-pod -- curl <EXTERNAL-IP>
curl <EXTERNAL-IP>
$custom_aks_nsg = "custom_aks_nsg" # <- verify
$nsg_list=az network nsg list --query "[?contains(name,'$custom_aks_nsg')].{Name:name, ResourceGroup:resourceGroup}" --output json
# Extract Custom AKS Subnet NSG name, NSG Resource Group
$nsg_name=$(echo $nsg_list | jq -r '.[].Name')
$resource_group=$(echo $nsg_list | jq -r '.[].ResourceGroup')
echo $nsg_list, $nsg_name, $resource_group
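The jq extraction above can be exercised against a sample of the JSON shape that `az network nsg list --query "[?...].{Name:name, ResourceGroup:resourceGroup}"` returns (the resource group name here is a hypothetical example, not from the lab):

```shell
# Sample of the az CLI output shape, then the same jq filters as above.
NSG_LIST='[{"Name":"custom_aks_nsg","ResourceGroup":"rg-aks-custom"}]'
NSG_NAME=$(echo "$NSG_LIST" | jq -r '.[].Name')
RG=$(echo "$NSG_LIST" | jq -r '.[].ResourceGroup')
echo "$NSG_NAME $RG"   # custom_aks_nsg rg-aks-custom
```

If more than one NSG matches the name filter, `.[].Name` emits one line per match, so tighten the `contains()` filter until exactly one NSG is returned.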
$EXTERNAL_IP="<insert>"
az network nsg rule create --name AllowHTTPInbound `
--resource-group $resource_group --nsg-name $nsg_name `
--destination-port-range 80 --destination-address-prefix $EXTERNAL_IP `
--source-address-prefixes Internet --protocol tcp `
--priority 100 --access allow
curl <EXTERNAL-IP>
broken1.yaml contains a Network Policy that blocks ingress UDP requests on port 53 to the CoreDNS (kube-dns) pods.
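The exact contents of broken1.yaml are not shown in this post, but based on the `kubectl describe networkpolicy` output above, it likely resembles the following sketch: a policy in kube-system selecting the CoreDNS pods whose ingress rules allow only TCP 53, leaving UDP 53 implicitly denied.

```yaml
# Hypothetical reconstruction of broken1.yaml (treat as a sketch, not the lab file).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-dns-ingress
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      k8s-app: kube-dns
  policyTypes:
    - Ingress
  ingress:
    - ports:
        - protocol: TCP   # only TCP 53 is allowed; UDP 53 is implicitly denied
          port: 53
```

Because NetworkPolicy is allow-list based, any ingress port/protocol not listed under an `ingress` rule is dropped for the selected pods, which is enough to break standard UDP DNS lookups.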
k delete pod/dns-pod
or
k delete ns student
az network nsg rule delete --name AllowHTTPInbound `
--resource-group $resource_group --nsg-name $nsg_name
This post demonstrates common connectivity and DNS issues that can arise when working with AKS. The first scenario focuses on resolving connectivity problems between pods and services within the Kubernetes cluster. We encountered issues where the labels assigned by a deployment did not match the corresponding service selector, resulting in non-functional endpoints. Additionally, we identified and rectified issues with the CoreDNS configuration and custom domain names. The second scenario addresses troubleshooting DNS and external access failures. We explored how improperly configured network policies can negatively impact DNS traffic flow. In the next article, the second of the three-part series, we will delve into troubleshooting scenarios related to endpoint connectivity across virtual networks and tackle port configuration issues involving services and their corresponding pods.
The sample scripts are not supported by any Microsoft standard support program or service. The sample scripts are provided AS IS without a warranty of any kind. Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The entire risk arising out of the use or performance of the sample scripts and documentation remains with you. In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages.