Author: Reza Ramezanpour, Senior Developer Advocate @ Tigera
For most of Kubernetes’ life, service networking has relied on iptables. It was practical, widely available across Linux distributions, and good enough for clusters of modest scale running predictable workloads. However, as cloud providers raise per-cluster pod limits and deployments take advantage of multi-regional, highly available infrastructure, the push to run ever more workloads shines a light on an old design problem.
Today’s production clusters, built on that cloud provider infrastructure, run thousands of services, see constant endpoint churn, and must satisfy strict requirements around performance, security, and compliance. The old iptables model, which we will examine here, is inefficient under these conditions and was never designed with such environments in mind.
This is why the Kubernetes community started to move away from iptables toward nftables: upstream Kubernetes graduated kube-proxy’s nftables mode to stable in the v1.33 release, followed by nftables support in Tigera’s free and open source Project Calico v3.29.
Microsoft’s decision last year to support kube-proxy in nftables mode reflects a broader reality: the traditional iptables model is becoming a structural bottleneck for modern Kubernetes platforms, including managed environments such as AKS.
In this blog we are going to use Microsoft’s latest kube-proxy preview features to create a Bring Your Own CNI cluster configured for nftables, and use Project Calico to establish networking and security on it. We will also look at why a shift from iptables to nftables is recommended, and why it should happen sooner rather than later.
The Hidden Tax of iptables
You might be using iptables in a large cluster today and thinking, “Everything works, why change it?” But that’s how the problem usually starts.
The iptables limitations don’t show up as a failure. They show up gradually: higher CPU usage, slower updates, harder debugging. Then one day, they’re unavoidable. Think of it like this: you have two security guards checking people entering a stadium.
Both have the same guest list.
- The first guard (iptables) holds the list on paper. For every person, he starts at the top of the list, scans name by name, finds a match, then goes back to the top for the next person. Each time a new person joins the line, he has to rewrite all or part of the list and start from scratch.
- The second guard (nftables) has the list in a searchable database. He types the name, gets an instant answer, and moves on.
Both let the right people in. However, one of these security guards slows down as the crowd grows, and at some point he simply can’t keep up with the line. That’s the hidden cost of sticking with iptables: lookup time grows with the number of services in your cluster, and updates get more expensive as the system scales.
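In rule terms, the difference looks roughly like the sketch below. The chain names, hashes, and addresses are purely illustrative, not taken from a real cluster: kube-proxy in iptables mode appends one rule per Service to a chain such as KUBE-SERVICES and evaluates them top to bottom for every packet, while in nftables mode the same lookup is a single entry in an indexed verdict map.
# iptables mode (illustrative): one rule per Service, scanned linearly
-A KUBE-SERVICES -d 10.0.92.204/32 -p tcp -m tcp --dport 80 -j KUBE-SVC-AAAA0000
-A KUBE-SERVICES -d 10.0.93.17/32 -p tcp -m tcp --dport 443 -j KUBE-SVC-BBBB1111
# ...one more rule for every additional Service...
# nftables mode (illustrative): one map entry per Service, resolved by a single lookup
elements = { 10.0.92.204 . tcp . 80 : goto service-AAAA0000-demo/svc/tcp/ }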
Creating Your First nftables-Ready AKS Cluster
Microsoft Azure Kubernetes Service (AKS) now includes preview support for running kube-proxy in nftables mode. This preview is opt-in and not enabled by default, because it’s still under development.
To use it in AKS, you must explicitly opt into the preview:
az extension add --name aks-preview
az extension update --name aks-preview
Next, register the kube-proxy custom configuration preview feature.
az feature register --namespace "Microsoft.ContainerService" --name "KubeProxyConfigurationPreview"
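Feature registration is asynchronous. You can poll its state and, once it reports Registered, refresh the Microsoft.ContainerService provider; this is the standard pattern for AKS preview features:
az feature show --namespace "Microsoft.ContainerService" --name "KubeProxyConfigurationPreview" --query "properties.state"
az provider register --namespace "Microsoft.ContainerService"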
Once the feature is registered (this may take a few minutes to complete), you can customize the kube-proxy deployment. To enable nftables mode, define a minimal kube-proxy configuration like the following and save it as kube-proxy.json, which will be referenced when creating the AKS cluster in the next step:
{
  "enabled": true,
  "mode": "NFTABLES"
}
With that configuration saved as kube-proxy.json, issue the following commands to create a resource group and your cluster:
az group create --name nftables-demo --location canadacentral
az aks create \
--resource-group nftables-demo \
--name calico-nftables \
--kube-proxy-config kube-proxy.json \
--network-plugin none \
--pod-cidr "10.10.0.0/16" \
--generate-ssh-keys \
--location canadacentral \
--node-count 2 \
--vm-size Standard_A8m_v2
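Once provisioning finishes, pull the cluster credentials into your kubeconfig (using the same resource group and cluster name as above):
az aks get-credentials --resource-group nftables-demo --name calico-nftables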
After the cluster is created, verify that kube-proxy is running in nftables mode by issuing the following command:
kubectl logs -n kube-system ds/kube-proxy | egrep "Proxier"
Output:
I0129 21:09:06.613047 1 server_linux.go:259] "Using nftables Proxier"
After you’ve confirmed nftables is active, install Calico.
kubectl create -f https://docs.tigera.io/calico/latest/manifests/tigera-operator.yaml
Next, create the Installation resource, which tells the Tigera operator which Calico features should be enabled in your environment.
kubectl create -f https://gist.githubusercontent.com/frozenprocess/6932bae9e33468b53696f1a901f2aa76/raw/d74ba8ef30a5657e7474a1fe22a650f87c08ecaf/installation.yaml
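The operator takes a few minutes to roll Calico out. You can watch its progress through the tigerastatus resources it publishes and the pods in the calico-system namespace:
kubectl get tigerastatus
kubectl get pods -n calico-system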
If you have already installed Calico and are running another dataplane, such as iptables or Calico eBPF, simply use the following command to switch the dataplane to nftables.
kubectl patch installation default --type=merge -p='{"spec":{"calicoNetwork":{"linuxDataplane":"Nftables"}}}'
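Whether you installed fresh or patched an existing cluster, you can confirm the dataplane setting by reading it back from the Installation resource:
kubectl get installation default -o jsonpath='{.spec.calicoNetwork.linuxDataplane}'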
A Tale of Two Backends - How Services Work in nftables
Now that we have the analogy and an environment, let’s look at the actual differences between these two backends. To do that, first deploy an application on your cluster. This demo app creates a Deployment with one replica and a Service of type LoadBalancer.
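The demo manifest itself isn’t reproduced here, but a minimal equivalent can be created imperatively. The image below is just a stand-in; the web-demo namespace, anp-demo-app Deployment, and container-service Service names match what the rest of this walkthrough expects:
kubectl create namespace web-demo
kubectl create deployment anp-demo-app -n web-demo --image=nginx --port=80
kubectl expose deployment anp-demo-app -n web-demo --name=container-service --type=LoadBalancer --port=80
Once everything is running, kubectl get all -n web-demo should produce output similar to the following: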
NAME READY STATUS RESTARTS AGE
pod/anp-demo-app-5669879946-rb28v 1/1 Running 0 91s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/container-service LoadBalancer 10.0.92.204 40.82.188.95 80:31907/TCP 90s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/anp-demo-app 1/1 1 1 91s
NAME DESIRED CURRENT READY AGE
replicaset.apps/anp-demo-app-5669879946 1 1 1 92s
1. Service Maps (The list in our analogy)
In nftables mode, kube-proxy doesn't just create a long list of linear rules. Instead, it leverages maps. Think of a map as a high-speed lookup table. Our first step is to see how Kubernetes stores the mapping between a Service's Public/Cluster IP and its internal handling logic.
kubectl exec -itn calico-system ds/calico-node -c calico-node -- nft list map ip kube-proxy service-ips
Output:
type ipv4_addr . inet_proto . inet_service : verdict
elements = {
40.82.188.95 . tcp . 80 : goto external-SG2JDYAZ-web-demo/container-service/tcp/
}
What’s happening here? The output shows that any traffic hitting the external IP 40.82.188.95 on port 80 is immediately sent, via a goto verdict, to a specific chain. Unlike iptables, which has to evaluate rules one by one, nftables jumps straight to the relevant logic.
2. The Dispatcher Chain
Once a packet matches a service in the map, it is handed off to a specific chain. This chain acts as a dispatcher, preparing the packet for the "External" world.
A map is a simple way to decide what to do once a flow matches: here, if the destination IP, protocol, and destination port match an entry, the verdict is to continue processing in the external-SG2JDYAZ-web-demo/container-service/tcp/ chain.
Note that external clients have no direct route to the internal pod IPs, so nftables now needs to NAT the traffic before sending it on to a pod.
kubectl exec -itn calico-system ds/calico-node -c calico-node -- nft list chain ip kube-proxy "external-SG2JDYAZ-web-demo/container-service/tcp/"
Note: Here, nftables marks the packet for Masquerade (SNAT) to ensure the return traffic flows back through the gateway, and then forwards it to the primary service chain.
Output:
table ip kube-proxy {
chain external-SG2JDYAZ-web-demo/container-service/tcp/ {
jump mark-for-masquerade
goto service-SG2JDYAZ-web-demo/container-service/tcp/
}
}
3. Load Balancing via Verdict Maps
This is where the magic happens. Kubernetes needs to decide which Pod should receive the traffic. In nftables, this is handled using a vmap (verdict map) combined with a random number generator.
kubectl exec -itn calico-system ds/calico-node -c calico-node -- nft list chain ip kube-proxy "service-SG2JDYAZ-web-demo/container-service/tcp/"
Notice the numgen random mod 1. Since we currently have only one replica, the logic is simple: 100% of traffic goes to the single available endpoint.
table ip kube-proxy {
chain service-SG2JDYAZ-web-demo/container-service/tcp/ {
ip daddr 10.0.92.204 tcp dport 80 ip saddr != 10.10.0.0/16 jump mark-for-masquerade
numgen random mod 1 vmap { 0 : goto endpoint-KEOBOJ73-web-demo/container-service/tcp/__10.10.219.68/80 }
}
}
4. The Final Destination: DNAT to Pod IP
The last stop in the nftables journey is the endpoint chain. This is where the packet’s destination address is actually changed from the Service IP to the Pod’s internal IP.
kubectl exec -itn calico-system ds/calico-node -c calico-node -- nft list chain ip kube-proxy "endpoint-KEOBOJ73-web-demo/container-service/tcp/__10.10.219.68/80"
The dnat to 10.10.219.68:80 rule rewrites the packet header. The packet is now ready to be routed directly to the container.
Output:
table ip kube-proxy {
chain endpoint-KEOBOJ73-web-demo/container-service/tcp/__10.10.219.68/80 {
ip saddr 10.10.219.68 jump mark-for-masquerade
meta l4proto tcp dnat to 10.10.219.68:80
}
}
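At this point you can confirm the whole path end to end from outside the cluster by hitting the Service’s external IP, assuming the demo app answers plain HTTP on port 80:
curl -s http://40.82.188.95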
Scaling Deployments
What happens when we scale? Let’s increase our replicas to 10 to see how nftables updates its load-balancing table dynamically.
kubectl patch deployment -n web-demo --type=merge anp-demo-app -p='{"spec":{"replicas":10}}'
What changed? The numgen (number generator) is now mod 10, and the vmap contains ten target chains, one for each Pod. This is significantly more efficient than the probability-based rules legacy iptables uses for load balancing. You can verify this by re-issuing the command from step 3.
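For convenience, here is that command again; after scaling, the vmap should list ten endpoint chains, with numgen random mod 10 selecting among them:
kubectl exec -itn calico-system ds/calico-node -c calico-node -- nft list chain ip kube-proxy "service-SG2JDYAZ-web-demo/container-service/tcp/"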
Note: The actual journey doesn’t start at the map; it begins at the nat-prerouting chain, moves to the services chain, and is finally evaluated against the service-ips map.
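You can walk those earlier hops with the same tooling used throughout this post. The chain names below follow the note above, though they may vary slightly between kube-proxy versions:
kubectl exec -itn calico-system ds/calico-node -c calico-node -- nft list chain ip kube-proxy nat-prerouting
kubectl exec -itn calico-system ds/calico-node -c calico-node -- nft list chain ip kube-proxy services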
If you’d like to learn how iptables-based services work, take a look at this free course.
It’s the End of an Era
As more Linux distributions, companies, and projects shift from iptables to nftables, the cost of maintaining an iptables environment will only grow for those who stay on it. The move from iptables to nftables isn’t just a minor version bump; it is a fundamental architectural upgrade for the modern cloud-native stack. As we examined, the “hidden tax” of linear rule evaluation in iptables becomes a bottleneck as your Kubernetes clusters scale into thousands of services and endpoints. This is why the industry is looking for solutions that are better tuned for cloud-native environments.
By leveraging the new nftables mode in kube-proxy, now in preview on Microsoft AKS and available upstream since Kubernetes v1.33 and Project Calico v3.29+, platform engineers can finally move away from iptables.
Footnote - Observability and troubleshooting
In this blog, we explored the inner workings of nftables using the command line, a powerful way to understand the underlying mechanics of packet filtering. However, manual terminal work isn't the only way to troubleshoot networking in Kubernetes. In a production environment, you often need a higher-level view to observe traffic flows and policy impacts across hundreds of pods.
Several open-source projects provide "browser-based" observability, allowing you to debug network flows visually rather than parsing text logs.
Network Observability Tools
Calico Whisker: A dedicated observability tool for Calico (v3.30+) that visualizes real-time network flows. It is particularly useful for debugging Staged Network Policies, as it shows you exactly which flows would be denied before you actually enforce the rules. Learn more about Calico Whisker here.
To open the Whisker UI locally, port-forward its Service and browse to http://localhost:8081:
kubectl port-forward -n calico-system svc/whisker 8081:8081
Microsoft Retina: A CNI-agnostic, eBPF-based platform that provides deep insights into packet drops, DNS health, and TCP metrics. It works across different cloud providers and operating systems (including Windows). Learn more about Microsoft Retina here.