Hello There
If you're hosting a real-time, highly transactional, synchronous workload, low network latency and the distributed resiliency of Kubernetes are very relevant to you. In this post I hope to quantify the performance and resiliency implications of these factors for L7 ingress in Kubernetes. In a future post I hope to cover any latency discrepancies introduced by e2e encryption.
On We Go
NOTE: App Gateway for Containers is currently in preview and will eventually succeed AGIC: Public preview: Azure Application Gateway for Containers | Azure updates | Microsoft Azure. It will address all AGIC shortcomings documented in this blog post.
Assuming you're leveraging the Azure CNI, and without diving into the broader CNCF landscape of ingress controllers, the choice for ingress load balancing on AKS typically comes down to Nginx or Azure Application Gateway (AppGW). The Application Gateway Ingress Controller (AGIC) runs inside the cluster and manages AppGW for you. Without going into specifics, AGIC's configuration can be more straightforward, and it lets you kill two birds with one stone if you need a Web Application Firewall (WAF) and plan to implement that WAF with AppGW. However, there are known tradeoffs around performance and fault tolerance when leveraging AGIC in lieu of Nginx.
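For reference, here's a rough sketch of how the same route might be declared for each controller (the hostname, backend Service name, and ingressClassName values are assumptions; the exact class names depend on how each controller was installed in your cluster):

```yaml
# Hypothetical route published through AGIC / Application Gateway.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-agic
spec:
  ingressClassName: azure-application-gateway   # class name as installed in your cluster
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-svc   # placeholder backend Service
                port:
                  number: 80
---
# The same route served by the in-cluster NGINX ingress controller instead.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: demo-nginx
spec:
  ingressClassName: nginx
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: demo-svc
                port:
                  number: 80
```

The Ingress resources themselves are nearly identical; the difference is which controller reconciles them and where the proxying actually happens (in AppGW outside the cluster vs. in nginx pods inside it).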
What performance and resiliency tradeoffs can we expect, though?
Well, generally speaking, here's what I ponder:
After reading the above, if you'd like a refresher on k8s service proxies, see here.
The Tests
I'm going to perform some tests which will shed light on the relevance of the above points. The tests will be run on two configurations.
Config #1: Affinity: With a high level of pod affinity. Ref: Yaml Manifests
Config #2: Anti-Affinity: With a high level of pod anti-affinity. Ref: Yaml Manifests
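For context, the scheduling difference between the two configs boils down to a pod-spec fragment like the one below (the label selector is an assumption; the linked manifests are the source of truth):

```yaml
# Config #1 (affinity): schedule app pods onto the same nodes as the
# nginx ingress controller pods.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: ingress-nginx   # assumed controller label
        topologyKey: kubernetes.io/hostname
---
# Config #2 (anti-affinity): keep app pods off the nodes that host the
# controller pods.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: ingress-nginx
        topologyKey: kubernetes.io/hostname
```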
Here is a layout of the cluster with both configs implemented:
Here's a rundown of the k8s Ingresses configured:
Structure for Tests #1 and #2:
Test 1: Comparing latency for requests sent from outside of the cluster, through the internet.
We see negligible differences between AGIC and Nginx when sending requests from my local machine, through the internet:
Not much discrepancy is observed. Since all Nginx deployments have externalTrafficPolicy=Local, the Azure LB only sends traffic to nodes hosting an nginx controller pod. And since those controllers are already hosted on the nodes where the destination pods run, I expected a more noticeable improvement in latency, as little to no further proxying should be necessary to reach the destination. Hopefully the tests from within the cluster yield more interesting results.
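For reference, the relevant part of the nginx controller Service looks roughly like this (the name, namespace, and labels are the ingress-nginx defaults and may differ in your deployment):

```yaml
# Sketch of the nginx ingress controller Service. externalTrafficPolicy: Local
# tells the Azure LB (via node health probes) to forward only to nodes that
# actually host a controller pod, avoiding a second hop through kube-proxy
# to another node and preserving the client source IP.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https
```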
Test 2a: Comparing latency of requests sent between services in the same cluster when going through Nginx vs AppGW - all traffic is sent to external LoadBalancer IP for the respective nginx controller.
Here are the results from the affinity traffic generator:
Takeaway:
Nginx affinity traffic is 47% faster compared to AppGW.
Here are the results from the anti-affinity traffic generator:
Takeaway:
Nginx anti-affinity traffic is 15% faster compared to AppGW.
Ultimate takeaway:
Nginx is 38% faster when averaging across both scenarios.
Test 2b: What if we run the same test but route nginx traffic to the ClusterIP instead of the external LoadBalancer IP?
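To make the two targets concrete, here's an abbreviated view of the same controller Service once provisioned (IP addresses are made-up placeholders): Test 2a sends traffic to the external IP under status.loadBalancer, while Test 2b sends it to spec.clusterIP.

```yaml
# Abbreviated output of: kubectl get svc ingress-nginx-controller -o yaml
# (IP addresses below are placeholders)
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  clusterIP: 10.0.113.27          # in-cluster VIP targeted in Test 2b
status:
  loadBalancer:
    ingress:
      - ip: 20.51.0.14            # Azure LB frontend IP targeted in Test 2a
```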
Here are the results for the affinity traffic generator:
Takeaway:
No difference compared to LB IP test - nginx is 47% faster for both tests.
Here are the results for the anti-affinity traffic generator:
Takeaway:
Nginx is 12% faster, a decline from the 15% difference when using the LB IP.
Test 3: How reliable is AGIC in updating AppGW configurations when Pod IPs change? Can we expect requests to be completed when pods go down for any reason?
For my last test I repeated the following process 10 times from my local machine, through the internet:
- Delete a random application pod asynchronously
- Send request to respective ingress
- Wait for the request to complete
- Wait 500 ms
- Repeat.
You might think this test is somewhat unrealistic. You might be right. Go on.
These were the AGIC results for the affinity config:
Takeaways:
- The median latency was ~165 ms, but there were two outliers at ~30,000 ms each, both of which ultimately returned OK responses.
- 50% of the requests were negatively impacted:
  - 3 requests returned 502 Bad Gateway errors
  - 2 requests completed with prolonged latencies
- Considering that the requests were run over the internet, I would expect the results to be worse if the test was run from within the cluster. This is a concern if there's a hard dependency on low latency and fault tolerance.
- The requests that did succeed with baseline latency were lucky to land on a pod that was not impacted.
However unrealistic this test is, the real narrative is formed when we run the same test on the Nginx affinity config.
These were the Nginx results for the affinity config:
Takeaways:
- There wasn't a single hiccup in the nginx environment.
- This is a testament to how quickly the overlay network is updated when a pod is destroyed, and it's a substantial advantage over AGIC when it comes to fault tolerance across the pod lifecycle.
Concluding
Here are the inferences and calls to action I've taken away from comparing AppGW and Nginx performance.