The customer had an SI (system integration) partner migrate their system to the cloud in a "lift-and-shift" manner, but session affinity did not work properly after the migration, even though it had worked fine when the system ran in their data center.
Environment and deployment topology
Their deployment topology is shown below. The issue occurred after the migration was completed. The customer's system was deployed in a single region, not across multiple regions.
Azure Load Balancer (ALB): traffic routing is based on protocol and client IP.
We'd like to configure cookie-based session affinity.
We'd like to achieve it as inexpensively as possible.
When a packaged application is hosted on a Java EE application server, session affinity is typically achieved through application server clustering or through session sharing with an in-memory data grid or cache. However, the customer could not configure an application server cluster because clustering was not available in the edition they used. Since the SI partner knew that ALB has no session affinity feature, they deployed L7 LB NVAs behind ALB to provide session affinity.
Let's imagine the causes of this issue
Many readers can probably guess the root cause just by looking at the deployment topology above. The following points should be checked.
Could the source IP of inbound traffic to ALB (public) change? Specifically, could the global IP change when the local IP is translated to a global IP via SNAT on the customer site?
ALB does not have a session affinity feature, so if the source IP of inbound traffic changes, the destination VM hosting the packaged application may also change.
Could the reverse proxies introduce side effects?
Did the L7 LB NVAs deployed behind ALB work as expected? Was session information shared between the two NVAs?
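The first check above, the effect of the hash-based distribution, can be sketched as follows. This is a minimal illustration, not Azure's actual hashing algorithm, and the backend names are hypothetical: the point is only that a 5-tuple hash changes when SNAT rewrites the source IP, so a different backend may be chosen.

```python
import hashlib

# Two hypothetical L7 LB NVAs behind ALB (public).
BACKENDS = ["l7lb-nva-1", "l7lb-nva-2"]

def pick_backend(src_ip, src_port, dst_ip, dst_port, proto):
    """Choose a backend from a hash over the 5-tuple (conceptually what ALB does)."""
    five_tuple = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}"
    digest = hashlib.sha256(five_tuple.encode()).digest()
    return BACKENDS[digest[0] % len(BACKENDS)]

# The same client, but SNAT on the customer site rewrites the source IP
# mid-session; the hash input changes, so the chosen NVA may change too.
before = pick_backend("203.0.113.10", 40000, "20.0.0.1", 443, "tcp")
after = pick_backend("203.0.113.11", 40000, "20.0.0.1", 443, "tcp")
print(before, after)  # not guaranteed to be the same NVA
```

The routing is perfectly deterministic for a fixed 5-tuple; it is the SNAT-induced change of one tuple element that makes the choice appear inconsistent to the client.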
This issue occurred due to the following causes.
The source IP of inbound traffic sometimes changed due to SNAT.
When the source IP changed, ALB (public) treated the traffic as coming from a different client and routed it to another L7 LB NVA.
The L7 LB NVAs were deployed behind ALB to provide session affinity, but they did not work as expected because session information was not shared between them. When inbound traffic was routed to one L7 LB NVA, that NVA had no way to recognize session continuity, so it treated the traffic as coming from a different client.
The following URL describes the traffic distribution rules.
The following table lists what happened in each component.
What would happen?

ALB (public)
Traffic actually comes from the same client, but it is sometimes NATed to a different global IP. In that case, ALB (public) treats the traffic as coming from a different client and routes it to any of the L7 LB NVAs. Therefore, the chosen L7 LB NVA may differ from the one that processed the previous traffic from the same client.
L7 LB NVA
If the L7 LB NVAs are configured as Active-Active but session information is not shared between them, neither NVA can determine whether the traffic comes from the same client. Therefore, an L7 LB NVA can route traffic to any of the reverse proxy NVAs, and the chosen reverse proxy NVA may differ from the one that processed the previous traffic.
ALB (Internal)
If the reverse proxy NVA that the current traffic passes through differs from the one that processed the previous traffic, ALB (Internal) sees a different source IP, treats the traffic as coming from a different client, and routes it to any of the internal L7 LB NVAs. Therefore, the chosen internal L7 LB NVA may differ from the one that processed the previous traffic from the same client.
Internal L7 LB NVA
The same logic applies here. Since session information is not shared between the internal L7 LB NVAs, neither can determine whether the traffic comes from the same client. Therefore, an internal L7 LB NVA can route traffic to any of the VMs hosting the packaged application, and the chosen VM may differ from the one that processed the previous traffic.
Traffic routing was not consistent for the reasons above: traffic was sometimes routed to the VM that had handled the previous request, and at other times to a different VM.
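The unshared-session-table failure described above can be sketched as a toy model. All names here are illustrative; the point is that two Active-Active NVAs with only local state cannot recognize a session that the other instance established.

```python
class L7LbNva:
    """A toy Active-Active L7 LB NVA with a *local*, unshared session table."""

    def __init__(self, name, backends):
        self.name = name
        self.backends = backends
        self.sessions = {}  # session id -> backend; never replicated to peers
        self._next = 0      # round-robin cursor for unknown sessions

    def route(self, session_id):
        if session_id not in self.sessions:
            # Unknown session: pick a backend round-robin and remember it locally.
            self.sessions[session_id] = self.backends[self._next % len(self.backends)]
            self._next += 1
        return self.sessions[session_id]

backends = ["app-vm-1", "app-vm-2"]
nva1 = L7LbNva("nva-1", backends)
nva2 = L7LbNva("nva-2", backends)

nva2.route("sess-99")          # nva-2 happens to serve some other client first
first = nva1.route("sess-42")  # nva-1 learns sess-42 -> "app-vm-1"
second = nva2.route("sess-42") # nva-2 has no record of sess-42 -> "app-vm-2"
print(first, second)           # the same session lands on two different VMs
```

With a shared session store (or with affinity handled by a single managed L7 service), the second hop would have found the existing entry and affinity would have held.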
I pointed out what needed to be fixed, and the SI partner reconfigured the component topology. After that, traffic was consistently routed to the expected packaged application node.
The ALB, L7 LB NVAs, and reverse proxy NVAs were replaced with Azure Application Gateway (App GW).
Cookie-based affinity was enabled by following the documentation.
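Conceptually, cookie-based affinity works as sketched below: the gateway picks a backend on the first request and sets an affinity cookie, and later requests carrying that cookie go back to the same backend regardless of the source IP. This is only an illustration; App GW's real affinity cookie (named ApplicationGatewayAffinity) carries an opaque hash, whereas this sketch stores the backend name directly for clarity.

```python
import hashlib

BACKENDS = ["app-vm-1", "app-vm-2"]

def handle(request_cookies, src_ip):
    """Return (chosen backend, cookies to set on the response)."""
    affinity = request_cookies.get("ApplicationGatewayAffinity")
    if affinity is not None and affinity in BACKENDS:
        return affinity, {}  # sticky: honor the cookie, ignore the source IP
    # First request: choose a backend (hash of source IP, just for this sketch)
    chosen = BACKENDS[int(hashlib.sha256(src_ip.encode()).hexdigest(), 16) % len(BACKENDS)]
    return chosen, {"ApplicationGatewayAffinity": chosen}

# First request establishes affinity; SNAT then changes the source IP,
# but the cookie keeps the session on the same backend.
backend1, set_cookies = handle({}, "203.0.113.10")
backend2, _ = handle(set_cookies, "203.0.113.11")  # different source IP
assert backend1 == backend2
```

This is exactly why the cookie approach survives the SNAT behavior that broke the original topology: the affinity key travels with the HTTP request instead of being derived from the network 5-tuple.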
Here is the reconfigured component topology. It helped the customer reduce both NVA-related costs and operational costs.
In this case, Azure Front Door as a public L7 LB was not a good fit because the customer's system was deployed in a single region; in other words, a global service did not match their requirements.
In this case, App GW's features met their requirements for a reverse proxy. If App GW does not meet a customer's reverse proxy requirements (for example, when a reverse proxy acting as an authentication gateway is required), the following topology would be better.
The following points are important when migrating an existing system to the cloud.
A good understanding of the services you are using.
A simple deployment topology; in other words, minimize the number of components you use.