DNS Load Balancing in Lync Server 2010

First published on TECHNET on May 25, 2011

Microsoft Lync Server 2010 communications software introduces DNS load balancing, a native load-balancing option that can be configured quickly and easily and is very cost effective. Lync Server 2010 DNS load balancing also supports “connection draining.”


Author: Keith Hanna

Publication date: May 2011

Revision history:

· November 2011: To reduce customer confusion, removed the table on Server Roles, Services, and Clients that Support DNS Load Balancing.

Product version: Microsoft Lync Server 2010


Microsoft Office Communications Server 2007 R2 and Office Communications Server 2007 require a Hardware Load Balancer (HLB) to provide resilience for the Enterprise pool. Microsoft Lync Server 2010 introduces support for DNS load balancing for SIP, as an additional option to hardware load balancing. This article explains what DNS load balancing is and where it’s used within Lync Server.


What is DNS Load Balancing?


Domain Name System (DNS) load balancing uses DNS as a way to load-balance across multiple servers. It is implemented at the application level in both servers and clients, and both participate in the load-balancing logic.


The DNS load balancing process is straightforward:


1. The front-end servers register their fully qualified domain name (FQDN) as A records in DNS.


2. When the Enterprise pool is created, the pool FQDN (the target of the SRV record) is registered in DNS so that a query for it returns the list of IP addresses of all the front-end servers.


3. The client attempts to connect to one of the IP addresses that were returned. If this connection fails, the client attempts to connect to the next IP address in the list until the connection succeeds.


Note Is this DNS “round robin”? Not quite. DNS “round robin” typically refers to a method of load balancing the results of a DNS query: the first client connection request receives the first record, the second query receives the second record, and so on. In each case, only one record is returned. There is no intelligence behind this method to facilitate failover; it is purely a connection-distribution method.
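
To make step 3 concrete, here is a minimal Python sketch of the client-side behavior, assuming an example pool FQDN of pool.contoso.com and the SIP/TLS port 5061. It is illustrative only and not actual Lync client code.

```python
# Minimal sketch of a DNS-load-balancing-aware client: resolve the pool FQDN,
# shuffle the returned A records, and try each front-end server until one
# accepts the connection. Illustrative only; not Lync client code.
import random
import socket

def connect_to_pool(pool_fqdn: str, port: int = 5061, timeout: float = 5.0) -> socket.socket:
    # gethostbyname_ex returns (hostname, aliases, list_of_ip_addresses)
    _, _, ip_addresses = socket.gethostbyname_ex(pool_fqdn)
    random.shuffle(ip_addresses)              # choose servers in random order
    for ip in ip_addresses:
        try:
            return socket.create_connection((ip, port), timeout=timeout)
        except OSError:
            continue                          # this server is unreachable; try the next one
    raise ConnectionError(f"No front-end server in {pool_fqdn} could be reached")

# Example usage (example FQDN):
# conn = connect_to_pool("pool.contoso.com")
```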


How Does DNS Load Balancing Work—in Practice?


To explain how DNS load balancing works, let’s assume we have a single pool that consists of four front-end servers with a back-end server running Microsoft SQL Server, as shown in Figure 1.


Figure 1. Pool configuration



DNS is configured as shown in Table 1. Note that the pool FQDN has multiple entries, each referring to a separate front-end server within the pool.


Table 1. DNS configuration

FQDN                  | IP/FQDN                                            | Comments
_sip._tls.contoso.com | Pool.contoso.com                                   | This is the SRV record that is used during automatic logon.
FE1.contoso.com       | 192.168.1.1                                        |
FE2.contoso.com       | 192.168.1.2                                        |
FE3.contoso.com       | 192.168.1.3                                        |
FE4.contoso.com       | 192.168.1.4                                        |
Pool.contoso.com      | 192.168.1.1, 192.168.1.2, 192.168.1.3, 192.168.1.4 | Without using DNS load balancing, this FQDN would resolve to the IP address of the hardware load balancer, which in turn would be responsible for directing traffic to the front-end servers.



The client queries DNS to resolve the FQDN of the pool (for example, pool.contoso.com), just as it does with Office Communications Server 2007 R2. With previous versions of Office Communications Server, this query would return the virtual IP address of the hardware load balancer; with DNS load balancing, it instead returns the list of front-end server IP addresses shown in Table 1.


In Lync Server, the DNS query returns the list {192.168.1.1, 192.168.1.2, 192.168.1.3, 192.168.1.4} to the client. The order in which these addresses are returned is irrelevant. The client chooses an IP address from the list at random and attempts to connect to that front-end server. If the connection fails, the client continues choosing at random from the remaining addresses until a connection to the pool succeeds or the list is exhausted.


Client Registration


With previous versions of Office Communications Server, the client could successfully connect to any front-end server, which would then register the client’s SIP URI in the single, shared registration database stored on the back-end server running Microsoft SQL Server. In Lync Server, however, each front-end server in a pool has a completely independent registration database, and each user is assigned a specific registration database (that is, a registrar) to connect to. This registrar assignment is calculated from a hash of the user’s SIP URI. Therefore, it is quite likely that a randomly selected front-end server is the wrong server for the client to connect to. In this example, there is a 75 percent chance that the client contacts the wrong server, and only a 25 percent chance that it reaches the correct server on the first attempt, in which case no SIP redirect would be required.


Important When a user has multiple clients, all of them must register with the same front-end server (registrar). This ensures that every client associated with a given user can be located in a single place, which simplifies the call-routing logic.


A static mapping of users to registrars cannot be used, because individual server failures must be catered for. To solve this problem, Lync Server uses a hash algorithm to determine which front-end server a user’s clients will primarily connect to, as well as the failover order across every front-end server in the pool. This ensures that all clients belonging to the same user consistently connect to the same front-end server and, in turn, to the same registration database. The hash algorithm is based on the maximum number of servers in a pool (10), which helps ensure that users are evenly spread across all available front-end servers. Using the maximum number of servers also means that hash values never need to be recomputed when servers are added to or removed from the pool.
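
The following Python sketch illustrates the idea of a per-user, deterministic registrar ordering. The actual hash algorithm used by Lync Server is internal; the digest-seeded shuffle below is an assumption chosen only to show the properties described above (stable per user, spread over 10 slots).

```python
# Hypothetical illustration (Lync's real hash algorithm is internal): derive a
# stable, per-user permutation of the 10 possible server IDs from the user's
# SIP URI, so every client of that user computes the same registrar order.
import hashlib
import random

MAX_SERVERS_PER_POOL = 10   # the hash is always computed over the maximum pool size

def registrar_preference(sip_uri: str) -> list[int]:
    # Seed a deterministic shuffle with a digest of the normalized SIP URI.
    seed = int.from_bytes(hashlib.sha256(sip_uri.lower().encode()).digest()[:8], "big")
    ids = list(range(1, MAX_SERVERS_PER_POOL + 1))
    random.Random(seed).shuffle(ids)
    return ids   # primary registrar ID first, then the backup order

# The same user always gets the same ordering, on any client or server:
print(registrar_preference("sip:user1@contoso.com"))
```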


Let’s look at the following example in more detail.


In Figure 1, we have four front-end servers, {FE1, FE2, FE3, FE4}. The client retrieves this list of servers from DNS and randomly connects to one of them. In our example, let’s assume that the client connects to FE3. Upon connecting, the client presents its SIP URI, from which the front-end server generates a hash. From this hash, the server determines which registrar is assigned to that user.


Note When a user is first enabled on (or moved to) a pool, a hash is generated to determine which front-end server is the primary registration database for the user, along with the order in which the remaining front-end servers will be attempted (as the backup registrar services).


For our example, the user’s hash results in {FE4, FE2, FE1, FE3}. This is the order in which the user’s clients will attempt to register.


The client attempts to register with FE3, but because it’s not the primary registrar assigned to the user, FE3 redirects the client to FE4 as the correct registrar to connect to. The client successfully registers with FE4.
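
A sketch of the redirect decision on the server side is shown below. It is illustrative only: the mapping of server IDs to FQDNs and the preference list are example values, and the SIP response details are omitted.

```python
# Illustrative only (not product code): how a front-end server that receives a
# registration might decide whether to accept it or redirect the client. The
# user's registrar preference order is assumed to come from the SIP URI hash
# described above; server IDs and FQDNs are example values.
def handle_register(my_server_id: int,
                    user_preference: list[int],          # e.g. [4, 2, 1, 3, ...]
                    id_to_fqdn: dict[int, str]) -> str:
    primary_id = user_preference[0]
    if primary_id == my_server_id:
        return "registered here (this server is the user's primary registrar)"
    # Otherwise send a SIP redirect pointing the client at the correct registrar.
    return f"redirect to {id_to_fqdn[primary_id]}"

# The walk-through above: the client contacts FE3 (ID 3), but the user's primary
# registrar is FE4 (ID 4), so FE3 redirects the client to FE4.
servers = {1: "FE1.contoso.com", 2: "FE2.contoso.com",
           3: "FE3.contoso.com", 4: "FE4.contoso.com"}
print(handle_register(3, [4, 2, 1, 3], servers))   # redirect to FE4.contoso.com
```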


Now consider what happens when a front-end server is unavailable. There are two scenarios:



  • The initial server chosen from the DNS response is unavailable. With DNS load balancing, the client simply attempts another server from the list until a connection succeeds. From there, registration proceeds by using the logic described earlier.

  • The primary registrar is unavailable. In this scenario, the server that provides the redirect is fully aware of the state of the other servers in the pool, and it redirects the client to the first available backup registrar as determined by the hashed ordering derived from the user’s SIP URI.


In the previous example, we considered only four servers; however, as previously mentioned, hash values are calculated based on the maximum number of servers in a pool (10). When front-end servers are added to a pool in Topology Builder, each is assigned an ID in the range 1 through 10 so that the hash mapping can take place. These IDs are assigned randomly.


This introduces another complexity: if fewer than 10 servers are deployed, how is a front-end server determined to be available? Figure 2 shows a mapping of IDs to front-end servers for our example.


Figure 2. Server ID mapping



All registrars marked with “X” are unavailable, either because of a temporary failure or because they have not yet been commissioned.


Figure 3 shows two sample logical registrar sequences for two different users. This is used to determine the order in which the respective user’s clients will attempt to register to their pool.


Figure 3. User Logical Registrar Sequence



Using both the logical registrar sequence and the physical registrar sequence, the pool can determine which registrar the user’s client should connect to. This is done by iterating through the logical sequence and mapping each entry to the physical registrar sequence until the first available physical registrar (that is, front-end server) is found. In the case of User 1, the first logical registrar, {7}, maps to FE3, as shown in Figure 4.


Figure 4. Example of user mapping



In the case of User 2, the first three logical registrars, {3, 6, 2}, are unavailable. The next logical registrar, {7}, which maps to FE3, is available. User 2’s client would connect to FE3. In this case, User 2 is said to be connected in backup mode.
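
A minimal Python sketch of this lookup follows, under the assumption that the pool tracks a mapping of commissioned server IDs to FQDNs and a set of currently available servers. The names and structures below are illustrative, not product code.

```python
# Sketch of the mapping described above (assumed behavior, not product code):
# walk the user's logical registrar sequence and return the first ID that maps
# to a front-end server that is both commissioned and currently up.
def resolve_registrar(logical_sequence: list[int],
                      physical_map: dict[int, str],      # ID -> FQDN for deployed servers
                      available: set[str]) -> tuple[str, bool]:
    for position, server_id in enumerate(logical_sequence):
        fqdn = physical_map.get(server_id)       # None if the ID is not commissioned
        if fqdn is not None and fqdn in available:
            backup_mode = position > 0           # not the user's primary registrar
            return fqdn, backup_mode
    raise RuntimeError("No front-end server in the pool is available")

# User 2 from Figures 3 and 4: the first three logical IDs don't map to an
# available server, so the user lands on FE3 in backup mode.
physical = {7: "FE3.contoso.com", 1: "FE1.contoso.com"}          # example subset
up = {"FE3.contoso.com", "FE1.contoso.com"}
print(resolve_registrar([3, 6, 2, 7], physical, up))             # ('FE3.contoso.com', True)
```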


Server Failure and Recovery


Now our client has the DNS and registration information. What happens in the scenario where a server has failed?


When a server fails, the physical registrar sequence is updated to mark that server as unavailable, and the updated sequence is shared among all surviving servers by using a server-to-server heartbeat. This ensures that all the servers are continually aware of the state of the pool. Connecting clients are handled as shown in Figure 3 and Figure 4: any users whose primary registrar is the failed server are redirected to the next server in their logical registrar sequence and are then connected in backup mode.


At some point in the future, the failed server is recovered, returning the physical registrar sequence to its original state. When the physical registrar sequence is updated, each server that remained available during the outage checks whether any of the users connected to it in backup mode should be registered as primary on the now-recovered server. If there are any such users, they are de-registered and redirected to their primary registrar.


Note De-registration is carried out in batches of users (not batches of clients) to ensure that the network is not overloaded. All clients that belong to a user must be re-registered in the same batch. Because of this batching, it takes some time for the front-end servers to stabilize.
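
As a rough sketch of the batching idea only (the batch size, grouping, and helper below are assumptions, not the product’s implementation):

```python
# Rough sketch of the batching idea (batch size, grouping, and the helper below
# are assumptions, not the product's implementation). Users in backup mode whose
# primary registrar has recovered are re-homed in batches of users, and all
# clients belonging to a user are handled in the same batch.
from itertools import islice

def deregister_and_redirect(client: str) -> None:
    # Placeholder for the real work: de-register this client so that it
    # re-registers with its (now-recovered) primary registrar.
    print(f"re-homing {client}")

def rehome_in_batches(users_to_rehome: list[str],
                      clients_by_user: dict[str, list[str]],
                      batch_size: int = 50) -> None:
    users = iter(users_to_rehome)
    while batch := list(islice(users, batch_size)):       # one batch of users at a time
        for user in batch:
            for client in clients_by_user.get(user, []):  # every client of the user
                deregister_and_redirect(client)           # stays in the same batch
```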


Server Commission and Decommission


The difference between this scenario and server failure is that commissioning or decommissioning a server changes the topology. When topology changes occur, the logical registrar sequence is recalculated for all users, which can result in some users being re-homed to a different front-end server in the same pool.


Using the same example from Figure 3 and the users in Figure 4, if a new front-end server is commissioned and given ID=3 (or 6, or 2), the mapping for User 2 changes, introducing a new primary registrar for that user, as shown in Figure 5. When the server is fully operational, the heartbeat process updates the physical registrar sequence, which triggers a check for users in backup mode and results in the batched re-registration process (if necessary).


Figure 5. User mapping to a new server



If the newly commissioned server were instead given ID=10, nothing would change for either User 1 or User 2, because both are currently registered to a server that appears before {10} in their logical registrar sequence. The only effect would be that both users gain an additional server that could be used in backup mode.


Decommissioning is very similar to server failure, except that re-homing users to a new primary registrar is part of the decommission process: the topology change triggers recalculation of the logical registrar sequence, a step that does not occur when a server simply fails.


Note In these examples, the server names are shown to match the diagram in Figure 1 to make the correlation easier to follow. The actual physical and logical sequences are managed by using an internal server ID, not the server name.


Why Do We Still Need Hardware Load Balancers?


Lync Server’s HTTP and HTTPS Web traffic relies on session state. This means that if Client A starts a conversation with Server A, it needs to continue talking to Server A to complete the entire request. With DNS load balancing, there is no way to set up sticky-session state, so there is no way to ensure that a session continues on Server A.


A hardware load balancer addresses this session problem by caching the client-server state information: when the next request comes in from Client A, the HLB refers it back to Server A, regardless of whether Server A is busy; if it is, the HLB waits and sends the request when possible. Hence, DNS load balancing is not a solution for Web-based traffic.


Summary


In addition to supporting hardware load balancers, Lync Server introduces the option to load balance clients’ SIP connections to a pool of front-end servers using DNS. DNS load balancing is only part of the client connectivity matrix. The internal hashing and distribution of the client registration information is the other part. The two mechanisms work together to determine how a client connects to a pool.


DNS load balancing is not supported for load balancing Web traffic. As a result, hardware load balancing is still required for load balancing Web traffic (such as address book services) in Lync Server 2010.


A hardware load balancer is also still required to load balance SIP traffic when operating in a mixed environment of Lync Server and legacy versions (Office Communications Server 2007 R2 or Office Communications Server 2007).

