Blog Post

Azure Database for PostgreSQL Blog
5 MIN READ

Failover Between Regions with Azure PostgreSQL Flexible Server without connection string changes

bmckerr-MSFT's avatar
bmckerr-MSFT
Icon for Microsoft rankMicrosoft
Mar 17, 2023

Overview

One common pattern that we see is when customers want maximum in-region availability coupled with a cross region disaster recovery option. This typically manifests itself as, Zone Redundant HA in the primary region and a Read Replica in another region, as illustrated in the following diagram:

 

 

With ZRHA, failover between the Primary and Standby servers is automatically managed by the Azure platform and importantly, the service endpoint name does not change. However, currently with a regional failover, which is a manual process, the service endpoint name does change. A number of customers have expressed an interest in an option to perform regional failover without having to update application connection strings. By using the power and simplicity of DNS we will explain how you can ensure that connection string changes are not required when failing over between regions.

Setup

For the remainder of this article, we will use the following simplified architecture diagram as our starting point:

 

 

 

Service Name

IP

Server

Region

aue-primary-01.postgres.database.azure.com

10.0.1.4

Primary

Australia East

ause-repl-01.postgres.database.azure.com

192.168.1.4

Replica

Australia Southeast

 

We have a single Primary server, located in Australia East, which has a Replica in Australia Southeast. Both servers are setup using Private Access (VNET Injection) which uses RFC 1918 address spaces. In principle, this solution should work with Public Access servers also, but for the purposes of this article, we are focusing on the Private Access networking method.

 

For the above setup to work there is some plumbing that needs to be in place. Most of the configuration information can be found in the Networking overview - Azure Database for PostgreSQL - Flexible Server document.

Summary Prerequisites

 

  • VNET Injection. Both servers are on separate regional VNETs 
  • VNET Peering. Those VNETs need to be peered VNET Peering  
  • Private DNS Zone. Both servers need to be using the same Private DNS Zone Using-a-private-dns-zone
  • Replication. For Cross Region Replication to work. Ensure that the following section of documentation has been followed Replication Across Regions

 

Private DNS Zone

We have created a Private DNS Zone named 'flexserver.private.postgres.database.azure.com':

 

 

 

Both the Primary and Replica have been registered in this zone at server creation time and, as you can see, the servers are given randomly generated ‘Address’ records in the DNS zone:

 

‘A’ Record / Name

IP

Server

b417c0001567.flexserver.private.postgres.database.azure.com

10.0.1.4

Primary

d1433fcf5a76.flexserver.private.postgres.database.azure.com

192.168.1.4

Replica

 

It is possible to resolve these names from within either VNET. For example, here we have a Linux VM, which happens to be on the Australia East VNET, which can resolve the names of both the service name or the private DNS zone name of each of the servers. For clarity, this Linux VM is simply being used here to host the ‘psql’ binary that we are using as our "application" in this article and is not in any way required for the failover;

 

 

Not only name resolution, but courtesy of our VNET peering we can also connect to either database. First, the primary's service name and then the Private DNS Zone alias making use of the PostgreSQL command line application ‘psql’ and requesting full verification of the server certificate:

 

 

And next, the replica, again using both the service name and Private DNS Zone alias:

 

 

To recap, we have setup a Primary and a Replica in another region using Private Access networking, standard VNET peering and Private DNS Zone features. We then verified that we could connect to either database using the service name, or the name allocated by the Private DNS zone. So the question remains, how can I failover to the replica database, for example in a DR drill, and allow my application to connect to this promoted replica, without making any changes to the application configuration? The answer it turns out, is pretty simple….. 

 

In addition to typical DNS record types of ‘A’ (Address) and ‘PTR’ (Pointer) there is a useful record type that can be used as an “alias” and effectively point to another DNS entry. It is a ‘CNAME’ record type and we will show you how to configure one so that it can point to either database in our setup.

 

For our example we will create a CNAME record with value ‘prod’ that points at the ‘A’ record for our Primary server. Inside the Private DNS Zone you can ‘+ Record Set’ and add a CNAME like so:

 

 

 

Note that the default TTL will be 1 hour and you may want to reduce this to prevent DNS clients and applications caching the answer for too long. This can be significant during or after a failover. Once the CNAME record has been added the DNS zone looks like this;

 

 

Notice that the new ‘prod’ name points to the ‘A’ record for the primary server. Let us now verify that we can use the CNAME record to connect to the primary database;

 

 

When we try and connect with full certificate verification this will fail as the certificate does not match the CNAME record 'prod.flexserver.private.postgres.database.azure.com'. You can see what names are provided by the certificate by dumping the text from the certificate using the openssl tool like this:

 

 

You can see from the output above that there are 3 acceptable DNS records baked into the certificate, and our CNAME is not one of them. Therefor, using 'sslmode' of 'verify-full' fails. The workaround for this is to downgrade the certificate requirement to anything less than verify-full, for example, using 'verify-ca' results in a successful connection as show below:

 

 

The PostgreSQL Document on SSL covers the available options for 'sslmode' and what the difference are between modes. It is available here PostgreSQL: Documentation: 15: 34.19. SSL Support

 

 

When you are using Azure Private networking and Azure Private DNS Zones, as we are in this example, it is perfectly acceptable to set sslmode to 'verify-ca'.

 

It is also possible to edit the CNAME DNS record. In our case we are going to point it to the replica:

 

 

After saving the updated CNAME when we connect to ‘prod’, it will actually be the replica, which is in READ-ONLY mode and we can verify that by trying a write operation, such as creating a table:

 

 

Sure enough, the CNAME ‘prod’ is now pointing at the replica as we expected.

 

With what we have learnt so far, we can see that the flexibility of Azure Private DNS and CNAME records is ideal for this use case. The last step will be to perform the failover and complete our testing.

 

In the Azure portal we can navigate to the ‘Replication’ blade of either the Replica or the Standby and ‘Promote’ the Replica:

 

 

After clicking ‘Promote’, this window will appear:

 

Once the newly promoted Replica is available, we verify the following;

  • The CNAME record points to the Replica (now Primary)
  • The database is writeable

 

 

From an application perspective (our application is psql in this article), we haven’t had to make any changes to connect to our database regardless of which region is hosting the workload. This method can be easily integrated within DR procedures or failover testing.  Making use of the Azure CLI to semi-automate these changes is also possible and could possibly reduce the likelihood of human errors associated with changing DNS records. However, DNS changes are, in general, much simpler than making application configuration changes.

 

For reference, here is a link to a great article showing how to configure the 'sslmode' parameter for Java applications connecting with JDBC to Azure PostgreSQL Flexible Server:

Secure your Java application connections to Flexible Server via JDBC and SSL - Microsoft Community Hub

 

 

 

Updated Apr 13, 2023
Version 2.0
  • AndrewCitera's avatar
    AndrewCitera
    Copper Contributor

    bmckerr-MSFT fc388d harinarayanan94 was this resolved? It took me awhile to find it, but this issue is buried in the docs linked here. Based on tests it looks like there's no way to fully bring your primary region back to the same state prior to failover without restoring your DB from a backup because you can't edit the geo-DR backup after creation (see here) and when you create a read replica it doesn't let you modify that setting. 

     

    This is a significant factor in risk and RTO with respect to failback because you're not able to simply re-replicate back to primary with restored settings, but you need to re-replicate back, tag a backup, restore from backup, and then delete the temporarily replicated server. It's a mess.

  • fc388d's avatar
    fc388d
    Copper Contributor

    Hi, thanks for the reported note this clearly describes failover phase (the first one) for DR with Azure postgresql flexible server, any news about the second phase of DR , the failback with Azure Postgres ?

    As mentioned by harinarayanan94 in the comment above, any other option for the failback procedure (once DR event is resolved, the region is back online and user want to move back to primary region), is really the only chance (yet) to 

    "delete the old primary , re-create it and again sync it with the current primary (in DR region) and repoint . and once thats done , again to make our application DR compliant , we have to delete the server in the secondary region and initiate a fresh DR sync with the primary" ?

    Besides if the old primary was configured with Backup type "Geo Redundant Backup"  how to reconfigure it from Replica (it seems not allowed), please can you confirm ?

  • harinarayanan94's avatar
    harinarayanan94
    Copper Contributor

    Hi - thanks for the write up this surely solves one challenges out of the 2 for DR with Azure postgres . 

     

    The solution recommened should help the services point to Primary or secondary without being have to repoint or reconfigure the application , but once an replica is promoted as primary in the secondary region , once DR event is resolved or drill is completed goal would be switch everything back to the primary . The DNS change can be made but now that replication is cut and secondary (which is the current primary) has updated data , the only option left for us is to delete the old primary , re-create it and again sync it with the current primary (in DR region) and repoint . and once thats done , again to make our application DR compliant , we have to delete the server in the secondary region and initiate a fresh DR sync with the primary . Thats the problem we are trying to solve , and this is solved in Azure SQL with Failover groups and Azure dosent seem to have any such features for any flavors of postgres . We didnt have it with Single server and hoped to see it in Flexi , it didnt have it and neither does the new Cosmos Postgres (Citus ) have it . 

  • Thank you Brian, Really helpful during failover scenarios.

  • Really great article Brian..!! Provides a lot of clarity on the failover setup for Azure PGSQL..!!