Azure Database for PostgreSQL Flexible Server addresses several fundamental requirements, including security, availability, reliability, scalability, performance, and business continuity and disaster recovery, which makes it suitable for running your mission-critical workloads.
This blog focuses on the high availability (HA) aspect of Flexible Server PostgreSQL, including two new capabilities of the HA feature:
The ability to deploy the standby in the same zone as the primary (same-zone HA).
The ability to choose the standby AZ for your zone-redundant HA.
What is the Flexible Server PostgreSQL high availability architecture?
Flexible Server PostgreSQL deploys a standby server with the same compute and storage configuration as the primary, on a different physical node within the same region. Depending on the HA deployment model you choose, the standby is placed in the same availability zone (AZ) as the primary or in a different AZ. With health monitoring and automatic failover in place, the Flexible Server HA configuration helps keep uptime high during both planned and unplanned outages.
The Flexible Server HA architecture uses PostgreSQL streaming replication in synchronous mode to stream write-ahead log (WAL) records to the standby server. Application writes and commits are first written to the primary server's WAL, which is streamed to the standby server. A write is acknowledged to the application only after the WAL data is persisted on the standby. This provides zero data loss in the event of a failover. See this documentation for more details. Currently, the standby server cannot be used to serve read workloads.
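To make the acknowledgment order concrete, here is a toy Python model of a primary/standby pair. It is not how the service or PostgreSQL is implemented; it only contrasts the synchronous ordering described above with a hypothetical asynchronous mode in which the commit is acknowledged before the standby has the record. `Node`, `ToyReplicatedPair`, and `lost_after_failover` are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy server: holds only the WAL records it has durably persisted."""
    name: str
    wal: list[str] = field(default_factory=list)


class ToyReplicatedPair:
    """Toy primary/standby pair; illustrates ordering only, not PostgreSQL internals."""

    def __init__(self, synchronous: bool = True) -> None:
        self.primary = Node("primary")
        self.standby = Node("standby")
        self.synchronous = synchronous
        self.in_flight: list[str] = []     # on the primary, not yet on the standby
        self.acknowledged: list[str] = []  # commits the application was told succeeded

    def commit(self, record: str) -> None:
        self.primary.wal.append(record)        # 1. write to the primary's WAL
        if self.synchronous:
            self.standby.wal.append(record)    # 2. wait until the standby persists it
        else:
            self.in_flight.append(record)      # hypothetical async mode: ship later
        self.acknowledged.append(record)       # 3. acknowledge to the application

    def fail_primary(self) -> Node:
        """Primary crashes: unshipped records are lost and the standby is promoted."""
        self.in_flight.clear()
        return self.standby


def lost_after_failover(synchronous: bool) -> list[str]:
    """Acknowledged commits missing from the promoted standby after a failover."""
    pair = ToyReplicatedPair(synchronous=synchronous)
    for i in range(3):
        pair.commit(f"txn-{i}")
    new_primary = pair.fail_primary()
    return [r for r in pair.acknowledged if r not in new_primary.wal]


print(lost_after_failover(synchronous=True))   # []
print(lost_after_failover(synchronous=False))  # ['txn-0', 'txn-1', 'txn-2']
```

In the synchronous case every acknowledged commit is already on the standby when it is promoted, which is what "zero data loss" means here.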
Flexible Server PostgreSQL uses Premium managed disks (locally redundant storage within the AZ, with three copies of data) to store the data and logs of each server. With an HA configuration, you therefore have six copies of data across the primary and standby servers, which provides high data resiliency and isolation. Periodic data backups (snapshots) are taken from the primary server, and WAL files are continuously archived to backup storage. Both snapshot and WAL backups are stored on zone-redundant storage (ZRS) in regions that support AZs; otherwise, they are stored on locally redundant storage (LRS).
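The copy counts and the backup storage rule above reduce to simple arithmetic. The two helpers below are hypothetical and only restate the rules in this paragraph: three locally redundant copies per server, one or two servers depending on HA, and ZRS versus LRS for backups depending on AZ support.

```python
def data_copies(ha_enabled: bool, copies_per_server: int = 3) -> int:
    """Each server's Premium managed disk keeps 3 LRS copies; HA adds a standby server."""
    servers = 2 if ha_enabled else 1
    return servers * copies_per_server


def backup_redundancy(region_supports_azs: bool) -> str:
    """Snapshots and archived WAL go to ZRS where AZs exist, otherwise to LRS."""
    return "ZRS" if region_supports_azs else "LRS"


print(data_copies(ha_enabled=True))                       # 6
print(data_copies(ha_enabled=False))                      # 3
print(backup_redundancy(True), backup_redundancy(False))  # ZRS LRS
```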
For the detailed architecture, steady-state operations, the planned and unplanned downtime experience, and HA workflow mechanisms, see the HA documentation.
What HA deployment models are available with Flexible Server PostgreSQL?
Flexible Server PostgreSQL supports two HA deployment models.
1. Zone-redundant HA
You can configure your server in zone-redundant HA mode, in which the primary and standby servers are deployed in different AZs within a region. You can now choose the AZ for your standby server, which gives you more control to co-locate your clients and applications with the database in both the primary and the standby AZs. Zone-redundant HA offers a 99.99% uptime SLA. See here for details.
Figure 1: Diagram of zone-redundant HA architecture
2. Same-zone HA
The other HA deployment model, which we recently introduced, is same-zone HA. With this option, your standby server is automatically provisioned in the same AZ as the primary. This reduces the round-trip latency of WAL writes and commits, because replication traffic stays within the AZ instead of crossing AZs (a cross-AZ round trip can take up to 2 ms), while still providing compute and storage isolation. This model is also useful for providing redundancy in regions that don't support AZs yet, or in regions with restrictions on deploying zone-redundant HA. Same-zone HA offers a 99.95% uptime SLA. See here for details.
Figure 2: Diagram of same-zone HA architecture
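Why does keeping replication traffic inside one AZ matter? A single session that commits back to back waits for at least one replication round trip per commit, so the round-trip time caps its serial commit rate. The sketch below uses the 2 ms cross-AZ upper bound mentioned above; the 1 ms local commit cost and the 0.5 ms same-zone round trip are assumed values for illustration, not measurements, and `max_serial_commits_per_sec` is a hypothetical helper.

```python
def max_serial_commits_per_sec(base_commit_ms: float, replication_rtt_ms: float) -> float:
    """Upper bound on commits/sec for ONE session committing back to back.

    With synchronous replication, each commit waits for at least one extra
    network round trip to the standby on top of its local cost.
    """
    return 1000.0 / (base_commit_ms + replication_rtt_ms)


# Assumed inputs: 1 ms local commit cost; 2 ms cross-AZ round trip (upper bound
# from the text); 0.5 ms same-zone round trip (hypothetical).
cross_az = max_serial_commits_per_sec(1.0, 2.0)
same_zone = max_serial_commits_per_sec(1.0, 0.5)
print(round(cross_az), round(same_zone))  # 333 667
```

Workloads with many concurrent sessions are less sensitive to this, because commits from different sessions can overlap.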
How do I deploy, manage, and test HA?
Flexible Server PostgreSQL provides a click-button experience for deploying an HA configuration. You can also use the Azure CLI, ARM templates, SDKs, or Terraform to deploy your servers. By default, HA is enabled for Memory Optimized SKUs (large production workloads). Once you check the HA box and choose the deployment model, the service deploys the standby server in the same AZ or in a different AZ, depending on your choice.
Figure 3: Create screen experience to choose HA deployment
In regions where AZs are not supported, same-zone HA is the only available HA deployment model, and you cannot choose AZs.
Figure 4: Screenshot of same-zone selection
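If you deploy with ARM templates or an SDK rather than the portal, the choices above end up as a few server properties. The sketch below is a hypothetical Python helper that builds that fragment and enforces the rules described here. The property names (`availabilityZone`, `highAvailability.mode`, `highAvailability.standbyAvailabilityZone`) and the mode values `Disabled`, `ZoneRedundant`, and `SameZone` reflect the `Microsoft.DBforPostgreSQL/flexibleServers` ARM schema as we understand it; verify them against the API version you deploy with.

```python
import json
from typing import Optional

# Assumed HA mode values; check the ARM/API reference for your API version.
VALID_MODES = ("Disabled", "ZoneRedundant", "SameZone")


def ha_properties(mode: str, region_supports_azs: bool,
                  primary_zone: Optional[str] = None,
                  standby_zone: Optional[str] = None) -> dict:
    """Build the HA-related part of a server's `properties` block (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown HA mode: {mode!r}")
    if not region_supports_azs:
        # Without AZs, same-zone HA is the only HA model and zones can't be chosen.
        if mode == "ZoneRedundant":
            raise ValueError("zone-redundant HA needs a region with AZ support")
        if primary_zone or standby_zone:
            raise ValueError("AZs cannot be chosen in this region")
        return {"highAvailability": {"mode": mode}}

    props: dict = {"highAvailability": {"mode": mode}}
    if primary_zone:
        props["availabilityZone"] = primary_zone
    if standby_zone:
        if mode != "ZoneRedundant":
            raise ValueError("a standby AZ can only be chosen for zone-redundant HA")
        if standby_zone == primary_zone:
            raise ValueError("the standby AZ must differ from the primary AZ")
        props["highAvailability"]["standbyAvailabilityZone"] = standby_zone
    return props


print(json.dumps(ha_properties("ZoneRedundant", True, "1", "2"), sort_keys=True))
print(json.dumps(ha_properties("SameZone", False)))
```

The first call prints `{"availabilityZone": "1", "highAvailability": {"mode": "ZoneRedundant", "standbyAvailabilityZone": "2"}}` and the second prints `{"highAvailability": {"mode": "SameZone"}}`.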
After the server is created, whether or not HA was enabled at creation time, you can also:
Change the HA deployment model (requires you to first disable HA and then choose a different model)
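The rule for changing models can be written as a tiny planning function. This is a hypothetical helper, using the mode names `Disabled`, `ZoneRedundant`, and `SameZone` (assumed here to match the API's HA mode values); it only encodes the constraint stated above that you must disable HA before switching models.

```python
def plan_ha_change(current_mode: str, target_mode: str) -> list[str]:
    """Return the sequence of HA mode updates to apply, in order (hypothetical)."""
    if current_mode == target_mode:
        return []                            # nothing to do
    if "Disabled" in (current_mode, target_mode):
        return [target_mode]                 # enabling or disabling is a single step
    return ["Disabled", target_mode]         # model change: disable first, then re-enable


print(plan_ha_change("SameZone", "ZoneRedundant"))  # ['Disabled', 'ZoneRedundant']
print(plan_ha_change("Disabled", "SameZone"))       # ['SameZone']
```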
You can also use the on-demand Forced failover option to test your application's connectivity to the database server, observe application downtime during a failover, and improve your retry logic. This triggers a fault on the primary server and initiates the failover workflow. You can also use the Planned failover option to bring the primary server back to the preferred AZ.
Figure 6: Screenshot to perform on-demand Forced failover
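A failover drops existing connections, so applications should reconnect with retries. Below is a minimal sketch of capped exponential backoff with jitter, assuming your driver surfaces connection failures as `ConnectionError` (with psycopg2, for example, you would catch `psycopg2.OperationalError` instead). `with_retries` and `make_flaky_connect` are hypothetical names; the latter simulates a server that refuses connections while a failover is in progress. The attempt count and delays are placeholders to tune against the downtime you observe during a Forced failover.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T], attempts: int = 6,
                 base_delay_s: float = 0.5, max_delay_s: float = 8.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying on ConnectionError with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                          # out of attempts: surface the error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, delay))    # full jitter avoids reconnect storms
    raise ValueError("attempts must be at least 1")


def make_flaky_connect(failures: int) -> Callable[[], str]:
    """Simulated connect() that fails `failures` times, as during a failover."""
    calls = 0

    def connect() -> str:
        nonlocal calls
        calls += 1
        if calls <= failures:
            raise ConnectionError("server unavailable (failover in progress)")
        return "connected"

    return connect


# Skip real sleeping in this demo; keep the default time.sleep in real code.
print(with_retries(make_flaky_connect(3), sleep=lambda s: None))  # connected
```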
Comparing Zone-redundant HA vs Same-zone HA
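Feature | Zone-redundant HA | Same-zone HA
Standby server with synchronous replication | Yes | Yes
Storage – 3x redundant copy | Yes | Yes
Compute auto-restart after a failure | Yes | Yes
Reduced downtime during scheduled maintenance with HA | Yes | Yes
AZ-level protection for compute & storage | Yes | No*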
*Using the zone-redundant backup (if available in the region), you can do a point-in-time restore to a different AZ within the region.
What are the HA limitations?
See the documentation for the list of limitations when deploying HA with Flexible Server.
What about availability for non-HA servers?
Flexible Server offers robust resiliency and availability capabilities for your databases even without HA. You still get the following benefits without incurring twice the cost:
Three copies of data on Premium managed disks, with auto-repair capabilities.
Backups on zone-redundant Azure Blob Storage. This provides zone-level resiliency: you can restore your data to another AZ if your server's AZ goes down.
The database server is automatically restarted if it goes down for any reason.
The compute VM is automatically relocated within the AZ in case of issues such as a node crash.
Non-HA deployments are offered an uptime SLA of 99.9%. See here for details.
However, depending on the outage, you may still encounter some downtime (a longer recovery time objective, or RTO). For example, after a node crash, your application experiences downtime until a new VM is provisioned. This may be acceptable for test/dev environments, but for mission-critical workloads that demand high uptime during planned and unplanned outages, we highly recommend deploying with an HA configuration.
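To put the three SLA figures in perspective, here is the maximum downtime each percentage allows over a 30-day month. This is plain arithmetic under an assumed 30-day period; the SLA documents themselves define how availability is measured and how credits apply.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes (assumed period)


def allowed_downtime_minutes(sla_percent: float,
                             period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Maximum downtime within the period that still meets the given uptime SLA."""
    return period_minutes * (1 - sla_percent / 100)


for label, sla in [("Zone-redundant HA", 99.99), ("Same-zone HA", 99.95), ("No HA", 99.9)]:
    print(f"{label}: {sla}% -> about {allowed_downtime_minutes(sla):.1f} min per 30-day month")
```

This prints roughly 4.3, 21.6, and 43.2 minutes respectively.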
How about HA across regions?
Many of you have asked about HA across regions. We treat a regional fault as a disaster recovery (DR) scenario. Geo-redundant backup is currently in preview, and we are also planning to address geo-DR using asynchronous replication in the future.