Auto Failover with PostgreSQL 12

Microsoft

Oct 17, 2019

Earlier this month the PostgreSQL community announced the release of PostgreSQL 12!

Because Postgres is developed entirely in the open, we on the pg_auto_failover team could test the new code base ahead of time, and this summer we discovered some API changes in standby settings. You’ll be pleased to know that we have released pg_auto_failover 1.0.5 with Postgres 12 compatibility and have already published our packages.

Postgres recovery changes

Postgres 12 changed how recovery is configured: the recovery.conf file is no longer used, and its mere presence now prevents your Postgres instance from starting at all. In previous versions of Postgres, the recovery.conf file was used to signal to PostgreSQL that we expect the server to remain in recovery mode, either as a PITR instance or as a standby instance. In Postgres 12 you signal that by creating either the standby.signal or the recovery.signal file instead. As a result, the recovery and replication parameters now live in the regular PostgreSQL configuration, just like any other GUC.
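To make the difference concrete, here is a minimal Python sketch of how a tool might interpret the files found in a data directory before and after Postgres 12. It is an illustration only, not the pg_auto_failover code: the function name `recovery_setup` and the returned labels are hypothetical, and the sketch only looks at which files exist, not at their contents.

```python
from pathlib import Path


def recovery_setup(pgdata: Path, major_version: int) -> str:
    """Describe the recovery setup implied by the files in a data directory.

    Hypothetical helper for illustration; it only checks file presence.
    """
    has_recovery_conf = (pgdata / "recovery.conf").exists()

    if major_version < 12:
        # Before Postgres 12, recovery.conf both signals recovery mode
        # and holds the recovery parameters.
        return "in recovery" if has_recovery_conf else "normal"

    if has_recovery_conf:
        # Postgres 12 and later refuse to start when recovery.conf exists.
        raise RuntimeError("recovery.conf found: Postgres 12+ will not start")

    # From Postgres 12 on, empty signal files request recovery mode.
    # standby.signal is checked first because it takes precedence
    # when both files are present.
    if (pgdata / "standby.signal").exists():
        return "standby"
    if (pgdata / "recovery.signal").exists():
        return "targeted recovery"
    return "normal"
```

For example, `recovery_setup(Path("/var/lib/postgresql/12/main"), 12)` would return `"standby"` for a data directory that contains a standby.signal file; the path here is just a placeholder.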

To make pg_auto_failover compatible with Postgres 12, we adjusted the code to handle these new settings.

What is pg_auto_failover?

The pg_auto_failover project aims to provide fully automated HA for Postgres, in a simple and correct way. Simple to use. Correct implementation. That means the solution is robust, well tested, and easy to set up and get started with. Also, anything that can be done automatically will be done automatically, and when the situation does not allow for an automated decision, pg_auto_failover refuses to take any action.

A telling example of our approach to HA in pg_auto_failover can be found in the way we handle a primary server where Postgres is discovered not to be running, when we expect it to be running and it was known to be running before that. In that case, the first thing that pg_auto_failover does is to restart Postgres. Because that’s the simplest way to fix your production, and in many cases, it will just work. If that fails, pg_auto_failover continues trying; after all, your supervision script that frees some disk space on the WAL volume might need more time to be effective. Only after 3 consecutive failures or 20s spent trying to restart Postgres will pg_auto_failover bail out and fail over to the secondary. And that only happens when the secondary is known to be available and fully caught up.
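The decision rule described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual pg_auto_failover implementation: the `PrimaryHealth` type, the `next_action` function, and the action labels are all hypothetical names, and the "hold" outcome is our assumption for the case the text leaves open (restarts exhausted, but no secondary that is safe to promote).

```python
from dataclasses import dataclass

# Thresholds taken from the text: 3 consecutive failures or 20 seconds.
MAX_RESTART_FAILURES = 3
MAX_RESTART_SECONDS = 20.0


@dataclass
class PrimaryHealth:
    """Hypothetical snapshot of what is known about the primary and secondary."""
    postgres_running: bool
    consecutive_restart_failures: int
    seconds_spent_restarting: float
    secondary_available: bool
    secondary_caught_up: bool


def next_action(health: PrimaryHealth) -> str:
    """Return the next step for a primary node (illustrative labels only)."""
    if health.postgres_running:
        return "none"

    restarts_exhausted = (
        health.consecutive_restart_failures >= MAX_RESTART_FAILURES
        or health.seconds_spent_restarting >= MAX_RESTART_SECONDS
    )
    if not restarts_exhausted:
        # Restarting Postgres is the simplest fix, so it is tried first.
        return "restart_postgres"

    if health.secondary_available and health.secondary_caught_up:
        return "failover"

    # Assumption: without a secondary that is safe to promote,
    # this sketch declines to fail over.
    return "hold"
```

With two failed restarts after 5 seconds, `next_action` still returns `"restart_postgres"`; after a third failure it returns `"failover"` only if the secondary is both available and caught up.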

That’s just an example of course. I think it’s an important one in that it shows the spirit with which the pg_auto_failover solution has been implemented, and continues to be improved. Simple and correct.

Can I use pg_auto_failover in production now?

Yes! Some people actually use pg_auto_failover in production already, and happily so.

The pg_auto_failover project is already delivering a solid solution. We still have lots of ideas and ambitions in the area of automating Postgres HA, so there’s more to come! We have not implemented everything yet, and we focus on making what we do have very solid. So to decide whether you want to use pg_auto_failover, you should first understand what we have done so far and see if that matches your expectations in terms of features.

So, what is it that we have already done?

We support a single production architecture, with a hard-coded availability trade-off that gives priority to the service over the data in some situations. This allows us to remain very simple and robust.
Registering existing primary servers is possible, without service interruption. Just register your already running PostgreSQL instance as a primary server to the pg_auto_failover monitor and get started from there.
HBA editing is automated in pg_auto_failover, and you don't have to take care of it yourself. It might be that you have specific security rules to implement though, in which case you can of course edit the HBA yourself and discard the pg_auto_failover changes there.

What’s next for pg_auto_failover?

We have a long list of improvements in the works for pg_auto_failover, and many more ideas for the future. We are still building some of the fundamentals in the area of Postgres HA, and working hard to implement a fully automated HA solution for Postgres that is both simple and correct.

Stay tuned for more updates, including user-facing improvements in terms of HA architectures and a set of features that make Docker and Kubernetes integration (even) easier. The main items on our roadmap for the next releases are:

support for multiple secondary servers
support for standbys that are not candidates for failover
fully automated disaster recovery of the monitor and the Postgres nodes
native integration with Docker, including a controller HTTP API
configuration management and syncing between primary and standby

Remember that pg_auto_failover is fully Open Source. All the development happens in the open on GitHub. You are more than welcome to contribute, either by opening issues where you can let us know about shortcomings, bugs, or feature ideas, or by opening a Pull Request where you improve the product!

Updated Oct 09, 2020
