PostgreSQL is a transactional database, and it keeps a record for all DML operations and transactions like Update, Insert, and Delete in WAL (Write-Ahead Log) file. WAL logs are PostgreSQL transaction log files used to ensuring data integrity. This log is written after changes to have been applied to the record and eventually once reaching the checkpoint threshold the log file will be flushed from memory to storage to save it permanently.
A checkpoint is a point in the write-ahead log sequence at which all data files have been updated to reflect the information in the log. All data files will be flushed to disk. Refer to WAL Configuration for more details about what happens during a checkpoint.
Why an Azure Database for PostgreSQL restarts?
The Azure Database for PostgreSQL – Single Server service provides a guaranteed high level of availability with the financially backed service level agreement (SLA) of 99.99% uptime. Azure Database for PostgreSQL provides high availability during planned events such as user-initiated scale compute operation, and when unplanned events such as underlying hardware, software, or network failures occur. Azure Database for PostgreSQL can quickly recover from most critical circumstances, ensuring virtually no application down time when using this service. Refer to High availability in Azure Database for PostgreSQL – Single Server
Causes of server restarts:
What causes long recovery on Azure Database for PostgreSQL
Recent checkpoints are critical for fast server recovery. Once a restart happens, either it was a new instance (failover to healthy instance) or same instance (in-place restart) will connect to disk that has all logs, all WAL logs after the last successful checkpoint need to be applied to the data pages before the server starts to accept connections again. Those logs are called REDO logs and will be applied via the recovery operation. Applying a WAL log runs through the following steps:
Recovery time depends on how recent the last checkpoint was and the amount of data inside those log files, that said, the best practice is that application developer needs to avoid log running transactions and tune checkpoint frequency to avoid long recovery.
Checkpoint frequency can be adjusted by configuring server parameters. controlled by checkpoint_completion_target which determines the total time between checkpoints. Another parameter that you may consider is bgwriter_delay which specifies the delay between activity rounds for the background writer.
Long running transactions:
Long running transactions are queries that are running for too long which impact database perfromance and can potentially cause issues during restarts. You may check all running transactions by querying pg_stat_activity, to list queries which running for more than 3 minutes, use the following query:
select current_timestamp-query_start as runtime,
datname,usename, query FROM pg_stat_activity
where state='active' and current_timestamp-query_start> '3 min'
order by 1 desc;
Please note that you can kill any long running PID using pg_terminate_backend. Let’s say you have PID “12345” and you want to kill this process, you may simply run the following query to kill it.
where pid = ‘12345’;
If you want to kill all process on the server, run the following command:
WHERE pg_stat_activity.datname = 'TARGET_DB' -- ← change this to your DB
AND pid <> pg_backend_pid();
How to Prepare for Azure Database for PostgreSQL planned server restart
Now that we have learned what causes long recovery, we will need to prepare for planned restarts either it was user initiate or system initiated.
Please note if your server is under status restarting, scaling, restoring please avoid initiating another restart, this will cause longer recovery. You may check the server status on Azure Portal Overview blade for your Azure Database for PostgreSQL Server.
Frequently Asked Questions
My Azure PostgreSQL server has 100% CPU or maxed out connections, what to do?
In this scenario, the best practice is to prepare for the server restart since this is a transactional database. First, work on reducing the workload from the application by stopping the application and then kill the running processes using pg_terminate_backend on the Azure PostgreSQL database. You may also consider stopping new incoming connetion by updating the firewall rules to prevent new connections. Once all the connections are termintated run a manual checkpoint and then restart the server.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.