Postgres horizontal scaling with elastic clusters on Azure Database for PostgreSQL Flexible server
Elastic clusters (Preview) is a Flexible server feature that enables a database to be scaled out horizontally beyond a single node. Powered by the open-source Citus extension developed by Microsoft, elastic clusters offer horizontal scale-out through row- and schema-based sharding. With elastic clusters, you can future-proof your deployment for horizontal scale-out, adding nodes as the application grows beyond the capacity of a single node. By providing a unified endpoint to the cluster, elastic clusters eliminate the need for complex application-layer sharding, significantly simplifying the management of growing Flexible server deployments. Online shard management and isolation capabilities are a core part of the service, offloading this burden from the application stack. With elastic clusters, scaling out your applications has never been easier or more efficient.

Benefits of using elastic clusters (Preview)

- Familiarity: It’s the same Flexible server you know and love, with the same feature set.
- Horizontal scaling: Grow your cluster by adding more Flexible servers, lifting the single-node limitation on your application's growth.
- Simplicity: Offload the complexity of sharding from your application layer to the managed service.
- Cost efficiency: You pay no more than you would for the same number of Flexible servers as nodes in your elastic cluster, with the benefit of simplified management of the fleet.
- AI ready: Modern AI applications require storage of vast amounts of vectorized data. Their data models, however, are usually simple and scale out very well with the sharding models provided by elastic clusters.

What are elastic clusters?

The elastic clusters feature of Azure Database for PostgreSQL Flexible server is a managed offering of horizontal scaling, powered by the Citus extension. Elastic clusters handle managing and configuring multiple PostgreSQL instances as a single resource, setting up the nodes, and providing cluster-level choices for high availability, disaster recovery, compute and storage, and resiliency. Nodes in your cluster reside in the same availability zone for optimal performance, and Flexible server features and capabilities (backups, metrics, and Entra ID authentication, to name just a few) become cluster aware.

Every node in the elastic cluster is capable of handling DML: you can run your read and write queries on any node of the cluster, and it will automatically fetch the data from its current placement. In many distributed systems, all queries go through a single node, which can make that node a bottleneck. With elastic clusters, however, when you connect via port 7432, the workload is automatically distributed across all nodes.

What is sharding?

Sharding is the act of splitting data into logical groups that can then be moved onto separate machines and retrieved individually. Sharding also supports queries that need data from two or more shard groups living on different nodes. Such queries are computed individually by each node and then rolled up by the initiating node before the result is returned. You can also read more about sharding models here: Sharding models - elastic clusters

Schema-based sharding

Schema-based sharding is the most intuitive way to understand how this works. Every table in a distributed schema becomes its own shard. Shards can be co-located - moved around between nodes as a group - as the sketch below illustrates.
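To make this concrete, here is a minimal sketch of distributing a schema, assuming a hypothetical tenant schema named tenant_one and a Citus version that provides citus_schema_distribute (Citus 12 and later):

-- Turn an existing schema into a distributed schema; every table in
-- tenant_one becomes part of one co-located shard group.
SELECT citus_schema_distribute('tenant_one');

-- Tables created in the distributed schema afterwards join the same shard
-- group, so the tenant's data stays together even when the group is moved.
CREATE TABLE tenant_one.orders (order_id bigint PRIMARY KEY, total numeric);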
If your application creates a schema for every tenant in the system (or for a microservice), all of that data will live together on the same node, and even if it moves around, the system will route queries to the correct node. The benefit is a pure architectural lift-and-shift for any application that already uses schema-based sharding: suddenly your legacy application can scale across many machines without a single line of code being changed. This assumes your application already uses schema-based sharding, but for database-per-tenant models the conversion to a schema-based model is easy. Follow the Tutorial: Design for microservices with elastic cluster to try schema-based sharding yourself!

Row-based sharding

Row-based sharding is different: it requires some DDL modifications and the choice of a distribution key, but with those few changes it is a very scalable method when you need high-density packing of tenants. One way to think about row-based sharding is to draw parallels to table partitioning. With partitioning you decide what column can split the data into sensible chunks that can be queried in isolation, greatly reducing the amount of data you need to weed through to get results. A good example is partitioning by year, knowing that you rarely reach into past years. You then need to make sure that your queries all use the partition key in the WHERE clause, because without it the system doesn’t know how to filter out (prune) partitions that don’t hold the required data, negating the benefit of the partitioning.

As with partitioning, row-based sharding requires determining a distribution key. Selecting a good key can be a challenge in itself: something with good cardinality (so data can be spread out), not skewed (so that data doesn’t all land in one big shard), and logically splitting your tenants (avoiding cross-tenant queries). Typical good distribution keys are tenant_id and device_id (to fan out writes), but this will heavily depend on your application. Going back to the partitioning example, year was a great partition key, but it is a terrible distribution key: it has low cardinality, and only one node would take writes for a given year, failing to spread data across multiple shards. Instead of distributing on year, use a key with better cardinality in your system; you can still leverage partitioning on top of distribution, as partitioned tables can be distributed. Having selected the key, the next step is easy. Just distribute the tables, which is as simple as calling:

SELECT create_distributed_table('accounts', 'account_id');
SELECT create_distributed_table('campaigns', 'campaign_id');

There are many table types in elastic clusters; you can read more about them here: Table types - elastic clusters

As with schema-based sharding, each tenant is assigned to a shard, but unlike schema-based sharding, a shard can be shared among tenants. As with partitioning, you can then think of a shard as a smaller PostgreSQL table; put all of them together and you would get the whole thing. In the same vein, if queries run by the application miss the distribution key in their WHERE clause, they will hit all nodes in the cluster to find the required data. In some cases such fan-out queries are desired, as a way of parallelizing query execution across all cluster nodes. In most cases, however, it is more efficient to have queries that filter down to one node, as the sketch below shows.
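Assuming the distributed accounts table from the example above (distribution key account_id), the difference between a routed query and a fan-out query looks like this:

-- Routed: the distribution key appears in the WHERE clause, so only
-- the single shard (and node) holding account 42 is touched.
SELECT * FROM accounts WHERE account_id = 42;

-- Fan-out: no distribution key filter, so every node is queried
-- in parallel and the results are rolled up by the initiating node.
SELECT count(*) FROM accounts;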
The biggest benefit of row-based sharding is how tightly tenants can be packed on the same machine. It’s as good as it gets - rows in the same table - and the drawback is the up-front work required to decide on the schema and the potential query changes in the application. Follow either tutorial, Design multitenant database with elastic cluster or Design a real-time dashboard with elastic cluster, to try out row-based sharding yourself!

When things get hot

There will be times when either the system needs room to grow (CPU, storage, connections) or a specific tenant becomes very noisy (making other tenants experience degraded performance). Elastic clusters allow you to address both.

When capacity (CPU, storage, or maximum connections) is reached, just grow your cluster by adding nodes and trigger the online rebalancer. Without blocking your existing workloads, data will be redistributed among the available nodes (including the new node) by disk size, shard count, or any strategy that you decide to implement yourself. This is as easy as connecting to the cluster and issuing:

SELECT citus_rebalance_start();

You may not need to rebalance, depending on how you sharded your data. The new node will be used for new inserts as soon as it comes online, and that may be just enough to redistribute the load on your cluster.

When one tenant throws a party, you can decide to move them out to their own node. Sometimes, though, it is easier to move the well-behaved tenants if they happen to store less data than the noisy one. With elastic clusters this can be as easy as running:

SELECT citus_schema_move('wideworldimporters','10.54.0.254', 7003);

Using elastic clusters in Azure Database for PostgreSQL Flexible server

Step 1: Create an elastic cluster
Begin by setting up a new instance of elastic cluster with Azure Database for PostgreSQL Flexible server: Azure portal Quickstart: Create elastic cluster. Select your cluster size; up to 10 nodes can be deployed during preview. Review your settings after confirming the compute and storage configuration.

Step 2: Choose a sharding model
Citus on Flexible server offers two sharding models: row-based sharding and schema-based sharding. You can learn more about sharding models here: Sharding models - elastic clusters

Step 3: Follow one of the tutorials
For row-based sharding:
- Design multitenant database with elastic cluster
- Design a real-time dashboard with elastic cluster
For schema-based sharding:
- Design for microservices with elastic cluster

Frequently Asked Questions

In what regions are elastic clusters available?
Elastic clusters are currently available in the following regions: East Asia, East US, North Europe, UK South, West US, and West US 3. Please see Limitations of elastic clusters for an up-to-date list.

Will all features of Flexible server be available on elastic clusters?
We aim for feature parity; some cluster-incompatible extensions, like TimescaleDB, may not be available. Other features may need a bit of work to become cluster aware.

Are there any limitations I should be aware of?
Please see Limitations of elastic clusters

Ready to dive in?
- Get started for free with an Azure free account
- Azure Database for PostgreSQL
- Quickstart: Create elastic cluster with Azure portal

Learn more
- Elastic clusters with PostgreSQL Flexible server
- Distributed Postgres. At any scale. - Citus Data
- citusdata/citus: Distributed PostgreSQL as an extension

Zero downtime migration from Oracle to Azure DB for PostgreSQL
Co-authored by:
- Maxim Lukiyanov, Principal PM Manager, Postgres on Azure, Microsoft
- Edward Bell, Senior Director, Global Technical Alliances & Solutions, Striim
- Matthew Burrows, Director, Databases Migration and Modernizations, Microsoft

For organizations modernizing mission-critical applications, moving Oracle workloads to Microsoft Azure is a pivotal step toward realizing the benefits of a fully managed, cloud-native data architecture. Azure Database for PostgreSQL offers built-in high availability, elastic scale, enterprise security, and integration across Azure’s analytics and AI ecosystem—all without the operational overhead of managing legacy infrastructure. The challenge many teams face is how to migrate years of operational data from Oracle into Azure Database for PostgreSQL without introducing downtime, risk, or data drift.

Microsoft and Striim solve this challenge together. Striim’s log-based Change Data Capture (CDC) continuously streams every Oracle transaction into Azure Database for PostgreSQL in real time to deliver zero-data-loss migration, continuous validation, and minimal impact on live applications. As part of the Microsoft Unlimited Database Migration Program, Striim is optimized to help organizations accelerate and de-risk their journey to Azure by combining enterprise-grade CDC technology, architectural best practices, and hands-on partner expertise. This joint solution empowers enterprises to modernize on Azure faster, with confidence, while paving the way for AI-ready architectures and long-term innovation. This tutorial walks through the architecture, deployment steps, and best practices for Oracle to Azure Database for PostgreSQL migrations using Striim.

Why Use Striim for Continuous Migration

The strategic partnership between Microsoft and Striim enables continuous data replication from existing databases into Azure in real time, enabling online migrations with zero downtime. Through this Unlimited Database Migration Program, customers gain unlimited Striim licenses to migrate as many databases as they need at no additional cost. Program highlights include:

- Zero-downtime, zero-data-loss migrations across SQL Server, Oracle, MySQL, PostgreSQL, Azure SQL, and more.
- Support for heterogeneous, mission-critical workloads across relational and non-relational systems.
- Real-time, AI-ready data pipelines, preparing workloads for analytics and ML once migrated.

With Striim, data moves continuously via log-based CDC, reading directly from Oracle transaction logs and replicating every insert, update, and delete to Azure in real time. This minimizes impact on production systems while maintaining full data consistency during migration.

Architecture Overview

The jointly recommended architecture consists of:

- Source: an existing Oracle instance (on-premises or hosted in another environment).
- Processing layer: Striim, deployed in Azure for low-latency, secure data movement.
- Target: Azure Database for PostgreSQL (Flexible Server).

Connectivity is established over the standard Oracle (1521) and PostgreSQL (5432) ports, but can also use ports configured by the customer. Azure customers can use Private Link, VNets, and other native Azure networking controls to secure traffic throughout the migration.

Step 1: Preparing the Oracle Source

Before replication can begin, configure Oracle so that redo logs are accessible and a CDC user can be used by Striim. Striim supports multiple CDC methods for Oracle:

- Oracle Reader (LogMiner): Uses Oracle’s LogMiner API to read from redo logs.
This is the default method.
- OJet: A high-performance CDC method, typically used for Oracle 21c or very high-volume workloads that generate large amounts of redo logs.
- GoldenGate trail files: If Oracle GoldenGate is already present, Striim can read from existing GoldenGate trail files rather than attaching another CDC process directly to the source database.

All options are log-based, keeping CDC overhead typically in the low single-digit percentage range. Once you have chosen a method, ensure Striim can connect to Oracle over the JDBC listener port (default 1521). If Oracle is on-premises, this typically involves VPN/ExpressRoute or an SSH tunnel/jump host from Striim to the Oracle network before you can create a CDC user in Striim.

Creating a CDC user (example)

As a privileged Oracle user, create a dedicated account for Striim and grant the required privileges for redo log access (exact grants may vary by version and CDC method). Please reference Striim’s Oracle CDC docs for the latest instructions:

CREATE USER striim_user IDENTIFIED BY "StrongPassword!";
GRANT CONNECT, RESOURCE TO striim_user;
-- Example privileges for redo/log views (adjust per security and Striim docs)
GRANT SELECT ANY TRANSACTION TO striim_user;
GRANT SELECT ON V_$LOG TO striim_user;
GRANT SELECT ON V_$LOGFILE TO striim_user;
GRANT SELECT ON V_$ARCHIVED_LOG TO striim_user;
GRANT SELECT ON V_$DATABASE TO striim_user;

Once the user is created and networking is in place, Striim can be configured to connect to Oracle using a JDBC URL such as:

jdbc:oracle:thin:@//<oracle-host>:1521/<service_name>

Step 2: Preparing Azure Database for PostgreSQL

On the target side, you must provision Azure Database for PostgreSQL and ensure schema, networking, and permissions are in place before starting data movement.

Schema creation

Prior to starting the data movement pipelines, create tables in Azure Database for PostgreSQL corresponding to those in the Oracle source. This can be done with:

- Striim’s Schema Conversion Utility (CLI): Generates SQL DDL for the target system based on Oracle metadata.
- Striim’s wizard-based schema creation: Creates target tables directly from the Striim UI during pipeline configuration.
- External tools: You can also use existing schema migration utilities or custom DDL as required.

Permissions

Create an Azure Database for PostgreSQL user for Striim with write privileges on the target tables. For example:

CREATE USER striim_user WITH PASSWORD 'strongpassword';
GRANT INSERT, UPDATE, DELETE, SELECT ON ALL TABLES IN SCHEMA public TO striim_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT INSERT, UPDATE, DELETE, SELECT ON TABLES TO striim_user;

Striim will use this user when running Database Writer components against Azure Database for PostgreSQL.

Step 3: Building the migration pipeline

A complete Oracle to Azure Database for PostgreSQL migration typically includes three coordinated stages:

1. Schema migration – Create equivalent tables and objects in Azure Database for PostgreSQL.
2. Initial load – Bulk-load historical data from Oracle into Azure.
3. Change Data Capture (CDC) – Continuously stream real-time transactions, keeping source and target in sync until production cutover.

Initial load & continuous stream

Once the schema is in place using one of the methods above, you can configure Striim to first bulk-load historical data, then switch to continuous CDC. During the initial load, Striim uses a Database Reader component to extract full tables from Oracle, and a Database Writer component to load them into Azure Database for PostgreSQL.
The Striim applications can be built through a drag-and-drop interface, a wizard, or a text-based interface, shown below. The Database Writer uses a user-defined batch policy to control how data is written. By default (e.g., EventCount:1000, Interval:60s), Striim will write to PostgreSQL when either 1,000 events have accumulated or 60 seconds have passed, whichever happens first. A conceptual example of an initial-load configuration using TQL might look like:

CREATE SOURCE OracleInitialLoadReader USING DatabaseReader (
  Username:'striim_user',
  Password:'strongpassword',
  ConnectionURL:'jdbc:oracle:thin:@//oracle-host:1521/ORCL',
  Tables:'HR.EMPLOYEES'
);

CREATE TARGET PostgreSQLInitialLoadWriter USING DatabaseWriter (
  Username:'striim_user',
  Password:'strongpassword',
  ConnectionURL:'jdbc:postgresql://pg-host:5432/mydb',
  Mode:'Insert',
  BatchPolicy:'EventCount:10000,Interval:30s'
) INPUT FROM OracleInitialLoadReader;

Continuous CDC stream

After the initial load completes, Striim switches to a CDC pipeline that applies new Oracle transactions in real time until cutover. The CDC pipeline uses an Oracle CDC reader (Oracle Reader, OJet, or GoldenGate trail reader) and a Database Writer targeting Azure Database for PostgreSQL. A CDC example in TQL could look like:

CREATE SOURCE OracleCDCReader USING OracleReader (
  Username:'striim_user',
  Password:'strongpassword',
  ConnectionURL:'jdbc:oracle:thin:@//oracle-host:1521/ORCL',
  Tables:'HR.EMPLOYEES'
);

CREATE TARGET PostgreSQLCDCWriter USING DatabaseWriter (
  Username:'striim_user',
  Password:'strongpassword',
  ConnectionURL:'jdbc:postgresql://pg-host:5432/mydb',
  Mode:'UpdateOrInsert',
  BatchPolicy:'EventCount:1000,Interval:60s'
) INPUT FROM OracleCDCReader;

To ensure zero data loss, follow the sequencing guidelines in Striim’s documentation for switching from initial load to continuous replication.

Cutover

When the Azure Database for PostgreSQL environment is fully synchronized and you are ready to move applications off Oracle:

1. Pause writes to Oracle. Temporarily stop application writes to the source.
2. Validate record counts. Compare row counts (and optionally checksums or spot checks) between Oracle and Azure Database for PostgreSQL. Striim has recently released a new tool called Validata to automate this. Please reach out to learn more.
3. Redirect application traffic to Azure Database for PostgreSQL. Update connection strings so applications now point at the Azure PostgreSQL instance.

Because Striim’s CDC keeps the source and target continuously in sync, the cutover window is typically very short, minimizing or eliminating downtime for end users.

Adding Transformations and Smart Data Pipelines

Beyond one-to-one replication, Striim allows you to enrich, transform, and validate data in flight using continuous SQL queries or custom Java processors. For example, you can append metadata for auditing to a CDC stream:

SELECT *, CURRENT_TIMESTAMP() AS event_time, OpType() AS operation
FROM OracleCDCStream;

These Smart Data Pipelines enable in-stream transformations (e.g., deduplication, validation, or enrichment) without separate ETL jobs, streamlining modernization into a single continuous flow.

Performance Expectations

In joint Striim–Microsoft testing:

- A 1 TB historical load typically completed in 4–6 hours, and can be further performance-tuned to decrease initial load time.
- CDC replication latency averages sub-second for inserts, updates, and deletes.

Performance depends on schema complexity, hardware, and network configuration.
For optimal results:

- Deploy Striim in the same Azure region as the target.
- Use private networking.
- Allocate adequate CPU and memory to handle peak CDC throughput.

Support and Enablement

The Microsoft Unlimited Database Migration Program is designed specifically to provide customers direct access to Striim’s field expertise throughout the migration process. End to end, you can expect:

- Onboarding and ongoing support, including installation kits and walkthroughs. Higher-tier service packages are available as well.
- Direct escalation paths to Striim for issue resolution and continuous assistance during migration and replication.
- Professional services and funding flexibility, such as ECIF coverage for partner engagements, cutover or weekend go-live standby, and pre-approved service blocks to simplify SOW approvals.

Together, these resources ensure migrations from Oracle to Azure DB for PostgreSQL are fully supported from initial enablement through post-cutover operations, backed by Microsoft and Striim’s combined teams. Whether your goal is a one-time migration or continuous hybrid replication, Striim’s CDC engine, combined with Azure Database for PostgreSQL, ensures every transaction lands with integrity. Start your modernization journey today by connecting with Striim directly through your Microsoft representative.

End-to-end workload observability with Query Store for primary and replicas
Query performance doesn’t stop at the primary

Most PostgreSQL architectures don’t run on a single node anymore. Reads get offloaded. Replica chains grow. And when performance issues hit, the hardest part is often simple: where did the queries actually run? With the latest query store capabilities in Azure Database for PostgreSQL flexible server, you can now capture the workload executed not just on the primary, but also on read replicas—including cascading read replicas—and export the captured runtime stats, wait stats, and query text into Azure Monitor Logs (Log Analytics workspace / LAWS).

- See the real hotspot: isolate which node (primary vs replica) is slow.
- Know why: break down time by waits (CPU, I/O, locks) per query.
- Connect the dots: correlate query IDs to query text, and inspect sampled parameters locally in azure_sys on the primary when you need input context (parameters aren’t exported to LAWS).
- Centralize analysis: query everything with KQL in LAWS, across servers.

What you’ll build

This post walks through a reproducible demo that provisions a primary server, a read replica, and a cascading read replica, then runs a TPC-H–based workload across all three to generate query store data you can analyze locally and in Log Analytics.

- Enable query store capture (including query text) and parameter sampling for parameterized queries.
- Enable wait sampling so query store can record wait statistics.
- Export runtime stats, wait stats, and SQL text to LAWS using resource-specific tables.
- Validate capture on read replicas and cascading read replicas (not just the primary).

Prerequisites

- Azure CLI logged in (az login) and permission to create a resource group, Log Analytics workspace, and PostgreSQL flexible servers.
- psql and curl available on your machine.
- PostgreSQL flexible server on the General Purpose or Memory Optimized tier (query store and replicas aren’t supported on Burstable).
- PostgreSQL 14+ to test out cascading replicas.
- Networking: the script opens firewall access broadly for demos—tighten for production.

Architecture (primary + replica chain + LAWS)

You’ll deploy four resources:

- Primary server: read/write node.
- Read replica (level 1): read-only node created from the primary.
- Cascading read replica (level 2): read-only node created from replica level 1.
- Log Analytics workspace (LAWS): central place to query Query Store telemetry across all nodes.

If diagnostic settings are properly configured, each server streams query store telemetry to LAWS—but how it’s kept locally differs by role. On the primary, query store data is recorded in memory, then persisted locally in the azure_sys database, and then exported to LAWS. On read replicas (including cascading replicas), query store data is recorded in memory only and then exported to LAWS. Bottom line: use LAWS for fleet-wide visibility, and use the primary’s azure_sys when you need deep local inspection (like parameter samples).

Deploy the demo environment

The fastest way to reproduce the scenario is to run the end-to-end bash script, which you can download from https://raw.githubusercontent.com/Azure-Samples/azure-postgresql-query-store/refs/heads/main/may2026/script/query_store_demo.sh

Save the file to a local directory in your Linux shell, and name the file query_store_demo.sh, as shown below.
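For example, downloading with curl and making the script executable (the URL is the one given above):

curl -fsSL -o query_store_demo.sh \
  "https://raw.githubusercontent.com/Azure-Samples/azure-postgresql-query-store/refs/heads/main/may2026/script/query_store_demo.sh"
chmod +x query_store_demo.sh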
To invoke the script, at minimum you must assign a string password for the administrator login of the flexible server instances it creates, and invoke the script like this:

ADMIN_PASSWORD=<Your_Strong_Password> ./query_store_demo.sh

Optionally, you can also override the default values of other environment variables used by the script:

- SUBSCRIPTION_ID: Azure subscription ID to use (default: current default subscription)
- BASE_NAME: Base name for all resources, used in naming servers, resource groups, etc. (default: pgqswait{YYYYMMDDHHMMSS})
- RESOURCE_GROUP: Azure resource group name (default: rg-{BASE_NAME})
- LOCATION: Azure region for resources (default: southeastasia)
- PRIMARY_SERVER: Name of primary PostgreSQL server (default: {BASE_NAME}-primary)
- REPLICA_1: Name of first-level read replica (default: {BASE_NAME}-readreplica)
- REPLICA_2: Name of second-level cascading read replica (default: {BASE_NAME}-cascadereadreplica)
- LOG_ANALYTICS_WORKSPACE: Log Analytics workspace name (default: law-{BASE_NAME})
- LOG_ANALYTICS_LOCATION: Azure region for Log Analytics workspace (default: southeastasia)
- ADMIN_USER: PostgreSQL admin username (default: pgadmin)
- ADMIN_PASSWORD: PostgreSQL admin password (REQUIRED)
- SKU_NAME: PostgreSQL server SKU, compute tier (default: Standard_D4ds_v5)
- TIER: PostgreSQL pricing tier (default: GeneralPurpose)
- STORAGE_SIZE: Storage size in GB (default: 64)
- VERSION: PostgreSQL version; minimum 14 for cascading replicas (default: 17)
- PRIMARY_DATABASE: Initial database name (default: postgres)
- SQL_BASE_URL: Base URL for downloading SQL scripts (default: https://raw.githubusercontent.com/Azure-Samples/azure-postgresql-query-store/refs/heads/main/may2026/script/query_store_demo.sh)
- TPCH_DDL_URL: URL for TPC-H schema DDL file (default: {SQL_BASE_URL}/schema/tpch_ddl.sql)
- WORKLOAD_REPETITIONS: Number of times to execute each workload query, minimum 5 (default: 10)
- AUTO_APPROVE: Skip confirmation prompt and proceed automatically (default: false)

If, for example, you want to not only pass the ADMIN_PASSWORD but also override the LOCATION, you could do it like this:

ADMIN_PASSWORD=<Your_Strong_Password> LOCATION=canadacentral ./query_store_demo.sh

In a bit over 1 hour, the script performs the following steps:

Step 1 — Provision first part of the infrastructure

The infrastructure provisioned in this phase consists of:
- A resource group in which all resources are deployed.
- An instance of Log Analytics workspace, where all flexible server instances will send their query store related logs.
- A primary (read-write) flexible server.

Step 2 — Configure primary server

Now it's time to configure one new server parameter on your primary server so that query store emits query text to LAWS, letting us correlate query IDs to something recognizable. Query IDs are great for aggregation—but you still need the SQL. Turn on query text emission so you can correlate runtime and waits back to the actual statement text. Do this by setting pg_qs.emit_query_text to on (see the CLI sketch after Step 3). Refer to our documentation to learn how to set the value of a server parameter.

Step 3 — Provision second part of the infrastructure

The infrastructure provisioned in this phase consists of:
- A read replica (read-only) whose source is the primary server.
- A cascading read replica (read-only) whose source is the previously created read replica.

Notice that when read replicas are created, they inherit the server parameter values from their source server. Because we have already configured the query store related settings on the primary server, the intermediate read replica inherits its server parameters from that primary, and the cascading read replica inherits them from the intermediate replica.
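Because replicas inherit parameters from their source, making the Step 2 change on the primary before the replicas are created is sufficient. A minimal sketch using the Azure CLI (substitute your own resource group and server names):

az postgres flexible-server parameter set \
  --resource-group <RESOURCE_GROUP> \
  --server-name <PRIMARY_SERVER> \
  --name pg_qs.emit_query_text \
  --value on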
Step 4 — Export query store to Log Analytics (LAWS)

Now for the payoff: we want to stream the data to Log Analytics so you can query across nodes, build dashboards, and alert. The script configures diagnostic settings on the primary and both replicas to send logs to a Log Analytics workspace using resource-specific tables. This is the key to cross-node visibility: each server exports its own captured telemetry, and you can slice by resource in a single KQL query.

- Query store runtime stats: execution counts, elapsed time, and other performance counters.
- Query store wait stats: wait breakdown attributed to queries.
- Query store SQL text: query text to decode query IDs.

Note: Query store parameter samples are not included in the Log Analytics export. Parameters are stored locally per server in azure_sys, and on read replicas azure_sys is read-only—so don’t depend on replicas for parameter inspection. LAWS receives runtime stats, wait stats, and query text.

Diagnostic settings for an instance of flexible server can be configured via the portal. In the resource menu, under Monitoring, select Diagnostic settings. Add a new diagnostic setting, select a destination Log Analytics workspace and the individual log categories which you want to stream to that LAWS, and save the changes.

For Destination table, it's highly recommended to use Resource specific (one table per signal with a proper schema) over Azure diagnostics (the legacy one-table-for-everything option). With Azure diagnostics, all logs from all resource types land in a single table (AzureDiagnostics). It's a wide table with many columns, and new columns get added as services emit new fields. If the 500-column limit is hit, extra fields go into the AdditionalFields column (a dynamic JSON); querying on attributes stored in that column can have a huge performance and query cost impact. The schema is inconsistent and difficult to discover, and you must always filter events in that table by ResourceType and Category.

With Resource specific, on the other hand, logs are written to separate tables per resource type and category. Each table has a well-defined schema and columns are strongly typed. Tables are smaller and faster to query, and queries on these tables are simpler and don't need filtering by ResourceType and Category. Resource-specific tables also support faster ingestion and faster querying, let you select different table plans and retention settings for each table, and, more importantly, role-based access control (RBAC) permissions can be applied at the table level, allowing you to control access to telemetry in a more granular way.

This can also be configured using the Azure CLI command az monitor diagnostic-settings create. Make sure that the --export-to-resource-specific parameter is set to true, which is the equivalent of selecting Resource specific for Destination table in the portal UI. Setting this parameter to false would mean that you want to use AzureDiagnostics as the destination table, which we don't recommend.

Step 5 — Run some workload

In this phase the script loads a TPC-H schema and executes workload SQL across the different nodes so that you can prove replica capture.

Query it in Log Analytics

Once the workload has completed and the data has streamed to Log Analytics, you can open your Log Analytics workspace and start querying the relevant tables.
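As a quick sanity check that telemetry is arriving from all three nodes, you can run a short KQL query against the resource-specific runtime table used throughout this post:

PGSQLQueryStoreRuntime
| where TimeGenerated > ago(1h)
| summarize events = count() by LogicalServerName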
If you don't know how to start issuing queries in a Log Analytics workspace, refer to Get started with log queries in Azure Monitor Logs. In your Log Analytics workspace, when you select Logs in the resource menu, you can access the Queries hub. By default, it should open automatically unless you have configured it not to show, in which case you can open it by selecting Queries hub in the top right corner of the Logs home screen. If you add a filter in the Queries hub for Resource type equals Azure Database for PostgreSQL Flexible Server, you'll be able to access multiple examples of queries which might help you get started querying the log categories we support for our service. You can run any of them by selecting Run on the summarization card that describes the query or, if you hover the mouse over the card, you can select Load to editor so that the query is copied over to the active query window, where you can run it or modify it further.

Below are a few more query examples that can be useful for analyzing the workload executed in this experiment.

Top queries by total time (across all nodes)

Get the 10 queries with the highest total duration among those that ran on any of the three nodes.

KQL
PGSQLQueryStoreRuntime
| summarize total_time_ms = sum(TotalExecDurationMs) by QueryId, LogicalServerName
| top 10 by total_time_ms desc

Results might differ slightly on each execution of the experiment.

Where queries wait on each node

List the most frequent wait events observed for user-initiated queries across all nodes.

KQL
PGSQLQueryStoreWaits
| join kind=inner (PGSQLQueryStoreRuntime) on QueryId
| summarize total_waits_sampled = sum(Calls) by Event, EventType, LogicalServerName
| order by total_waits_sampled desc

Results might differ slightly on each execution of the experiment.

Decode query IDs (join runtime stats with SQL text)

List the top 20 SELECT queries by total execution duration, joined with their SQL text so query IDs become recognizable statements.

KQL
PGSQLQueryStoreRuntime
| join kind=inner (PGSQLQueryStoreQueryText) on QueryId
| where QueryType == 'select'
| project LogicalServerName, QueryId, TotalExecDurationMs, QueryText
| top 20 by TotalExecDurationMs desc

Results might differ slightly on each execution of the experiment.

Compare primary vs replicas (workload distribution)

Find the total number of query executions, and the accumulated duration of all those executions, for each node.

KQL
PGSQLQueryStoreRuntime
| summarize execs = sum(Calls), total_time_ms = sum(TotalExecDurationMs) by LogicalServerName
| order by total_time_ms desc

Results might differ slightly on each execution of the experiment.

Replica-only hotspots (find what’s slow off the primary)

Find the top 10 executed queries by aggregated duration, focusing on what was executed on the read replicas only.

KQL
let Replicas = dynamic(["pgqswait20260505220501-readreplica", "pgqswait20260505220501-cascadereadreplica"]);
PGSQLQueryStoreRuntime
| where LogicalServerName in (Replicas)
| summarize total_time_ms = sum(TotalExecDurationMs) by QueryId, LogicalServerName
| top 10 by total_time_ms desc

Results might differ slightly on each execution of the experiment.

QPI now supports query store stats collected on replicas

You can now use Query Performance Insight workbooks to analyze query store information not only on your primary server, as before, but also on your read replicas.
Why replica workload capture is a big deal

This is the unlock: you can now answer performance questions in replica-heavy architectures without stitching together partial signals.

- Per-node truth: see the slow queries on the node where they actually ran (primary vs replica vs cascading replica).
- Faster root cause: runtime + waits gives you “slow” and “why” in one place.
- Replica tuning that sticks: identify replica-specific bottlenecks (I/O saturation, lock waits, CPU pressure) and tune with evidence.
- Centralized observability: export to LAWS so you can build dashboards, alerts, and cross-server comparisons with KQL.
- Unlocked query visibility: access query text without database permissions.
- Fine-grained control over who can view query text: using resource-specific tables in LAWS, you can decide which users can access the table in which the text of the queries is kept.
- Parameter-aware debugging: sampled parameters can help reproduce issues and explain plan changes, but they’re stored locally in azure_sys and not exported to LAWS. In practice, rely on the primary for parameter inspection (replicas have a read-only azure_sys).

Operational notes (quick but important)

- Expect a delay: query store stats and LAWS ingestion aren’t instant. Give it a few minutes after running a workload.
- Mind retention: query store retention and Log Analytics retention are separate knobs. Tune them to balance troubleshooting value and cost.
- Production hygiene: don’t use wide-open firewall rules outside of a demo.

Clean up

When you’re done, delete the resource group:

az group delete --name <RESOURCE_GROUP> --yes --no-wait

Bottom line

Query store in Azure Database for PostgreSQL flexible server now matches how modern architectures run—across the primary, read replicas, and cascading replicas—and LAWS gives you a single place to query, compare, and act.

You don't want to miss POSETTE: An Event for Postgres 2026 (T-6 weeks)
PostgreSQL has quietly become one of the most important pieces of modern application infrastructure — and in 2026, that momentum is just impossible to ignore. From AI-powered applications and real-time analytics to globally distributed systems and mission-critical workloads, PostgreSQL is no longer “just” a relational database. It has evolved into a versatile, extensible platform that sits at the center of how modern systems are built. POSETTE: An Event for Postgres 2026 is where that evolution comes into focus; please visit the conference’s site to register and add the event to your calendar!

This year’s conference is not about chasing hype or showcasing isolated features. It’s about how PostgreSQL is actually being built, operated, extended, and trusted in production — by the people who write the code, run it at scale, and depend on it every day. This post kicks off our POSETTE 2026 blog series. It introduces the major themes shaping the conference and sets the stage for deeper technical dives in the weeks ahead.

PostgreSQL as the “Everything Database”

One idea shows up again and again across the POSETTE 2026 program: PostgreSQL is becoming the everything database. Once viewed primarily as an OLTP engine, PostgreSQL now supports a surprising range of workloads and access patterns:

- Rich document models with JSON and JSONB
- Full-text search and fuzzy matching
- Time-series and event-driven data
- Vector embeddings for AI and semantic search
- Graph traversal and property graph queries
- Analytical workloads through extensions and hybrid execution engines

Instead of spreading data across specialized systems, many teams are choosing a different path: keep data closer together, reduce architectural sprawl, and let PostgreSQL adapt through extensions, planner improvements, and new query capabilities. At POSETTE 2026, you’ll see this evolution in practice — from Postgres acting as a lakehouse-style analytics engine, to SQL-native graph queries, to design patterns that simplify applications by leaning into Postgres’ breadth rather than working around it. One of the upcoming posts in this series will explore how far PostgreSQL can realistically go, where it shines, and where drawing clear boundaries still matters.

PostgreSQL Meets AI: Agents, RAG, and Retrieval at Scale

AI is no longer an experiment bolted onto databases after the fact. It’s changing how applications retrieve, interpret, and reason over data. A core theme at POSETTE 2026 is PostgreSQL’s role in AI-driven systems, especially Retrieval-Augmented Generation (RAG) and agent-based workflows. Sessions throughout the conference dig into questions teams are actively struggling with today:

- How do you use PostgreSQL as a reliable retrieval backbone for AI systems?
- How do relational data, vectors, and graphs work together to provide better context?
- How do you expose databases to LLMs safely, without creating operational or security risks?
- How do you prevent accidental writes, unbounded queries, or runaway costs?

The pattern emerging across these talks is clear: production-grade AI depends far more on disciplined data access, safeguards, and observability than on the model itself. PostgreSQL’s extensibility, strong consistency, and transactional guarantees make it a compelling foundation when AI systems move beyond demos into real workloads. We’ll dedicate a full blog post in this series to PostgreSQL, MCP, RAG, and agent architectures, unpacking what works today and what’s coming next.
Performance, Elasticity, and the Reality of Scale

Scaling PostgreSQL isn’t just about adding more CPU. Modern workloads are bursty and unpredictable. APIs spike without warning. Background jobs collide with user traffic. Analytics compete with OLTP. Cloud pricing penalizes both over-provisioning and reactive resizing. That’s why POSETTE 2026 puts a strong emphasis on performance realism — the practical realities teams face once systems leave the comfort zone of steady-state benchmarks. Across multiple sessions, you’ll see a deep focus on topics such as:

- Why storage, not CPU, is often the real bottleneck
- How planner costs, vacuum behavior, and maintenance evolve in PostgreSQL 18
- What changes — and what doesn’t — when moving from fixed capacity to elastic compute
- How to reason about distributed scaling without treating it as a magic switch

These are not theoretical discussions. They reflect real production systems under sustained load, and the trade-offs teams make when performance and reliability actually matter. Future posts in this series will break down common Postgres performance myths, highlight meaningful changes in recent releases, and share lessons learned from running PostgreSQL under pressure.

Inside PostgreSQL: Internals, Evolution, and What’s Next

POSETTE has always maintained a close connection to the PostgreSQL core community, and 2026 continues that tradition. This year’s agenda includes deep dives into:

- WAL, logical replication, and change data capture
- Major features landing in PostgreSQL 18, with early signals for 19
- The engineering trade-offs behind planner, executor, and I/O improvements
- Practical guidance for contributors getting started with PostgreSQL hacking

These sessions offer a rare look behind the curtain — not just at what PostgreSQL does, but why it behaves the way it does, and how decisions make it from patch review to production release. If you’ve ever wondered how PostgreSQL keeps evolving at scale without losing its stability or community trust, POSETTE 2026 is where those answers take shape.

Security and Operations: Running PostgreSQL Like It Matters

As PostgreSQL adoption grows, expectations shift. “Works on my machine” isn’t enough. Databases must be secure by default, observable in production, and resilient when things go wrong. POSETTE 2026 includes sessions aimed squarely at operators, architects, and platform teams responsible for production systems, covering topics such as:

- PostgreSQL authentication and authorization models
- Designing least-privilege schemas and safer defaults
- Managing large tables without accumulating hidden risk
- Avoiding subtle operational pitfalls that only appear at scale

These are the talks that surface lessons learned the hard way — and help teams avoid repeating them.

More Than a Conference: A PostgreSQL Community Event

What ultimately sets POSETTE apart isn’t just the agenda. It’s the people. POSETTE 2026 brings together PostgreSQL contributors, extension authors, cloud engineers, platform teams, and practitioners from across the ecosystem. It’s a space to learn from each other, challenge assumptions, and reconnect with why PostgreSQL continues to thrive after decades of evolution. And this blog series is only just beginning. In the coming weeks, we’ll go deeper into individual themes, highlight standout sessions, and explore the ideas shaping PostgreSQL’s next chapter.
If PostgreSQL matters to your work — or if it’s becoming central faster than you expected — POSETTE: An Event for Postgres 2026 is the event you don’t want to miss. Go ahead and register!

Up next: Why PostgreSQL Is Becoming the “Everything Database.”

When PostgreSQL v17 Chooses the Wrong Plan: A Deep Dive into CTEs with Data Skew
Common Table Expressions (CTEs) provide a temporary named result set that can be referenced within a single complex SQL statement. CTEs offer great flexibility in breaking down complex SQL queries into multiple readable parts, and they enable recursion. CTEs can also be materialized, which lets users reference the result set multiple times.

PostgreSQL 17 propagates column statistics from MATERIALIZED CTEs into the parent query, allowing the planner to estimate hash aggregation and join cardinalities more accurately. However, when queries combine highly skewed data distributions with runtime randomness (for example, ORDER BY random() LIMIT 1), this increased planner confidence can sometimes lead to execution plans that are correct in theory but catastrophic for specific runtime values.

What is data skew and how does it affect plans?

Data skew is an uneven distribution of values in a column, where one or a few values occur far more often than the rest. To choose an efficient execution plan, PostgreSQL’s planner estimates row counts using statistics such as n_distinct, histograms, and most-common-values (MCV) frequencies stored in pg_statistic. If these estimates are off—especially if selectivity is overestimated—query performance can degrade significantly.

Baseline Query and Observed Planner Behavior

To gauge the performance impact of data skew on PostgreSQL 17, we generated data into three tables: accounts, purchases, and audit_logs. This baseline query selects a single random HIGH_RISK account and retrieves matching rows from the large audit_logs table via two materialized CTEs.

Original Query

with selected_account as materialized (
    select account_id
    from accounts
    where account_type = 'HIGH_RISK'
    order by random()
    limit 1
),
audit_purchases as materialized (
    select distinct p.account_id
    from purchases p
    join selected_account sa on p.account_id = sa.account_id
)
select al.*
from audit_logs al
where exists (
    select 1
    from audit_purchases ap
    where al.account_id = ap.account_id
);

PostgreSQL 14 Execution Plan

PostgreSQL 14 treats materialized CTEs as optimization fences and does not propagate detailed statistics into the parent query. As a result, the planner assumes relatively small CTE result sets and consistently favors nested loop joins with parameterized index scans. While these estimates are not strictly accurate, this conservative planning approach prevents full-table scans and avoids catastrophic performance under extreme data skew. In this workload, PostgreSQL 14 completes the query in approximately 7 ms.
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1412.94..212757485.16 rows=1500600064 width=27) (actual time=5.226..7.192 rows=30 loops=1)
   Buffers: shared hit=277 read=13
   CTE selected_account
     ->  Limit  (cost=971.00..971.00 rows=1 width=12) (actual time=4.460..4.460 rows=1 loops=1)
           Buffers: shared hit=271
           ->  Sort  (cost=971.00..996.00 rows=10000 width=12) (actual time=4.459..4.459 rows=1 loops=1)
                 Sort Key: (random())
                 Sort Method: top-N heapsort  Memory: 25kB
                 Buffers: shared hit=271
                 ->  Seq Scan on accounts  (cost=0.00..921.00 rows=10000 width=12) (actual time=0.011..3.533 rows=10000 loops=1)
                       Filter: (account_type = 'HIGH_RISK'::text)
                       Rows Removed by Filter: 40000
                       Buffers: shared hit=271
   CTE audit_purchases
     ->  HashAggregate  (cost=317.44..355.57 rows=3813 width=4) (actual time=4.735..4.741 rows=1 loops=1)
           Group Key: p.account_id
           Batches: 1  Memory Usage: 217kB
           Buffers: shared hit=274 read=1
           ->  Nested Loop  (cost=0.44..292.52 rows=9966 width=4) (actual time=4.722..4.726 rows=10 loops=1)
                 Buffers: shared hit=274 read=1
                 ->  CTE Scan on selected_account sa  (cost=0.00..0.02 rows=1 width=4) (actual time=4.461..4.461 rows=1 loops=1)
                       Buffers: shared hit=271
                 ->  Index Only Scan using idx_purchases_account on purchases p  (cost=0.44..192.84 rows=9966 width=4) (actual time=0.260..0.262 rows=10 loops=1)
                       Index Cond: (account_id = sa.account_id)
                       Heap Fetches: 0
                       Buffers: shared hit=3 read=1
   ->  HashAggregate  (cost=85.79..87.79 rows=200 width=4) (actual time=4.743..4.744 rows=1 loops=1)
         Group Key: ap.account_id
         Batches: 1  Memory Usage: 40kB
         Buffers: shared hit=274 read=1
         ->  CTE Scan on audit_purchases ap  (cost=0.00..76.26 rows=3813 width=4) (actual time=4.737..4.742 rows=1 loops=1)
               Buffers: shared hit=274 read=1
   ->  Index Scan using idx_account_audit on audit_logs al  (cost=0.58..817780.34 rows=24600001 width=27) (actual time=0.480..2.438 rows=30 loops=1)
         Index Cond: (account_id = ap.account_id)
         Buffers: shared hit=3 read=12
 Planning Time: 1.121 ms
 Execution Time: 7.253 ms
(37 rows)

PostgreSQL 17 Execution Plan

PostgreSQL 17 correctly estimates that the audit_purchases CTE may contain thousands of rows and that joining it with audit_logs could produce tens of millions of matches. Based on these estimates, the planner selects a Hash Semi Join with a Sequential Scan on audit_logs, which is optimal when many rows are expected to match. However, at runtime the selected account produces only ~30 matching audit rows. Despite this, PostgreSQL must execute the chosen plan fully, scanning millions of rows from disk. This results in an execution time of 521,625 ms (~8.7 minutes). Importantly, this behavior is not a cost estimation bug—the estimates are statistically correct. The failure occurs because the PostgreSQL 17 planner estimates a large matching result set based on the cardinality of values produced by the audit_purchases CTE, and consequently chooses a plan optimized for that estimated volume rather than for the single runtime-selected account.
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=1691.60..93353890.16 rows=3001396992 width=26) (actual time=86577.858..521625.286 rows=30 loops=1)
   Hash Cond: (al.account_id = ap.account_id)
   Buffers: shared hit=28879 read=22040419
   CTE selected_account
     ->  Limit  (cost=971.00..971.00 rows=1 width=12) (actual time=4.486..4.487 rows=1 loops=1)
           Buffers: shared hit=274
           ->  Sort  (cost=971.00..996.00 rows=10000 width=12) (actual time=4.485..4.486 rows=1 loops=1)
                 Sort Key: (random())
                 Sort Method: top-N heapsort  Memory: 25kB
                 Buffers: shared hit=274
                 ->  Seq Scan on accounts  (cost=0.00..921.00 rows=10000 width=12) (actual time=0.021..3.551 rows=10000 loops=1)
                       Filter: (account_type = 'HIGH_RISK'::text)
                       Rows Removed by Filter: 40000
                       Buffers: shared hit=271
   CTE audit_purchases
     ->  HashAggregate  (cost=658.72..673.28 rows=1456 width=4) (actual time=4.907..4.912 rows=1 loops=1)
           Group Key: p.account_id
           Batches: 1  Memory Usage: 73kB
           Buffers: shared hit=277 read=1
           ->  Nested Loop  (cost=0.44..606.86 rows=20742 width=4) (actual time=4.898..4.902 rows=10 loops=1)
                 Buffers: shared hit=277 read=1
                 ->  CTE Scan on selected_account sa  (cost=0.00..0.02 rows=1 width=4) (actual time=4.487..4.487 rows=1 loops=1)
                       Buffers: shared hit=274
                 ->  Index Only Scan using idx_purchases_account on purchases p  (cost=0.44..399.42 rows=20742 width=4) (actual time=0.410..0.411 rows=10 loops=1)
                       Index Cond: (account_id = sa.account_id)
                       Heap Fetches: 0
                       Buffers: shared hit=3 read=1
   ->  Seq Scan on audit_logs al  (cost=0.00..52082989.92 rows=3001396992 width=26) (actual time=0.012..286755.571 rows=3001400050 loops=1)
         Buffers: shared hit=28602 read=22040418
   ->  Hash  (cost=29.12..29.12 rows=1456 width=4) (actual time=4.919..4.920 rows=1 loops=1)
         Buckets: 2048  Batches: 1  Memory Usage: 17kB
         Buffers: shared hit=277 read=1
         ->  CTE Scan on audit_purchases ap  (cost=0.00..29.12 rows=1456 width=4) (actual time=4.908..4.911 rows=1 loops=1)
               Buffers: shared hit=277 read=1
 Planning:
   Buffers: shared hit=197
 Planning Time: 1.831 ms
 Execution Time: 521625.433 ms
(38 rows)

Forced nested loop

Disabling hash joins at the session level forces PostgreSQL 17 to revert to a nested loop plan using index scans. This diagnostic step completes in ~8 ms, confirming that index-based execution is sufficient for the actual runtime workload. However, disabling planner features is not suitable for production use and serves only to validate the root cause.
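A minimal sketch of that diagnostic, scoped to the current session (re-run the baseline query between the two statements):

SET enable_hashjoin = off;
-- EXPLAIN (ANALYZE, BUFFERS) followed by the baseline query from above
RESET enable_hashjoin;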
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1677.62..3361930001.36 rows=3001396992 width=26) (actual time=5.513..8.194 rows=30 loops=1)
   Buffers: shared hit=277 read=12
   CTE selected_account
     ->  Limit  (cost=971.00..971.00 rows=1 width=12) (actual time=4.524..4.524 rows=1 loops=1)
           Buffers: shared hit=271
           ->  Sort  (cost=971.00..996.00 rows=10000 width=12) (actual time=4.523..4.523 rows=1 loops=1)
                 Sort Key: (random())
                 Sort Method: top-N heapsort  Memory: 25kB
                 Buffers: shared hit=271
                 ->  Seq Scan on accounts  (cost=0.00..921.00 rows=10000 width=12) (actual time=0.014..3.608 rows=10000 loops=1)
                       Filter: (account_type = 'HIGH_RISK'::text)
                       Rows Removed by Filter: 40000
                       Buffers: shared hit=271
   CTE audit_purchases
     ->  HashAggregate  (cost=658.72..673.28 rows=1456 width=4) (actual time=4.890..4.894 rows=1 loops=1)
           Group Key: p.account_id
           Batches: 1  Memory Usage: 73kB
           Buffers: shared hit=274 read=1
           ->  Nested Loop  (cost=0.44..606.86 rows=20742 width=4) (actual time=4.882..4.885 rows=10 loops=1)
                 Buffers: shared hit=274 read=1
                 ->  CTE Scan on selected_account sa  (cost=0.00..0.02 rows=1 width=4) (actual time=4.525..4.525 rows=1 loops=1)
                       Buffers: shared hit=271
                 ->  Index Only Scan using idx_purchases_account on purchases p  (cost=0.44..399.42 rows=20742 width=4) (actual time=0.354..0.356 rows=10 loops=1)
                       Index Cond: (account_id = sa.account_id)
                       Heap Fetches: 0
                       Buffers: shared hit=3 read=1
   ->  HashAggregate  (cost=32.76..47.32 rows=1456 width=4) (actual time=4.896..4.899 rows=1 loops=1)
         Group Key: ap.account_id
         Batches: 1  Memory Usage: 73kB
         Buffers: shared hit=274 read=1
         ->  CTE Scan on audit_purchases ap  (cost=0.00..29.12 rows=1456 width=4) (actual time=4.892..4.895 rows=1 loops=1)
               Buffers: shared hit=274 read=1
   ->  Index Scan using idx_account_audit on audit_logs al  (cost=0.58..1782455.82 rows=52656088 width=26) (actual time=0.615..3.283 rows=30 loops=1)
         Index Cond: (account_id = ap.account_id)
         Buffers: shared hit=3 read=11
 Planning Time: 0.184 ms
 Execution Time: 8.252 ms
(37 rows)

Fix Strategies

While the fix isn’t straightforward, we can steer the query plan through query rewrites using the following methods:

- Lateral join with OFFSET 0
- LIMIT on purchases
- Using subqueries

Lateral Join with Offset 0

Adding OFFSET 0 inside a LATERAL subquery introduces an optimizer barrier that prevents join reordering. This forces PostgreSQL to execute a parameterized nested loop, passing the runtime account ID into an index scan on audit_logs. With this rewrite, PostgreSQL 17 executes the query in 6.519 ms, a 99.9988% improvement over the original plan.
with selected_account as materialized (
    select account_id
    from accounts
    where account_type = 'HIGH_RISK'
    order by random()
    limit 1
),
audit_purchases as materialized (
    select p.account_id
    from purchases p
    join selected_account sa on p.account_id = sa.account_id
)
select al.*
from audit_purchases ap
join lateral (
    select *
    from audit_logs al
    where al.account_id = ap.account_id
    offset 0
) al on true;

Execution plan

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1578.44..42162640233.98 rows=1092192577296 width=26) (actual time=4.980..6.481 rows=300 loops=1)
   Buffers: shared hit=403 read=12
   CTE selected_account
     ->  Limit  (cost=971.00..971.00 rows=1 width=12) (actual time=4.296..4.297 rows=1 loops=1)
           Buffers: shared hit=271
           ->  Sort  (cost=971.00..996.00 rows=10000 width=12) (actual time=4.295..4.296 rows=1 loops=1)
                 Sort Key: (random())
                 Sort Method: top-N heapsort  Memory: 25kB
                 Buffers: shared hit=271
                 ->  Seq Scan on accounts  (cost=0.00..921.00 rows=10000 width=12) (actual time=0.013..3.357 rows=10000 loops=1)
                       Filter: (account_type = 'HIGH_RISK'::text)
                       Rows Removed by Filter: 40000
                       Buffers: shared hit=271
   CTE audit_purchases
     ->  Nested Loop  (cost=0.44..606.86 rows=20742 width=4) (actual time=4.585..4.588 rows=10 loops=1)
           Buffers: shared hit=274 read=1
           ->  CTE Scan on selected_account sa  (cost=0.00..0.02 rows=1 width=4) (actual time=4.298..4.298 rows=1 loops=1)
                 Buffers: shared hit=271
           ->  Index Only Scan using idx_purchases_account on purchases p  (cost=0.44..399.42 rows=20742 width=4) (actual time=0.286..0.287 rows=10 loops=1)
                 Index Cond: (account_id = sa.account_id)
                 Heap Fetches: 0
                 Buffers: shared hit=3 read=1
   ->  CTE Scan on audit_purchases ap  (cost=0.00..414.84 rows=20742 width=4) (actual time=4.587..4.591 rows=10 loops=1)
         Buffers: shared hit=274 read=1
   ->  Index Scan using idx_account_audit on audit_logs al  (cost=0.58..1506157.19 rows=52656088 width=26) (actual time=0.040..0.185 rows=30 loops=10)
         Index Cond: (account_id = ap.account_id)
         Buffers: shared hit=129 read=11
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.238 ms
 Execution Time: 6.519 ms
(31 rows)

Limit on Purchases

Applying LIMIT 1 to the audit_purchases CTE bounds its cardinality to a single row. This makes nested loops cheaper than hash joins during planning. PostgreSQL 17 selects an index-driven execution plan and completes in 7.742 ms, yielding a 99.9985% improvement.
with selected_account as materialized (
    select account_id
    from accounts
    where account_type = 'HIGH_RISK'
    order by random()
    limit 1
),
audit_purchases as materialized (
    select p.account_id
    from purchases p
    join selected_account sa on p.account_id = sa.account_id
    limit 1
)
select al.*
from audit_logs al
where exists (select 1 from audit_purchases ap where al.account_id = ap.account_id);

Execution plan

                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=972.07..933697.99 rows=24600001 width=27) (actual time=5.693..7.686 rows=30 loops=1)
   Buffers: shared hit=276 read=13
   CTE selected_account
     ->  Limit  (cost=971.00..971.00 rows=1 width=12) (actual time=4.816..4.817 rows=1 loops=1)
           Buffers: shared hit=271
           ->  Sort  (cost=971.00..996.00 rows=10000 width=12) (actual time=4.816..4.817 rows=1 loops=1)
                 Sort Key: (random())
                 Sort Method: top-N heapsort  Memory: 25kB
                 Buffers: shared hit=271
                 ->  Seq Scan on accounts  (cost=0.00..921.00 rows=10000 width=12) (actual time=0.014..3.825 rows=10000 loops=1)
                       Filter: (account_type = 'HIGH_RISK'::text)
                       Rows Removed by Filter: 40000
                       Buffers: shared hit=271
   CTE audit_purchases
     ->  Limit  (cost=0.44..0.47 rows=1 width=4) (actual time=5.133..5.134 rows=1 loops=1)
           Buffers: shared hit=274 read=1
           ->  Nested Loop  (cost=0.44..292.52 rows=9966 width=4) (actual time=5.132..5.132 rows=1 loops=1)
                 Buffers: shared hit=274 read=1
                 ->  CTE Scan on selected_account sa  (cost=0.00..0.02 rows=1 width=4) (actual time=4.817..4.817 rows=1 loops=1)
                       Buffers: shared hit=271
                 ->  Index Only Scan using idx_purchases_account on purchases p  (cost=0.44..192.84 rows=9966 width=4) (actual time=0.313..0.313 rows=1 loops=1)
                       Index Cond: (account_id = sa.account_id)
                       Heap Fetches: 0
                       Buffers: shared hit=3 read=1
   ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=5.138..5.139 rows=1 loops=1)
         Group Key: ap.account_id
         Batches: 1  Memory Usage: 24kB
         Buffers: shared hit=274 read=1
         ->  CTE Scan on audit_purchases ap  (cost=0.00..0.02 rows=1 width=4) (actual time=5.135..5.135 rows=1 loops=1)
               Buffers: shared hit=274 read=1
   ->  Index Scan using idx_account_audit on audit_logs al  (cost=0.58..686726.47 rows=24600001 width=27) (actual time=0.552..2.534 rows=30 loops=1)
         Index Cond: (account_id = ap.account_id)
         Buffers: shared hit=2 read=12
 Planning Time: 0.216 ms
 Execution Time: 7.742 ms
(35 rows)

Using subqueries

Rewriting the query using scalar subqueries converts the selected account ID into an InitPlan. The resulting value is then used as a runtime parameter in index scans against audit_logs. This eliminates join reordering opportunities and guarantees a parameterized access path. Both scalar subquery variants complete in approximately 7.3–7.6 ms, achieving over 99.9986% improvement.
Subquery Example 1

select *
from audit_logs al
where al.account_id in (
    select p.account_id
    from purchases p
    where p.account_id = (
        select account_id
        from accounts
        where account_type = 'HIGH_RISK'
        order by random()
        limit 1
    )
);

Execution plan

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=972.02..1302907.98 rows=24600001 width=27) (actual time=5.257..7.260 rows=30 loops=1)
   Buffers: shared hit=277 read=12
   InitPlan 1 (returns $0)
     ->  Limit  (cost=971.00..971.00 rows=1 width=12) (actual time=4.612..4.613 rows=1 loops=1)
           Buffers: shared hit=271
           ->  Sort  (cost=971.00..996.00 rows=10000 width=12) (actual time=4.611..4.611 rows=1 loops=1)
                 Sort Key: (random())
                 Sort Method: top-N heapsort  Memory: 25kB
                 Buffers: shared hit=271
                 ->  Seq Scan on accounts  (cost=0.00..921.00 rows=10000 width=12) (actual time=0.012..3.674 rows=10000 loops=1)
                       Filter: (account_type = 'HIGH_RISK'::text)
                       Rows Removed by Filter: 40000
                       Buffers: shared hit=271
   ->  Index Scan using idx_account_audit on audit_logs al  (cost=0.58..686755.98 rows=24600001 width=27) (actual time=5.067..7.059 rows=30 loops=1)
         Index Cond: (account_id = $0)
         Buffers: shared hit=274 read=11
   ->  Materialize  (cost=0.44..242.67 rows=9966 width=4) (actual time=0.006..0.006 rows=1 loops=30)
         Buffers: shared hit=3 read=1
         ->  Index Only Scan using idx_purchases_account on purchases p  (cost=0.44..192.84 rows=9966 width=4) (actual time=0.187..0.188 rows=1 loops=1)
               Index Cond: (account_id = $0)
               Heap Fetches: 0
               Buffers: shared hit=3 read=1
 Planning Time: 0.160 ms
 Execution Time: 7.286 ms
(24 rows)

Subquery Example 2

select *
from audit_logs al
where al.account_id = (
    select distinct p.account_id
    from purchases p
    where p.account_id = (
        select account_id
        from accounts
        where account_type = 'HIGH_RISK'
        order by random()
        limit 1
    )
);

Execution plan

                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_account_audit on audit_logs al  (cost=972.04..1507128.65 rows=52656088 width=26) (actual time=5.510..7.637 rows=30 loops=1)
   Index Cond: (account_id = (InitPlan 2).col1)
   Buffers: shared hit=277 read=12
   InitPlan 2
     ->  Limit  (cost=971.44..971.46 rows=1 width=4) (actual time=4.901..4.902 rows=1 loops=1)
           Buffers: shared hit=274 read=1
           InitPlan 1
             ->  Limit  (cost=971.00..971.00 rows=1 width=12) (actual time=4.455..4.456 rows=1 loops=1)
                   Buffers: shared hit=271
                   ->  Sort  (cost=971.00..996.00 rows=10000 width=12) (actual time=4.455..4.455 rows=1 loops=1)
                         Sort Key: (random())
                         Sort Method: top-N heapsort  Memory: 25kB
                         Buffers: shared hit=271
                         ->  Seq Scan on accounts  (cost=0.00..921.00 rows=10000 width=12) (actual time=0.018..3.522 rows=10000 loops=1)
                               Filter: (account_type = 'HIGH_RISK'::text)
                               Rows Removed by Filter: 40000
                               Buffers: shared hit=271
           ->  Index Only Scan using idx_purchases_account on purchases p  (cost=0.44..399.42 rows=20742 width=4) (actual time=4.900..4.900 rows=1 loops=1)
                 Index Cond: (account_id = (InitPlan 1).col1)
                 Heap Fetches: 0
                 Buffers: shared hit=274 read=1
 Planning:
   Buffers: shared hit=4
 Planning Time: 0.154 ms
 Execution Time: 7.663 ms
(25 rows)

Query execution summary

The table below summarizes execution time improvements across all tested rewrites, relative to the original PostgreSQL 17 and 14 query execution plans.
Setup / Rewrite                                   Execution Time (ms)   Speed-up Factor (×)
PG17 original (Hash Semi Join + Seq Scan)         521,625.433           1.0×
PG14 original query shape                         7.253                 71,919×
Forced nested loop (hashjoin off – diagnostic)    8.252                 63,212×
LATERAL JOIN + OFFSET 0                           6.519                 80,016×
LIMIT on purchases                                7.742                 67,376×
Subquery – IN with InitPlan (Example 1)           7.286                 71,593×
Subquery – scalar subquery (Example 2)            7.663                 68,071×

Key Takeaways

PostgreSQL 17's planner improvements are correct, intentional, and beneficial for most workloads. However, queries that combine CTEs, runtime randomness, and heavily skewed data can lead the planner to plans that look optimal by its global cost estimates but are inefficient for the actual runtime values. When runtime values drastically narrow result sets, query rewrites that force parameterized index scans—such as LATERAL joins, scalar subqueries, or bounded CTEs—provide predictable and stable performance.

PgBouncer Best Practices in Azure Database for PostgreSQL – Part 1
Introduction

Connection pooling is critical for scaling PostgreSQL workloads efficiently, especially in managed environments like Azure Database for PostgreSQL. PgBouncer, a lightweight connection pooler, helps manage thousands of client connections without overwhelming the database.

Connection pooling matters most when managing many concurrent database requests. PostgreSQL uses a process-per-connection model, which means too many active connections can:

Increase context switching overhead
Consume excessive CPU/memory
Degrade performance under load

PgBouncer addresses this by limiting active server connections and queuing additional client requests. However, misconfiguring key settings such as default_pool_size can still lead to CPU/memory pressure, connection bottlenecks, and degraded performance. Careful planning and tuning are essential to avoid these pitfalls.

Understanding connection pools

Before diving into tuning, it's important to understand how PgBouncer organizes connections:

PgBouncer creates a separate pool for each unique (database, user) combination. For example, if you have 2 application roles/users connecting to 2 databases, PgBouncer will allocate 4 pools.
Each pool maintains its own number of connections, determined by default_pool_size.

So, the total number of potential server connections is:

number_of_pools × default_pool_size

This is why sizing default_pool_size correctly is critical.

Azure PgBouncer defaults

Azure Database for PostgreSQL comes with preconfigured PgBouncer settings optimized for most workloads. Understanding these defaults is essential before making any tuning changes:

pool_mode: TRANSACTION (default in Azure; best for most workloads)
default_pool_size: 50 (range: 1–4950)
max_client_conn: 5000 (range: 1–50000)

Transaction mode support for prepared statements

PgBouncer now supports PostgreSQL PREPARED STATEMENTS in combination with TRANSACTION mode pooling. Previously, cached plans were difficult to manage in transaction mode, because PgBouncer had no way to confirm whether a connection newly allocated from the pool would benefit from plans cached by prior PREPARED STATEMENT operations. To address this, PgBouncer now provides a parameter that controls how many globally cached prepared statements remain in memory for any pooled connection to leverage:

max_prepared_statements: 200 (range: 0–5000)

PostgreSQL connection limits

For large tiers (e.g., 96 vCores), the default max_connections is 5000, with 15 reserved for system use. That means 4985 user connections are available. For more details, see maximum connection.

Sizing best practices

Proper sizing ensures optimal performance and resource utilization. Here's how to approach it:

1. Use transaction pooling

Start by confirming that pool_mode = TRANSACTION is enabled. This is already the Azure default and provides the best pooling efficiency for most web applications. If your application uses prepared statements, ensure you configure max_prepared_statements accordingly.

2. Determine your maximum active concurrent database operations (max_concurrent_ops)

Next, estimate how many concurrent active PostgreSQL backends your instance can sustain:

For CPU-bound OLTP workloads: keep max_concurrent_ops near 1.5x–2x the number of CPU vCores.
For I/O-heavy workloads: stay slightly higher than the vCore count.
Rule of thumb for 96 vCores: max_concurrent_ops ≈ 144–192.
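One way to sanity-check this estimate is to sample how many backends are actually busy at peak hours. A minimal sketch using the standard pg_stat_activity view (run it several times during peak load and take the high-water mark):

-- Count backends currently executing a query (idle sessions excluded)
SELECT count(*) AS active_backends
FROM pg_stat_activity
WHERE state = 'active';

If the observed peak stays well below the rule-of-thumb value, you can size max_concurrent_ops to the measured number instead.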
3. Divide across pools

Once you've estimated your max_concurrent_ops value, the next step is to distribute that capacity across all connection pools:

default_pool_size ≈ max_concurrent_ops / number_of_pools

Example:

max_concurrent_ops = 144
number_of_pools = 4
default_pool_size = 144 / 4 = 36

Sample configuration

To illustrate how these calculations translate into real-world settings, here's a sample PgBouncer configuration tuned for a scenario with four pools and a max_concurrent_ops target of 144.

pool_mode = transaction
default_pool_size = 36   ; tuned for 4 pools
max_client_conn = 5000

Quick reference table

For quick planning, the following table provides starting recommendations based on common Azure Database for PostgreSQL SKU sizes. Use these as a baseline and adjust according to your workload metrics.

SKU Size    Memory        Default max_connections   Pools   Suggested max_concurrent_ops   Starting default_pool_size
8 vCores    32 GiB        3437                      2       12–16                          6–12
16 vCores   64 GiB        5000                      2       24–32                          12–20
32 vCores   128 GiB       5000                      2       48–64                          30–40
48 vCores   192 GiB       5000                      2       72–92                          40–60
64 vCores   256 GiB       5000                      2       96–128                         50–70
96 vCores   384–672 GiB   5000                      2       144–192                        60–80

For all tiers ≥16 vCores, max_connections is capped at 5000 (with 15 reserved for system use).

Notes:

default_pool_size = max_concurrent_ops / number_of_pools
These values are starting recommendations. Validate them against actual workload metrics and adjust gradually.
Always ensure: (number_of_pools × default_pool_size) < max_connections − 15 (reserved system slots)

Monitoring and tuning

After applying your configuration, continuous monitoring is key. Here's how:

Use PgBouncer metrics in Azure Monitor to track active, idle, and waiting connections.
Run SHOW POOLS; for real-time stats; watch cl_waiting vs sv_idle (a worked example follows the pitfalls below).
For detailed monitoring and management, visit the Admin Console.

Recommended alerts:

Alert if waiting client connections > 0 while idle server connections = 0 (indicates pool exhaustion—consider increasing default_pool_size).
Alert if active server connections approach the configured default_pool_size (may indicate a need for tuning).
Alert if max_client_conn utilization exceeds 80% (risk of client-side connection errors).

Tip: If waiting client connections grow while idle server connections are zero, increase default_pool_size cautiously. Review performance regularly and adjust gradually.

Common pitfalls

Avoid these mistakes when configuring PgBouncer:

Changing pool mode to SESSION by default: transaction pooling is better for most apps, and session mode will not release a server connection until the session ends.
Ignoring pool count: multiplying a large default_pool_size by many pools can exhaust connections.
Confusing max_client_conn with PostgreSQL capacity: PgBouncer can accept many more clients than the server can process concurrently; client connections not being processed wait for resources.
Tuning without data: always review metrics before making changes.
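To put the monitoring advice above into practice, you can query the PgBouncer admin console interactively. A minimal sketch; the 6432 port and the special pgbouncer database are the usual built-in PgBouncer conventions on Flexible Server, so verify them for your deployment:

psql "host=<your-server> port=6432 dbname=pgbouncer user=<admin-user>"

-- Inside the admin console: per-pool connection states
SHOW POOLS;
-- cl_waiting > 0 while sv_idle = 0 means clients are queuing with no
-- idle server connections available, i.e., the pool is exhausted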
Conclusion

Choosing the right default_pool_size in Azure Database for PostgreSQL with PgBouncer is about balancing performance and resource efficiency. With built-in PgBouncer in Flexible Server, you can enable connection pooling with a single parameter, making it easy to get started quickly. The default settings are optimized for most workloads, and as your requirements grow, you can further tune parameters like default_pool_size and max_connections to suit your needs.

By understanding your workload, estimating your maximum active concurrent operations, dividing that capacity across pools while respecting PostgreSQL limits, and continuously monitoring and adjusting based on real data, you can achieve a stable, scalable, and cost-effective environment.

Further reading

For more in-depth guidance and real-world scenarios on PgBouncer configuration and tuning in Azure Database for PostgreSQL, explore the following resources:

Leverage built-in PgBouncer in Flexible Server
Monitoring PgBouncer in Azure PostgreSQL Flexible Server
Identify and solve connection performance in Azure Postgres
Not all Postgres connection pooling is equal
Connection handling best practice with PostgreSQL

Potential Consequences of Using Postgres as a Job Queue
Introduction

At small scale, using Postgres as a job queue is totally fine, and I'd even say it's the right call. Fewer moving parts, one less system to manage, ACID guarantees on your jobs. What's not to love?

The problem is that "small scale" has a ceiling, and the ceiling is lower than most people expect. When you've got thousands of concurrent workers hammering a jobs table with SELECT ... FOR UPDATE SKIP LOCKED, things start to behave in ways that aren't obvious from the application layer. CPU usage creeps up. Vacuum sometimes can't keep up. And in the wait event stats, you start seeing ominous entries like LWLock:MultiXactSLRU stacking up across many backends.

This pattern has tripped up teams more than a few times, and it usually plays out the same way: everything works fine in dev and staging, then goes off a cliff in production once the concurrency gets real. So let's dig into why this happens, and what the alternatives look like.

The Typical Pattern

When using Postgres as a job queue, the standard approach looks something like this:

CREATE TABLE job_queue (
    id bigserial PRIMARY KEY,
    status text NOT NULL DEFAULT 'pending',
    payload jsonb NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    locked_by text,
    locked_at timestamptz
);

CREATE INDEX idx_job_queue_status ON job_queue (status) WHERE status = 'pending';

Workers grab jobs with:

UPDATE job_queue
SET status = 'processing', locked_by = 'worker-42', locked_at = now()
WHERE id = (
    SELECT id FROM job_queue
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;

And then mark them done:

UPDATE job_queue SET status = 'completed' WHERE id = $1;

Some users may DELETE the row entirely. Either way, the lifecycle is: insert, lock-and-update, update-or-delete. Repeated thousands of times per second.

At low concurrency, this works very smoothly. SKIP LOCKED means workers don't block each other waiting for the same row. Postgres handles the locking, visibility, and ordering. It's elegant. So where does it break?

The MultiXact SLRU Problem

When multiple transactions hold locks on the same row, Postgres stores the set of lockers as a MultiXact ID – a pointer into a side structure under pg_multixact/. With SELECT ... FOR UPDATE SKIP LOCKED, users might think MultiXacts aren't involved – after all, SKIP LOCKED is supposed to avoid contention. But in practice, with many concurrent workers all racing to lock rows, there are brief windows where multiple transactions reference the same row before one of them "wins" and the others skip. If you combine this with any FOR SHARE or FOR KEY SHARE locks (which are commonly created implicitly by foreign key checks), MultiXact IDs start accumulating quickly.

The MultiXact data lives in SLRU buffers (Simple Least Recently Used) – a small, fixed-size shared memory cache. When backends need to read or write MultiXact data, they acquire LWLocks to access these buffers. Under high concurrency, this becomes a bottleneck:

wait_event_type |     wait_event
----------------+---------------------
 LWLock         | MultiXactMemberSLRU
 LWLock         | MultiXactOffsetSLRU

You'll see dozens or hundreds of backends piled up on these waits. The SLRU cache is small (by design – it's a fixed number of pages in shared memory), and when the working set of MultiXact lookups exceeds what fits in the cache, you get constant eviction and re-reads from disk.
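To check whether your own system is hitting this, you can rank the wait events across all backends. A minimal sketch using the standard pg_stat_activity view:

-- Backends grouped by current wait event; a pile-up of
-- MultiXactMemberSLRU / MultiXactOffsetSLRU rows is the signature above
SELECT wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;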
Every lock acquisition and release on a job row potentially triggers a MultiXact SLRU lookup, and at thousands of concurrent sessions, those lookups serialize on LWLocks. The result: CPU gets pegged, throughput collapses, and latency spikes – not because the queries are expensive, but because the locking infrastructure itself is overwhelmed.

Bloat: The Silent Killer

The other side of this coin is table and index bloat. Every job row goes through multiple updates (and possibly a delete), and each of those operations creates a new tuple version in the heap. The old versions stick around until VACUUM cleans them up. On a busy job queue table:

Dead tuples accumulate faster than autovacuum can clean them. By the time autovacuum finishes one pass, tens of thousands of new dead tuples have appeared. The table grows and grows.
Index bloat compounds the problem. Every index on the table also accumulates dead entries. The partial index on status = 'pending' gets thrashed especially hard, since rows constantly enter and leave that condition.
Sequential scans get slower. As the table bloats, even index scans start doing more I/O because the heap pages are sparsely populated.

Vacuum reclaims space at the end of the table, but can't reclaim space in the middle (unless the pages are completely empty). Job queue tables can grow to tens of gigabytes when the actual "live" data was only a few megabytes. It makes everything slower: scans, vacuum, even pg_dump.

You can mitigate this by running vacuum more aggressively (lower autovacuum_vacuum_scale_factor, higher autovacuum_vacuum_cost_limit), or by partitioning the table and dropping old partitions. But at some point, you're fighting the fundamental mismatch between MVCC's design goals and the write pattern of a job queue.

CPU and Lock Overhead

Beyond the SLRU contention and bloat, there's just the raw overhead of using Postgres's full transactional machinery for what is essentially a FIFO dispatch operation:

Every lock/unlock is a full WAL-logged transaction. Grabbing a job writes WAL. Marking it complete writes WAL. Deleting it writes WAL. On a system processing thousands of jobs per second, the WAL volume from the job queue alone can saturate your wal_writer and checkpoint processes.
SKIP LOCKED still touches rows. The name suggests rows are skipped, but Postgres still has to find them, check their lock status, and move on. With high concurrency, many workers end up scanning past the same locked rows before finding one they can claim. This is wasted CPU.
Snapshot management overhead also becomes an issue. Each transaction needs a consistent snapshot, and with thousands of concurrent transactions, the ProcArray (the structure that tracks active transactions) becomes a contention point itself. You might see LWLock:ProcArrayLock waits alongside the MultiXact ones.
Vacuum contention. While vacuum is cleaning up dead tuples, it needs locks too. On a table under constant write pressure, vacuum can interfere with the workers and vice versa. I've seen systems where disabling autovacuum on the job queue table improved throughput in the short term.
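Before reaching for the alternatives below, it's worth measuring how bad the bloat actually is. A minimal sketch against the standard statistics views, using the job_queue table from earlier:

-- Live vs. dead tuples and the most recent vacuum runs for the queue table
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'job_queue';

A dead-tuple count that stays far above the live count between autovacuum runs is a strong sign the table has outgrown this pattern.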
Better Alternatives

So what should you use instead? It depends on your requirements, but there are several options that handle high-throughput job dispatch more gracefully than a Postgres table.

Advisory Locks (Staying in Postgres)

If you want to stay within Postgres and avoid adding infrastructure, advisory locks are worth considering for certain queue patterns. Instead of locking rows, you lock on an abstract numeric key:

-- Worker tries to acquire a lock on the job ID
SELECT pg_try_advisory_lock(id)
FROM job_queue
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1;

Advisory locks are lightweight – they don't touch the heap, don't create MultiXact entries, and don't generate dead tuples. They live entirely in shared memory. The trade-off is that you lose the atomicity of FOR UPDATE SKIP LOCKED: you need to handle the case where a lock is acquired but the job processing fails, and you need to release the lock explicitly (or rely on session-end cleanup).

This approach works well when the queue depth is manageable and you want to avoid the MVCC overhead. But it's still Postgres, so you're still subject to connection limits, ProcArray overhead, and general resource contention at very high session counts.

pgq (Skytools)

pgq is purpose-built for exactly this problem. It's a queue implementation that sits inside Postgres but uses a batching model that avoids most of the row-level locking and MVCC pitfalls. Events are written to a queue table, but consumers read them in batches, and queue maintenance is done via a ticker process that manages rotation. The key advantages:

No row-level contention. Consumers don't lock individual rows.
Built-in batch processing. Events are consumed in chunks, reducing transaction overhead.
Efficient cleanup. Old events are rotated out rather than vacuumed row-by-row.

The downside is that pgq is not as actively maintained as it once was, and it adds operational complexity (the ticker daemon, consumer registration, etc.). But for teams already deep in the Postgres ecosystem, it's a battle-tested option.

Redis

For many teams, Redis is the natural choice for job queues. Using Redis lists (BRPOPLPUSH or the Streams API), you get:

Sub-millisecond dispatch latency. No disk I/O, no MVCC, no vacuum.
Atomic pop operations. Workers grab jobs without any locking protocol.
Simple scaling. Redis handles thousands of concurrent consumers trivially.

The trade-off is durability. Redis can persist to disk, but it's not ACID. If Redis crashes between a pop and the job completing, you might lose or duplicate work (though Redis Streams with consumer groups mitigate this significantly). For most job queue use cases, at-least-once delivery is acceptable, and Redis does that well.

Kafka

For truly high-throughput, distributed workloads, Apache Kafka is the heavyweight option. Kafka partitions give you parallel consumption with ordering guarantees per partition, durable storage, and replay capability. It's the right tool when:

You need to process thousands of events per second
Multiple consumers need to read the same events
You want event replay or audit trails
Your architecture is already event-driven

The operational overhead is nontrivial – ZooKeeper (or KRaft), brokers, topic management, consumer group coordination. But for teams already running Kafka for other reasons, adding a job queue topic is practically free.

Choosing the Right Tool

Here's a rough decision guide:

Under 100 concurrent workers, simple jobs: Postgres with SKIP LOCKED is fine
Moderate concurrency, want to stay in Postgres: advisory locks or pgq
High throughput, low-latency dispatch: Redis (Lists or Streams)
Massive scale, distributed, event replay: Kafka

Many teams that start with Postgres (reasonably) hit scaling problems and then try to fix Postgres rather than recognizing that the workload has outgrown the tool.
They throw more autovacuum workers at it, increase max_connections, add connection poolers – all of which help at the margins, but don't address the fundamental issue: Postgres's MVCC and locking machinery wasn't designed for this access pattern at high concurrency.

Conclusion

Postgres is great, but it can't be the best tool for every job. Using it as a job queue is a perfectly valid choice when your scale is modest. But when you're running thousands of concurrent workers, the combination of MultiXact SLRU contention, heap bloat, vacuum pressure, and raw locking overhead will eventually push you toward a purpose-built solution.

The good news is that you don't have to rip out everything. Advisory locks can buy you headroom without adding infrastructure. Redis can handle dispatch while Postgres keeps owning the data. And if you're already using Kafka, a job topic is a natural fit. Take your pick – there are many queueing options out there!

Connection Scaling in Elastic Clusters
As your applications grow, you need to manage database connections carefully to keep performance predictable and reliable. Azure Database for PostgreSQL Elastic Clusters, powered by Citus, support horizontal scaling by distributing data and queries across multiple nodes. This raises a key question: how do connections behave as you add more clients, grow the cluster, or upgrade node specifications?

In this post, you'll see how connection handling behaves in Elastic Clusters through controlled benchmarks across a small set of configurations:

Core count: 2 and 4
Cluster node count: 4 and 8

Using these setups, we measured how throughput, latency, and resource usage change as connection counts increase for different workloads. The goal is not just to show numbers, but to explain why those numbers behave the way they do. We deliberately chose small SKUs (2 cores/8 GB and 4 cores/32 GB) because it is easier to observe how a smaller compute responds as connection counts grow.

These results help you understand:

What influences connection capacity
When scaling up helps more than scaling out
How to tune your cluster to match real workloads

Future posts will expand this analysis to more configurations. For now, these insights can help you make informed decisions about cluster sizing and connection management.

What you'll learn:

How single-shard and multi-shard queries behave under increasing connection loads
The impact of scaling up (more cores per node) vs. scaling out (more nodes)
When PgBouncer helps—and when it hurts
Practical limits imposed by memory, CPU, and connection parameters
Configuration guidelines for different workload patterns

How Elastic Clusters Handle Client and Internal Connections

Before we dive into the performance results, let's first look at how Elastic Clusters handle connections and execute queries.

Cluster Architecture: Where Connections Go

In Elastic Clusters, data is distributed across multiple nodes using Citus. Each node can accept client connections and execute queries. This is true even when the data lives on another node. This design allows you to scale horizontally while staying compatible with PostgreSQL.

When your application opens a connection, the flow looks like this:

The client connects through a load balancer on port 7432
The load balancer routes the connection to any available node
That node executes the query and talks to other nodes if needed

From the client's point of view, this looks like a single server. Behind the scenes, however, the node can open additional internal connections to efficiently fetch distributed data.

Query Types Determine Connection Use

How many connections a query consumes depends on what kind of query you run.

Single-Shard Queries

Single-shard queries target data that lives on exactly one node. A typical example is a lookup by the distribution key.

If you connect to the node that owns the data, the query runs locally
If you connect to a different node, Citus uses an internal connection to fetch the data

Multi-Shard (Fan-Out) Queries

Multi-shard queries need data from multiple nodes. Aggregations and analytical queries often fall into this category. For these queries:

The connected node has internal connections to many nodes
Each node processes its local shards in parallel
Results are sent back and combined

As a result, one client connection can fan out into many internal connections. In both cases, when Citus needs to fetch data from another node, it first attempts to reuse a cached internal connection. If none is available, it creates a new connection and caches it.
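To make the two query types concrete, here is a minimal sketch. It assumes the pgbench_accounts table used in the benchmarks later in this post, distributed on its aid column:

-- Single-shard: the filter pins the distribution column (aid), so the
-- query is routed to the one node that owns the matching shard
SELECT abalance FROM pgbench_accounts WHERE aid = 42;

-- Multi-shard (fan-out): no distribution-column filter, so the query
-- runs against shards on every node and the results are combined
SELECT bid, sum(abalance) FROM pgbench_accounts GROUP BY bid;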
Important note: Explicit transactions change the behavior of single-shard queries. When a query runs inside a BEGIN … END block, Citus sends extra commands (BEGIN TRANSACTION + assign ID, COMMIT) to the worker node alongside the actual query. These extra network roundtrips add some coordination work compared to auto-commit mode.

Key Configuration Parameters

Elastic Clusters expose configuration parameters that control how connections are created, cached, and limited. To understand connection scaling—or to tune it effectively—you need to know what these parameters do and when they matter. This section focuses on a subset; see the Citus blog posts and documentation for more details.

Parameter                               Scope                    Description                                           Typical use case                          Default
max_connections                         Cluster wide             Total connections allowed                             Set based on resources                    Varies by SKU
citus.max_client_connections            Per-node                 Connections allowed from clients                      Limit client load per node                Varies by SKU
citus.max_cached_conns_per_worker       Per client connection    Cached internal connections                           Control parallelism in fan-out queries    1
citus.max_adaptive_executor_pool_size   Per client connection    Max connections for parallel multi-shard execution    Control parallelism in fan-out queries    Varies by SKU, 1 or 16

In PostgreSQL, memory plays a key role in connection scaling:

Idle connections typically use 2–5 MB
Active connections often use 10–20 MB, depending on work_mem and query behavior

Some parameters only affect certain query types: parameters like citus.max_adaptive_executor_pool_size only affect multi-shard queries.

Internal connection caching is critical: it has a large impact on performance.

Setting citus.max_cached_conns_per_worker to 0 forces Citus to open new connections repeatedly
This leads to additional connection setup work and reduces overall throughput
Values greater than 1 only help multi-shard workloads by allowing more parallelism

In practice, disabling internal caching is usually a bad idea.

SKU choice changes which limits you hit first:

Smaller nodes tend to reach memory thresholds sooner
Larger nodes may hit connection limits before CPU or memory saturation

How We Benchmarked

To understand how connections scale in Elastic Clusters, we ran a set of controlled benchmarks using standard PostgreSQL tooling.

Test Environment

Tool: pgbench (PostgreSQL's standard benchmarking utility), running on a VM in the same region and zone as the cluster
Dataset: Distributed pgbench_accounts table with a scale factor of 3000 (~38 GB) and distribution column aid (account ID)

Setup commands:

# Create table structure
pgbench -i -I "dt"

# Distribute the accounts table
psql -c "SELECT create_distributed_table('pgbench_accounts', 'aid');"

# Populate with data
pgbench -i -I "gvp" -s 3000

# Add index for multi-shard queries
CREATE INDEX index_bid ON pgbench_accounts(bid);

Workloads

Single-Shard Query: The single-shard workload targets one shard using the distribution key.

\set aid random(1, 100000 * scale)
BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
END;

Multi-Shard Query: The multi-shard workload aggregates data across multiple shards.

\set bid random(1, scale)
BEGIN;
SELECT sum(abalance) FROM pgbench_accounts WHERE bid = :bid;
END;

How the Benchmarks Run

Each benchmark:

Ran for 600 seconds
Used 16 pgbench worker threads
Varied the number of client connections (-c) per run

For each cluster configuration, we gradually increased the client count and stopped when we observed connection or out-of-memory errors.
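A typical invocation looked like the following sketch; the script file name, user, and endpoint are placeholders rather than values from the original runs:

pgbench -f single_shard.sql -c 64 -j 16 -T 600 \
    -h <cluster-endpoint> -p 7432 -U <user> <database>

Only the client count (-c) was swept upward between runs; the thread count (-j) and duration (-T) stayed fixed.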
Test Matrix

We evaluated combinations of core counts, node counts, and query types:

Workload                      Configurations tested
Single-shard                  2-core/4-node, 2-core/8-node, 4-core/4-node, 4-core/8-node
Multi-shard                   2-core/4-node, 2-core/8-node, 4-core/4-node, 4-core/8-node
Single-shard with PgBouncer   2-core/4-node, 2-core/8-node

Single-Shard Query Performance

Single-shard queries are a good model for OLTP workloads. They represent indexed lookups that target a single row (or a small set of rows).

Massive throughput gains with scaling: peak throughput climbed from ~11.4k TPS (2c4n) to ~48.3k TPS (4c8n), a 4.3× increase with quadruple the cores and double the nodes.
Higher concurrency brings a saturation point: each configuration has an optimal operating point for throughput.
Near-linear scaling: doubling cores roughly doubled throughput (≈2.4–2.5× gains), and doubling nodes added ~70–75% more TPS.

Key Takeaways from Single-shard Experiments

1. Near-Linear Scaling with Resources

Doubling CPU cores approximately doubled throughput, and doubling node count added 70–75% more throughput:

Configuration    Peak TPS   Peak Clients   CPU at Peak
2-core, 4-node   11,400     64             ~95%
2-core, 8-node   19,400     192            ~93%
4-core, 4-node   27,500     128            ~97%
4-core, 8-node   48,300     224            ~90%

What this means: adding compute resources effectively increases capacity. The slight sub-linearity when adding nodes (1.7× instead of 2×) is because the extra round-trips from explicit transactions only affect remote shard queries, and the fraction of remote queries grows with the cluster size. In a 4-node cluster, 75% of queries hit a remote shard; in an 8-node cluster, 87.5% do. Since each remote query inside a BEGIN … END block pays the extra round-trip cost, the aggregate overhead increases as more nodes are added.

2. Saturation Points Vary by Configuration

Each cluster configuration reaches peak throughput at a specific client count. Beyond this point, additional clients lead to:

Throughput plateau
Rising latency as the system operates near maximum capacity
Increased contention and context switching

Smaller configurations reach peak throughput sooner (64 clients for 2c4n), while larger ones handle several hundred concurrent clients before reaching peak.

3. Memory Constrains Maximum Connections, Not Node Count

A key insight is that adding nodes does not necessarily increase total client capacity when the workload is constrained by memory limits.

2-core clusters: maximum ~500 concurrent clients (both 4-node and 8-node)
4-core clusters: maximum ~1000 concurrent clients (both 4-node and 8-node)

Why? Each node maintains roughly client_count total connections:

client_count / node_count external client connections (via the load balancer)
The remaining connections are cached internal Citus connections

Once memory capacity is fully utilized, adding nodes alone does not always raise client capacity. External connections per node drop, but internal connections replace them, so each node still carries roughly the same load—and can still run out of memory.

4. CPU Utilization Reaches 90–97% at Peak

All configurations fully utilized the available CPU at peak throughput, confirming CPU as the primary throughput bottleneck (with memory limiting connection capacity separately).

5. Latency Characteristics

Latency remains low (<5–10 ms) until the system approaches saturation.
Larger clusters maintain better latency under load:

4c8n cluster: sub-5 ms latency up to ~200 clients
2c4n cluster: latency exceeds 10–20 ms once saturated (~64 clients)

Practical implication: right-sizing your cluster provides headroom for traffic spikes without latency degradation.

Single-Shard Query Performance with PgBouncer

On Elastic Clusters, you can enable PgBouncer using server parameters. After enabling it, you connect to PgBouncer instances through the load balancer on port 8432. This connection pooling layer allows the cluster to handle far more concurrent client connections.

PgBouncer instances: one per node, behind the load balancer
Pool size: 50 connections per node (default)
Pooling mode: transaction pooling

Eliminated connection limits: handled thousands of concurrent clients without out-of-memory errors.
Slightly lower peak TPS: ~10% reduction in maximum throughput due to pooling overhead.
Linear latency growth: predictable queuing behavior once the pool saturates.

Key Takeaways from Single-shard Experiments with PgBouncer

1. Graceful handling of high connection counts

Without PgBouncer, 2-core clusters reached memory capacity around 500 clients. With PgBouncer enabled, tests successfully ran with 1000+ concurrent clients. Throughput plateaued as the pool saturated, but the system remained stable.

2. Throughput-Latency Trade-Off

Once the connection pool fills (~50 active connections per node), additional clients queue:

Throughput stabilizes at the pool's processing capacity
Latency increases with queue depth
Predictable, graceful behavior under high load

3. When to use PgBouncer

Recommended for:

Applications with bursty connection patterns (many short-lived connections)
High connection counts that exceed node memory capacity
Workloads where occasional queuing latency is acceptable

Not recommended for:

Applications requiring maximum throughput from steady workloads
Long-running transactions (incompatible with transaction pooling)
Scenarios where every millisecond of latency matters

Multi-Shard Query Performance

Multi-shard (fan-out) queries aggregate or join data across multiple nodes, representing analytical or reporting workloads.

Massive throughput gains with scaling: scaling up and out dramatically improved fan-out query throughput, from ~52 TPS to ~1008 TPS on the largest configuration (≈20× gain).
Low concurrency saturation: multi-shard queries peaked at low client counts—just 8 clients for 2-core clusters and ~96 for 4-core, 8-node setups.
Latency improves with scale: larger clusters maintained sub-100 ms latency under higher concurrency, while smaller ones degraded quickly.
Memory and I/O bottlenecks: sufficient memory is crucial for fan-out queries, as memory starvation causes throughput to plateau well before CPU is fully utilized.

Key Takeaways from Multi-shard Experiments

1. Lower Absolute Throughput Compared to Single-shard Workloads

Even the best-performing configuration (4c8n at ~1000 TPS) achieves ~2% of single-shard throughput (~48k TPS). This reflects the inherent complexity of analytical fan-out queries: cross-node data aggregation and the retrieval of significantly larger amounts of data.

2. Scaling Provides Dramatic Gains

While absolute TPS remains modest, the 20× improvement from the smallest to the largest configuration demonstrates that multi-shard workloads benefit enormously from scaling.

Configuration    Peak TPS   Peak Clients   Latency at Peak
2-core, 4-node   52         8              ~150 ms
2-core, 8-node   168        8              ~50 ms
4-core, 4-node   489        32             ~65 ms
4-core, 8-node   1,008      96             ~95 ms
3. Saturates at Low Concurrency

Multi-shard queries reach peak throughput at fewer concurrent clients than single-shard queries:

2-core clusters: saturate at just 8 clients
4-core, 8-node: saturates around 96 clients

After these points, the system maintains stable TPS, with latency rising as load increases.

4. Memory and I/O Bottlenecks Dominate Small Configurations

The 2-core configurations (8 GB RAM per node) showed clear resource pressure:

Memory pressure: working sets exceeded available RAM, causing paging
High IOPS: thousands of disk operations per second indicated swapping to disk
Throughput ceiling: memory availability capped TPS before CPU was fully utilized

In contrast, 4-core configurations (32 GB RAM per node) kept working sets in memory, achieving much higher throughput with minimal I/O.

Key insight: for multi-shard workloads, sufficient memory is more important than CPU cores. Adequate memory provisioning is essential to unlock full performance.

5. Latency Escalates Rapidly Under Overload

All configurations delivered fast response times (<50 ms) at low loads. Once saturated:

2c4n cluster: latency increased noticeably under sustained overload
4c8n cluster: remained under 100 ms until approaching the 96-client saturation point

Practical implication: for multi-shard query workloads, over-provision resources to maintain consistent and predictable latency.

Conclusion

Connection scaling in Azure Database for PostgreSQL Elastic Clusters is a multifaceted challenge that depends on workload characteristics, cluster configuration, and resource constraints. Key takeaways from our benchmarking on read workloads:

For single-shard OLTP-like workloads:

Scaling provides near-linear throughput gains (4.3× from 2c4n to 4c8n)
Memory, not node count, determines the maximum number of concurrent connections
CPU becomes the throughput bottleneck at peak load
PgBouncer trades ~10% throughput for almost unlimited connection scalability

For multi-shard OLAP-like workloads:

Throughput operates at a different scale than single-shard workloads
Relative gains from scaling are massive (20× improvement observed)
Memory sufficiency is critical—adequate RAM is essential to maintain strong performance
Saturation occurs at low concurrency; keep client counts conservative

General principles:

Scale up (higher SKU) to support more connections and memory-intensive queries
Scale out (more nodes) to increase aggregate throughput and data capacity
Use PgBouncer to manage connection bursts and exceed node memory limits
Monitor continuously and adjust based on actual workload patterns

By understanding these dynamics and applying the decision frameworks provided, you can architect Elastic Clusters that deliver optimal performance, reliability, and cost-efficiency for your specific application requirements.

References and Resources

https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-elastic-clusters
https://learn.microsoft.com/en-us/postgresql/citus
https://www.postgresql.org/docs/current/runtime-config-connection.html
https://www.postgresql.org/docs/current/pgbench.html
Analyzing the Limits of Connection Scalability in Postgres | Microsoft Community Hub