azure database for postgresql flexible server
119 TopicsSELECT * FROM build2026_sessions WHERE postgres = true;
Microsoft Build 2026 is around the corner, and this year it’s shaping up to be a big one for PostgreSQL experts and enthusiasts. If you’re a developer working with Postgres, or just love exploring new database technology, there's plenty to get excited about. Microsoft’s new cloud-first evolution of PostgreSQL, Azure HorizonDB, alongside sessions featuring Azure Database for PostgreSQL, will highlight how Postgres is powering the next wave of AI-driven applications. A new horizon in Postgres Build 2026 arrives at a time when the role of databases in modern apps is evolving rapidly. From enabling AI model integration to scaling seamlessly across the cloud, PostgreSQL developers today are dealing with more complex demands than ever. That’s why Azure HorizonDB – Microsoft’s new cloud-native PostgreSQL service – is generating so much buzz ahead of Build. What is Azure HorizonDB? In short, it’s a reimagined version of PostgreSQL designed for cloud-scale performance and AI-era workloads. Azure HorizonDB, introduces a distributed architecture that decouples compute and storage, delivering sub-millisecond latencies and three times the throughput of self-managed Postgres at massive scale. It aims to preserve Postgres’s beloved features and SQL ecosystem while adding next-generation capabilities: built-in vector indexing for high-speed AI/ML retrieval, the ability to run AI models and vector operations directly in the database, and multi-zone replication for resilience. For Postgres developers, this means less time stitching together external data stores or machine learning services – and more time building powerful apps on a unified platform that’s both familiar and built for the future. The bottom line: Microsoft Build 2026 is an ideal opportunity for developers to see Azure HorizonDB in action, learn best practices for modern PostgreSQL architectures, and understand how to leverage Postgres in new scenarios like generative AI and multi-agent applications. Read on for a rundown of sessions covering these topics, complete with what you’ll learn from each one. Top sessions for PostgreSQL databases on Azure Below are key sessions tailored for PostgreSQL users and those interested in Azure HorizonDB, with session types and highlights of what you’ll gain by attending. 🎤 Breakout: From Rows to Reasoning: Designing Databases for AI Apps and Agents (BRK223, 45 min, in-person and digital options) Discover how to architect databases that can power tomorrow’s intelligent applications. This technical breakout will show how AI-ready databases can move beyond plain transactions. You’ll see live demos of integrating transactional, analytical, and vector data in one unified platform, with Azure’s new database capabilities, including Azure HorizonDB. Learn how to simplify your stack by eliminating separate analytics engines or vector stores. The session will highlight patterns that reduce data movement and latency so your apps can efficiently reason over live data with minimal complexity. 🧪 Hands-on lab: Create Advanced Postgres-Powered Agentic Apps with Azure HorizonDB (LAB511, 75 min, in person and digital options) Roll up your sleeves and get hands-on building a real multi-agent AI application with Postgres. In this advanced lab, you’ll create a production-ready AI agent powered by Azure HorizonDB as an all-in-one data, search, and intelligence layer. Experiment with retrieval-augmented generation (RAG) by combining semantic vector search (DiskANN) with traditional SQL queries right inside the database. Implement hybrid search and agent workflows without resorting to external vector databases or glue code – thanks to HorizonDB’s built-in vector indexing and in-database AI model capabilities. This lab is perfect for developers who want to experience how HorizonDB can simplify your stack and boost performance for AI-driven apps. Multiple hands-on labs are offered to suite your schedule. 💻 Demo: Simplify App Dev with Cloud-Native PostgreSQL in Azure HorizonDB (DEM364, 25 min, in-person and digital options) See how to cut your development time and complexity with built-in AI and search features in Postgres. This fast-paced demo shows how Azure HorizonDB helps eliminate the need for separate search engines and AI services by pulling those capabilities straight into PostgreSQL. Expect to learn how you can run hybrid vector + keyword queries using SQL, integrate AI models directly from within the database, and apply full-text search (BM25) and semantic ranking to get smarter results. If you’re eager to deliver intelligent apps faster, with fewer moving parts, this session will show how HorizonDB simplifies your architecture without sacrificing performance. ⚡Lightning Talk: Cloud-Native PostgreSQL, Rebuilt for Scale: Azure HorizonDB (LTG413, 15 min, in-person only) Get a rapid-fire introduction to the architecture behind HorizonDB’s eye-popping performance. This short talk dives into how HorizonDB re-architects core PostgreSQL to deliver effortless scale out and blazing speed. Learn how decoupled compute and storage, predictive caching, and multi-region replication combine to achieve sub-millisecond query latencies and 3× higher throughput than standard Postgres. If you care about performance tuning and high-scale database design, don’t miss this quick primer on the tech under HorizonDB’s hood. 👥 Interactive Table Talk: Scaling PostgreSQL for AI Apps: Patterns and Tradeoffs (TT622, 45 min, in-person only) Bring your questions and ideas to this collaborative discussion. In this open round-table session with community and Microsoft experts, you’ll explore architecture patterns for scaling PostgreSQL to meet the demands of agent-based and AI-driven applications. Discuss real-world strategies for handling vector embeddings in Postgres, unifying relational and document data, integrating with AI models, and more. Compare the trade-offs between different scaling approaches – from monolithic to microservices, sharding strategies, and new technologies like HorizonDB – and learn where each design shines or struggles in production. Come ready to share your experiences and learn from others in the room. ▶️ On-Demand: Smarter PostgreSQL Migrations to Power Modern, Intelligent Apps (OD822, 30 min, digital only) Planning to migrate to Postgres or move your databases to Azure? Start here. This on-demand session focuses on new tools and proven strategies to migrate large-scale databases to Azure Database for PostgreSQL quickly and safely. You’ll see AI-assisted migration tools in action that minimize downtime and risk when moving terabytes of data. Just as importantly, you’ll learn how migrating to Azure unlocks advanced capabilities – from boosted performance and enhanced security to AI-ready features – helping you turn your newly migrated data into intelligent apps and services. On-demand session will be available to stream on the first day of Build. Meet the team: PostgreSQL expert meetups If you’re attending Build in person, stop by the Expert Meetup (EMU) area and head to the relational cloud databases booth. This is your chance to talk directly with the engineers and product teams behind PostgreSQL on Azure. Bring your questions about architecture decisions, scaling patterns, migrations, AI workloads, or anything else on your mind. Whether you want to sanity-check a design, dig deeper into something you saw in a session, or give direct feedback, the EMU space is designed for exactly that convo. How to get the most out of Build (and what to do next) With so much great content lined up, how do you decide where to start? It really depends on what you’re most excited about: Curious about AI and agentic apps: Start with From Rows to Reasoning, then go deeper with the Simplify App Dev with HorizonDB demo or get hands-on at the Azure HorizonDB labs to see how these ideas work in practice. Performance and scale are your focus: The short Lightning Talk on HorizonDB’s cloud-native architecture and the Table Talk on scaling Postgres will both provide unique insights and pro tips for running Postgres at massive scale. Planning a migration to PostgreSQL on Azure: Watch the Smarter PostgreSQL Migrations on-demand session to learn how to migrate large workloads with minimal downtime, and the benefits you can unlock after moving to Azure. Looking for real answers to your specific questions: Make time for the PostgreSQL Expert Meetup area to connect directly with the team. No matter which sessions you choose, Build 2026 promises to be an exciting event for the PostgreSQL developer community. Browse the session catalog, save the sessions that match your interests, and we’ll see you at Build.42Views1like0CommentsReal-World Success Stories with PostgreSQL on Azure
Organizations rarely leap into cloud migrations or AI-powered systems overnight. They progress in deliberate stages, establishing a reliable data foundation, optimizing for performance, and then accelerating innovation. Across healthcare, financial services, and AI startups, companies are navigating this journey on Azure Database for PostgreSQL: a fully managed, enterprise-ready PostgreSQL environment with 58% lower total cost of ownership (TCO) compared to on-premises deployments. This post walks through real customer stories that span the full arc, from lift-and-shift migration to production-grade AI agent development, illustrating how Azure Database for PostgreSQL supports scalability, performance, security, and AI-readiness at every stage. Migrating with Confidence: Apollo Hospitals & August AI Apollo Hospitals operates a network of more than 74 hospitals and needed to move beyond a legacy on-premises Oracle system that had become difficult to manage and couldn't keep pace with growing data volumes. IT teams were spending their time on maintenance rather than innovation. Apollo migrated its core hospital information system backend to Azure Database for PostgreSQL. Working with partner Quadrant Technologies, the team lifted and shifted critical applications while using Azure DevOps to orchestrate CI/CD pipelines and Azure Application Insights for telemetry and observability. The results: 99.95% availability across hospital systems Database transactions executing within 5 seconds 40% reduction in deployment times via modern CI/CD pipelines Decreased operational overhead, freeing IT staff for higher-value work With a stable, scalable PostgreSQL backend in place, Apollo is now exploring real-time analytics and AI-enabled tools like Microsoft 365 Copilot to advance patient care. "We saw Azure Database for PostgreSQL as the right foundation for the future. It's open, cost-effective, and capable of supporting the hospital information system we built in-house." — Shankar Krishna A., General Manager of IT, Apollo Hospitals Apollo's experience is not unique. August AI, a healthcare-tech startup offering an AI-driven medical companion, migrated its entire stack to Azure—with Azure Database for PostgreSQL storing mission-critical patient data while meeting strict compliance requirements such as HIPAA. The result: scaling from roughly 500,000 users to 3.5 million+ users worldwide, with zero downtime during the cutover, completed in just three months. As Founder and CEO Anuruddh Mishra noted: "We receive a log of queries that are not performing optimally, and within a couple of minutes we can optimize that query with PostgreSQL on Azure and move on". Modernizing at Scale: Nasdaq Migration is often the first step. Nasdaq demonstrates what becomes possible when organizations modernize their architecture on a scalable data foundation. To improve its Nasdaq Boardvantage platform—used by corporate boards to collaborate on governance documents—Nasdaq re-architected on Azure. The team containerized services with Azure Kubernetes Service (AKS) and adopted Azure Database for PostgreSQL alongside Azure Database for MySQL as persistent data stores for governance workloads. This architecture provided the flexibility, performance, and security required for a multitenant platform handling sensitive board materials. With the data layer in place, Nasdaq integrated Microsoft Foundry and Azure OpenAI to deliver AI-powered summarization and workflow automation. The measurable outcomes: 60% reduction in reading time through AI-powered document summarization 25% decrease in administrative preparation time across board workflows Up to 97% accuracy in AI-generated summaries and meeting minutes A reusable AI framework established for future extensibility "Both Azure Database for PostgreSQL and Azure Database for MySQL gave us the right balance of performance, security, and control. The governance workloads we handle are unique, so we needed something that could meet those isolation and encryption requirements." — Scott Ellison, Vice President of Technology, Nasdaq Building Intelligent Applications: SubgenAI and OpenAI Azure Database for PostgreSQL now supports native vector search via pgvector, high-performance DiskANN indexing, semantic operators and AI model management, and integrated graph capabilities for relationship reasoning—making it a production-ready foundation for intelligent applications. SubgenAI, a European generative AI company, built its flagship platform Serenity Star on Azure Database for PostgreSQL and Microsoft Foundry to transform AI agent development from a code-heavy, fragmented process into a streamlined, no-code experience. A core technical requirement: the platform's retrieval-augmented generation (RAG) system needs efficient vector search against embedded content while maintaining enterprise-grade reliability. After evaluating several database options, SubgenAI chose Azure Database for PostgreSQL with pgvector for its accurate and scalable vector similarity search. Serenity Star customers can now: Launch AI agents in as little as 15 minutes Cut coding and development time by 50% Resolve most AI agent queries in under 60 seconds [ "With Microsoft and Azure Database for PostgreSQL we have total control and an environment that is truly dynamic and can adapt to the evolution we're looking for." — Julia Schröder Langhaeuser, VP of Product Serenity Star, SubgenAI At the extreme end of scale, OpenAI runs PostgreSQL on Azure to support production systems behind ChatGPT. As write scalability limits emerged on an initially unsharded single primary instance, OpenAI offloaded write-heavy operations to other systems and optimized read workloads using PgBouncer for connection pooling. The Azure Database for PostgreSQL team responded by developing the elastic clusters feature, enabling horizontal scaling through row-based and schema-based sharding. The team reduced connection latency from approximately 50 ms to under 5 ms, scaled reads horizontally with multiple replicas, and improved reliability by prioritizing critical requests—all achieved by a small team making systematic optimizations on open-source PostgreSQL. "After all the optimization we did, we are super happy with Postgres right now for our read-heavy workloads. It's really scalable and reliable." — Bohan Zhang, Member of the Technical Staff, OpenAI Meeting You Where You Are Beyond these stories, organizations like BMW Group (cloud-native applications at global scale), Ahold Delhaize (highly available retail applications), Mott MacDonald (an AI agent accelerating onboarding and spreading best practices across 220,000 employees), and Multitude (scaling responsibly in regulated environments) all run on Azure Database for PostgreSQL. The service offers 99.99% availability with automatic failover and SLA, independent compute and storage scaling, and intelligent performance recommendations, available across 60+ Azure regions. Developer tooling including the PostgreSQL extension for Visual Studio Code with GitHub Copilot further accelerates productivity. Whether you are planning your first migration or building production AI agents, these stories share a clear signal: Azure Database for PostgreSQL delivers a scalable, secure, AI-ready data foundation at every stage of growth. Explore full customer stories in depth in the eBook: Customer Success Stories with Azure Database for PostgreSQL.102Views1like0CommentsNative Database/Query Monitoring with Azure Database for PostgreSQL
Monitoring PostgreSQL Database Engine The PostgreSQL database engine provides multiple administrative views, which can be leveraged to capture metrics of activities inside of the database engine including, but not limited to, database activities, query activities, execution time, database resource locks etc for initial and deep analysis on database & query performance. Database Administrators should use these views and their corresponding metrics to perform initial level analysis. Critical insights can be gathered around database activity, performance and system activity for reactive and proactive analysis. In this blog, I am going to show the important administrative views and their corresponding SQL statements for capturing database and query details. The following section provides details about the statistics views and their usage. It contains the most important and mostly used statistics views, there are other views available that can be used to get into further details. NOTE: The queries and flows provided below are a few of the ways for each scenario. Eventually, the final approach will depend on and be dictated by the issue at hand or its symptoms. Pre-requisites The following parameters should be enabled in Azure Database for PostgreSQL “Server parameters” before using the statistics views: track_activities track_counts track_io_timing pg_stat_statements.track = ALL The following extensions should be installed before using the stats: create extension pg_stat_statements; create extension pgstattuple; NOTE: All SQL queries mentioned below are as per PostgreSQL version 18. Schema & Data for Analysis All the "Example Scenario" mentioned in the analysis queries are based on the following schema and corresponding volumetrics: Table Name No. of rows customer 993,814 invoice 20,229,119 orders 4,497,292 product 2,818,000 Statistics Views and SQL Queries for Analysis The following SQL statements can be used to perform 1 st level monitoring on the database, tables and queries. Note that the output counters and metrics from the SQL statements represent cumulative data since the last server restart. The statistics should be reset to zero before performing troubleshooting or performance tuning exercises. Use the below mentioned SQL statements to reset statistics at database, table and IO level. Database level stats reset SELECT pg_stat_reset(); IO level stats reset. SELECT pg_stat_reset_shared('io'); Reset pg_stat_statements metrics per user, query and database id or for the entire database - pg_stat_statements contain query level metrics for a database, user or a single query. We can reset the statistics at these individual levels. If any of the parameters are not specified, the default value 0 is used for each of them and the statistics that match with other parameters will be reset. "userid", "dbid" (Database ID), "queryid" can be fetched from pg_roles (column - oid), pg_stat_database (column - datid) and pg_stat_activity (column - query_id) respectively. SELECT pg_stat_statements_reset(userid, dbid, queryid); SELECT pg_stat_statements_reset(); SQL Activity Monitoring: This query uses pg_stat_activity statistics view, which contains details of current activities in the database server. It can be used to track queries in transactions, transaction execution time, wait event details if any etc. The WHERE clause can be used to filter activities based on client IP address (client_addr), application name (application_name), user (usename) who has submitted the query and text of the SQL statement. SELECT datname AS dbname, pid, usename as username, application_name, client_addr as client_ip_addr, xact_start AS transaction_start_time, query_start AS query_start_time, wait_event_type AS db_waitign_on, wait_event AS waiting_activity, CAST(query AS varchar(100)) AS sql_query, backend_type AS db_operation FROM pg_stat_activity WHERE <conditions>; Interpreting output: This output should be used to find overall activity in your database. username, application_name, client_ip_addr - The user, application and client IP address from where the query is being executed. transaction_start_time - When transaction started (using BEGIN keyword or corresponding API). query_start_time - Start time of currently active query in the transaction. wait_event and wait_event_type columns output the events on which the query is waiting, if applicable. Refer to the official links for details on wait events and wait event types. db_operation - Type of database operation that is being performed by the application. E.g. autovacuum, parallel worker, client backend, etc. Example Scenario: Application team is complaining about higher load in "retail_db" database and it is causing multiple SQL statements to run slow, team wants to know whether ad-hoc SQL statements are being run using "psql" client by application developers. As a DBA, you want to look at the entire database activity - no. of SQL statements, who are running to SQLs and which client are they coming from. Above mentioned SQL can be used with filters on "application_name" (value = "psql") and "dbname" (value = "retail_db"). It generates the following output: Analysis: From the above output, it can be seen that "psql" client is indeed getting used from a few IP addresses. Next Step: pg_terminate_backend(<pid>) can be used to kill the applications with pid 22210, 22209 and 21993. Similarly, the same SQL statement can be used to look at overall SQL activities in individual database. SQL Statement Level monitoring This is one of the most important SQL statements that describes operations at query level. It provides execution time detail of queries, read time, write time, disk read and disk write. Output of this query should serve as one of the starting point of query troubleshooting and query optimization. SELECT queryid , CAST(query AS varchar(100)) AS sql_stmt, calls AS no_of_executions, total_exec_time, min_exec_time, max_exec_time, mean_exec_time AS avg_exec_time, rows AS rows_fetched, shared_blks_hit AS blocks_read_from_buffer, shared_blks_read AS blocks_read_from_disk, shared_blks_written AS blks_written_to_disk, shared_blk_read_time, shared_blk_write_time FROM pg_stat_statements WHERE query LIKE '%<sql-stmt>%'; Interpreting output: Output of this query can be used to dive deeper into SQL performance issues. sql_stmt - Provides the SQL statement that is being run. This is the query that is being analyzed for tuning. no_of_executions - How many times this SQL statement has been executed so far (since the reset of pg_stat_statements). total_exec_time - This is the execution time for the query combining all runs. e.g. if the query was run 10 times and each time took 1.5s, then total_exec_time would be 15s. min_exec_time - Min. time the query took to execute. This can be considered as the benchmark execution time for this query. max_exec_time - Max. time the query took to execute. avg_exec_time - Avg. execution time of the query. This time should fall within the query SLA. rows_fetched - No. of rows needed by application (SQL query) blocks_read_from_buffer & blocks_read_from_disk - Data blocks (8KB size) read from PostgreSQL shared memory and from disk respectively. shared_blk_read_time & shared_blk_write_time - Total time spent in reading from and writing to disk. This time is included in total_exec_time. If ratio of shared_blk_read_time or shared_blk_write_time or both with total_exec_time is higher, then IO is causing high execution time in this query. If the ratio is less then other complex SQL operations like join, order by, group by, functions etc. are taking longer to execute. Example Scenario: We will be using the same SQL statement that was used in the example scenario of "3. Application-wise IO Activities". But in this case, using "SQL Statement Level monitoring" we will look at the execution details of the query rather than IO usage. Following is the output of first execution (after server restart - shared memory was empty) of the SQL statement Analysis: Here is the observation, execution time of the query was 154.81 seconds. Total rows needed by application was 4,497,292 out of which 341 8KB blocks were read from shared memory (PostgreSQL bufferpool) and 58504 8KB blocks were read from disk. Following is the output after 2nd execution (after resetting pg_stat_statements): Analysis: The query was executed 4 times. Execution time got reduced to 95 seconds, but it is still high. Even though all rows (4497292 x 4) were read from shared memory (buffer pool) as blocks_read_from_disk was zero, execution time is very high. Next Step: Even if the execution time is found to be within SLA, the query needs to be analyzed for its execution plan. Use EXPLAIN ANALYZE command to analyze the access plan or execution path of the query. Query optimization must be done for this SQL statement. Table-level Bloat Statistics PostgreSQL uses MVCC (Multi Version Concurrency Control) to avoid read and write conflicts, which means one application can read rows while they are being modified by another application. This is done by storing multiple versions of the same row. But over time older versions, which are not needed by applications, become overhead as they occupy space. This is known as "Bloating". It can cause both storage and performance issue. The query mentioned below provides bloating percentage on a table, which should be monitored on a regular basis and DBAs should decide on the autovaccuum settings. SELECT (table_len/1024/1024) AS table_size_mb, tuple_count AS no_of_live_rows, (tuple_len/1024/1024) AS live_row_size_mb, dead_tuple_count AS no_of_dead_rows, (dead_tuple_len/1024/1024) AS dead_row_size_mb, tuple_percent AS live_rows_percent, dead_tuple_percent AS dead_rows_percent, free_percent AS free_space_percent, (((table_len – tuple_len)/table_len)*100) AS bloat_percent FROM pgstattuple('<table-name>'); Interpreting output: table_size_mb - This is the total table size, including bloating. no_of_live_rows - No. of actual usable rows in the table. live_row_size_mb - This is the actual size of the table. This is the amount of data in this table that can be used by applications. no_of_dead_rows - No. of irrelevant rows in the table. dead_row_size_mb - Size of irrelevant rows in the table. This is the amount of data that is not needed by any application, but is still there in the table and occupying storage. This is the data that has to be removed from the table. bloat_percent - Percentage of dead rows occupying the table. If it goes beyond 25-30%, manual VACUUM should be used to remove bloating. Example Scenario: invoice_history table contains more than 20 M rows and has a size of 1630 MB. Deleting records from a table does not reduce the size in PostgreSQL because of MVCC support to keep multiple versions of the same row. The deleted rows still occupy storage space, it is known as bloating as shown below. "Table level Bloat Statistics" can be used to find the bloat percentage of a table. Next Step: Aggressive AUTOVACUUM has to be configured to remove bloating on time. As AUTOVACUUM does not free up storage space, manual FULL VACUUM has to be executed to reclaim storage space. Note that full vacuum takes exclusive lock on tables. Bloat percentage changed after executing full vacuum. Table Bloat using Stat table Table bloating information using statistics table. SELECT schemaname, relname AS tblname, n_live_tup AS live_rows, n_dead_tup AS dead_rows, ((n_dead_tup - n_live_tup)/100) AS tbl_bloat_percent, last_vacuum AS last_manual_vacuum, last_autovacuum FROM pg_stat_all_tables WHERE relname = '<tblname>'; Interpreting output: Same as "Table-level Bloat Statistics". Database Activity Monitoring This SQL statement provides row level counters of DML statements for a database. SELECT datname AS dbname, blks_read AS total_no_of_read_ops, blks_hit AS no_of_blks_read_from_bufferpool, tup_returned AS no_of_rows_scanned, tup_fetched AS no_of_rows_fetched, tup_inserted AS no_of_rows_inserted, tup_updated AS no_of_rows_updated, tup_deleted AS no_of_rows_deteled, deadlocks AS no_of_deadlocks, (blk_read_time/1000) AS blk_read_time_s, (blk_write_time/1000) AS blk_write_time_s FROM pg_stat_database; Interpreting output: Output shows the impact of DML operations in a database. total_no_of_read_ops - This is the total block read request executed by the database engine. no_of_blks_read_from_bufferpool - Out of total read requests, how many blocks were already there in memory or shared_mem. no_of_rows_scanned & no_of_rows_fetched - Total no. of rows read from the database and the total no. of rows actually needed by application. Higher difference in the two metrics should call for query tuning using better indexes, join strategies etc. blk_read_time_s & blk_read_write_s - Time taken in seconds, to perform read and write operations. no_of_deadlocks - Higher no. of deadlocks should be investigated. Execution of application SQL queries might have to be re-looked if higher deadlocks create performance or application issues. Index-level Bloat Statistics – Live rows, dead rows, bloat This is similar to query no. 8, but it provides bloating information at index level. SELECT (index_size/1024/1024) AS idx_size_mb, (index_size/8192) AS index_pages, empty_pages, deleted_pages, ((index_pages - deleted_pages)/index_pages) * 100 AS idx_bloat_percent FROM pgstatindex('<idx-name>'); Interpreting output: The output is similar to table-level bloat statistics, but it is for indexes. Monitor IO activities Output of the following SQL statement shows the overall IO (Input/Output) activities for a database. It can be used to infer whether database workload is read heavy or write heavy. For read workloads, database level bufferpool_hit_ratio metrics can be used to increase or decrease shared memory configuration parameter. SELECT backend_type , object, context AS reason_for_io, reads AS no_of_reads, (read_bytes/1024) AS read_data_kb, read_time AS read_time_ms, writes AS no_of_writes, (write_bytes/1024) AS write_data_kb, write_time AS write_time_ms, writebacks AS no_of_blocks_to_storage, writeback_time AS block_to_storge_write_time_ms , hits AS no_of_reads_from_bufferpool, 100 * (hits / (hits + reads)) AS bufferpool_hit_ratio FROM pg_stat_io; Interpreting output: The output showcases database level IO metrics. backend_type - Type of database operation that is being performed by the application. E.g. autovacuum, parallel worker, client backend, etc. reason_for_io - corresponding IO operation. Following are the possible output values to look out for. normal - Read from or write to shared buffers. vacuum - IO used by VACUUM operation. bulkread & bulkwrite - Disk read and write operations. no_of_reads - Total read operations. read_data_kb - Amount of data read (in KB) across all operations. no_of_writes - Total write operations. write_data_kb - Amount of data written (in KB) across all operations. read_time_ms & write_time_ms - Duration of all read & write operations. Example Scenario: Extending the previous example, application team is complaining about overall performance of the database. As a DBA you want to find out the overall read and write operations to confirm whether it is causing slowness in the database. It will give an idea about how busy the database has been. "Monitor IO Activities" SQL can be used to find out the trigger of higher IO activities in the database. Following is the output: Analysis: Output above shows that most of the read and write operations are emanating from applications ("backend_type" = "client backend"). This gives an idea of database usage by the applications. Next Step: If resource utilization of this database is high, we can create read replica to offload the reads and increase server memory and CPU to handle the load. Or else, if possible, application throttling can be infused by reducing the no. of max connections or using application level throttling. Application-wise IO Activities Query no. 2 above provides database level IO, the following SQL statement can be used to track application level IO summary. As there is to one-to-one mapping between application id in pg_stat_activity and pg_stat_io, backend_type is the only way to relate these two admin views. SELECT psa.datname AS dbname, pid, psa.usename AS username, psa.application_name, psa.client_addr as client_ip_addr, psa.xact_start AS transaction_start_time, psa.query_start AS query_start_time, psa.wait_event_type AS db_waitign_on, psa.wait_event AS waiting_activity, CAST(psa.query AS varchar(100)) AS sql_query, state as current_state, psa.backend_type AS db_operation, psi.context AS reason_for_io, psi.reads AS no_of_reads, psi.writes AS no_of_writes FROM pg_stat_activity psa JOIN pg_stat_io psi ON psa.backend_type = psi.backend_type WHERE psa.query LIKE '%<sql-stmt>%'; Interpreting output: username, application_name and client_ip_addr column output can be used to track user who submitted the query, application from where the query was submitted and IP address of the application server from where query was executed. transaction_start_time & query_start_time show the execution start time of transaction and currently active query inside of the transaction. These metrics should be used to find the execution time of transaction and queries inside of the transaction. wait_event and wait_event_type columns output the events on which the query is waiting, if applicable. Refer to the official links for details on wait events and wait event types. reason_for_io column outputs corresponding IO operation. Following are the possible output values to look out for. normal - Read from or write to shared buffers. vacuum - IO used by VACUUM operation. bulkread & bulkwrite - Disk read and write operations. no_of_reads & no_of_writes - Total no. of read and write operations by the query. Example Scenario: So far we have been looking at database-level activities. In most of the cases, we need to track application level activities like read, write operations, application elapsed time (how long the query is getting executed for). These are some of the critical information that DBA should use while troubleshooting query performance or application slowness issue. Whereas pg_stat_statements should be the go to place to look at SQL or application-level statistics, above SQL statement can be used to monitor application level IO operations. Executed SQL statement : SELECT c.cust_id, c.cust_email, c.cust_type, o.order_no, o.order_date FROM customer c, orders o WHERE c.cust_id = o.cust_id ORDER BY o.order_date DESC; Output of the query can be filtered using partial SQL statement in the WHERE clause or using PID of the SQL statement as shown below. Following is the output of the query above (showing only relevant columns): Analysis: Highlighted output above shows 3 types of IO operations - normal, bulkwrite and bulkread. "normal" means that client is reading from buffer, but "bulkread" and "bulwrite" indicates read from disk and write to disk (temporary file write for sorting!). Next Step: SQL statement should to be analysed further for index usage, high disk read and write IO. EXPLAIN (with ANALYZE) statement can be used to understand query behaviour. Other monitoring SQL statements should be used to dig deeper into the execution details. Table & Index Operations The following SQL statement can be used to capture table-level DML operations and its corresponding index read operations. Output of this query can be used to create index if sequential scans on a table is high, also, how an application is impacting the table rows using INSERT, UPDATE, DELETE statements. SELECT t.schemaname AS tabschema, t.relname AS tabname, t.seq_scan AS no_of_seq_scan, t.seq_tup_read AS seq_scan_rows_read, t.n_tup_ins AS rows_inserted, t.n_tup_upd AS rows_updated, t.n_tup_del AS rows_deleted, t.n_live_tup AS total_active_rows, t.n_dead_tup AS total_dead_rows, i.schemaname AS idxschema, i.indexrelname AS idxname, i.idx_tup_read AS idx_rows_read, i.idx_tup_fetch AS idx_rows_fetched FROM pg_stat_all_tables t JOIN pg_stat_all_indexes i ON t.relid = i.relid AND t.relname = i.relname ; Interpreting output: It is almost similar to database activity monitoring except for the following columns. total_dead_rows - Total no. of rows that are not relevant for any transaction in PostgreSQL. total_active_rows - Total no. of actual rows in the table. If the ratio of dead rows vs active rows is higher, it shows VACUUM is not working properly. It can be fixed by running manual vacuum or changing the configuration of AUTOVACUUM. Table level Read IO This SQL statement provides critical metrics on read operations for all or individual tables. heap_blks_read and heap_blks_hit should be used to find out the disk read vs memory read in tables. Higher disk read indicates memory crunch and can be a factor in deciding whether to increase shared memory size. SELECT schemaname AS tabschema, relname AS tabname, heap_blks_read AS no_of_tab_blks_read, heap_blks_hit AS no_of_buffer_blks_read, idx_blks_read AS no_of_idx_blks_read, idx_blks_hit AS no_of_buffer_idx_blks_read FROM pg_statio_all_tables WHERE relname = '<table-name>'; Lock Information Data modification by applications or system commands locks rows and tables. Other application trying to modify the same row will get into waiting state thereby increasing the response time. Following SQL statement can be used to find out what type of locks is being held by applications on all or specific tables. SELECT db.datname AS dbname, tbl.relname AS tblname, lck.pid, lck.locktype , mode AS lock_mode, CASE lck.granted WHEN True THEN 'lock_held' WHEN False THEN 'lock_waiting' ELSE 'not_known' END AS lock_status, lck.waitstart AS lock_wait_start_time FROM pg_database db JOIN pg_locks lck ON lck.database = db.oid JOIN pg_class tbl ON lck.relation = tbl.oid WHERE tbl.relname NOT LIKE '%pg_%' AND tbl.relname = '<tablename>'; Interpreting output: Output rows contain database name, table name, process id of the query and the following fields. locktype - Object of the lock. E.g. table, page, row (tuple) etc. lock_status - Whether the lock has been granted to this application or application is waiting for lock. lock_wait_start_time - If lock waiting state, then when the wait was started. Example Scenario: Following SQL statement takes a lock on "orders" table as it is getting executed inside of a transaction which has not been either committed or rolled back. "Lock Information" SQL can be used to find out who is holding lock on a table as shown below: Next Step: If lock needs to be released then the application with pid 15505 must be terminated using pg_terminate_backend(15505). Who is Holding and who is Waiting for Locks If locks are being held long enough to block a critical application, it becomes imperative to terminate the application holding the lock. Following SQL statement provides information about applications waiting for locks and applications holding those locks so that necessary actions can be taken by DBA. WITH lock_holder AS (SELECT db.datname AS dbname, tbl.relname AS tblname, pid, locktype , mode AS lock_mode FROM pg_database db JOIN pg_locks lck ON lck.database = db.oid JOIN pg_class tbl ON lck.relation = tbl.oid WHERE granted = true), lock_waiter AS (SELECT db.datname AS dbname, tbl.relname AS tblname, pid, locktype , mode AS lock_mode FROM pg_database db JOIN pg_locks lck ON lck.database = db.oid JOIN pg_class tbl ON lck.relation = tbl.oid) SELECT lh.pid AS pid_holding, lh.locktype AS holding_lock_type, lh.lock_mode AS holding_lock_mode, lh.tblname AS holding_lock_table, lw.pid AS pid_waiting, lw.locktype AS waiting_lock_type, lw.lock_mode AS waiting_lock_mode, lw.tblname AS waiting_lock_table FROM lock_holder lh JOIN lock_waiter lw ON lh.tblname = lw.tblname WHERE lw.tblname='<tbl-name>'; Interpreting output: This is an extension of query 11 above. Following are the output columns: pid_holding - Application that is holding lock. holding_lock_type - Object (table, page, row) that has been locked. holding_lock_mode - Lock mode that is being held. holding_lock_table - Table that is being locked. pid_waiting - Application that is waiting for lock. waiting_lock_type - Object (table, page, row) that is waiting for lock. waiting_lock_mode - Lock mode that is needed. waiting_lock_table - Lock on table that is waiting. Example Scenario: There can be scenarios where multiple applications can take write lock on the same rows at the same time. Depending on lock_timeout, one of the applications has to wait for that duration. Criticality of an application might dictate the terms of waiting on locks and DBAs have to terminate application holding the lock. Following is an application that is updating rows in a transaction, but has not released the lock. And following application has timed out because of lock wait. Next Step: Using above query we can find out which application is waiting and who is holding lock on a table as shown below. Depending on priority one of the applications can be terminated by DBA. Diagnostic Flow: The following chart can be used to decide which admin view to use and when. It also shows how Azure Monitor and PostgreSQL admin views work in tandem at any analysis scenarios. Summary: PostgreSQL provides helpful administrative views that can be used by DBAs (Database Administrators) to perform initial level of analysis at engine side before looking at any infrastructure level issues. The initial analysis can be useful in relating infrastructure issues with database performance as well, if there is any.296Views0likes0CommentsPostgres horizontal scaling with elastic clusters on Azure Database for PostgreSQL Flexible server
Elastic clusters (Preview) is a Flexible server feature that enables a database to be scaled out horizontally beyond a single node. Powered by the open-source Citus extension developed by Microsoft, elastic clusters offer horizontal scale out through row and schema-based sharding capabilities. With elastic clusters, you can future proof horizontal scale out, adding nodes as needed as the application grows and outgrows the capacity of a single node. By providing a unified endpoint to the cluster, elastic clusters eliminate the need for complex application-layer sharding, significantly simplifying the management of growing Flexible Server deployments. Online shard management and isolation capabilities are a core part of the services offloading this burden from the application stack. With elastic clusters, scaling out your applications has never been easier or more efficient. Benefits of using elastic clusters (Preview) Familiarity: It’s the same Flexible server you know and love, with the same feature set. Horizontal scaling: Grow your cluster out by adding more Flexible servers – lifting the single node limitation for your application growth. Simplicity: Offload complexity of sharding from your application layer to the managed service. Cost efficiency: You are not paying more than you would for the same amount of Flexible servers as nodes in your elastic cluster with the benefit of simplified management of the fleet. AI ready: Modern AI applications require storage of vast amounts of vectorized data. Their data models, however, are usually simple and scale out very well with sharding models provided by elastic clusters. What are elastic clusters? The elastic clusters feature of Azure Database for PostgreSQL Flexible server is a managed offering of horizontal scaling, powered by the Citus extension. Elastic clusters handle managing and configuring multiple PostgreSQL instances as a single resource, setting up the nodes, and providing cluster level choices for high availability, disaster recovery, compute and storage and resiliency. Nodes in your cluster reside in the same availability zones for optimal performance and the various Flexible server features and capabilities to become cluster aware such as backups, metrics, Entra ID authentication to just name a few. Every node in the elastic cluster is capable of handling DML, you can run your read and write queries on any node on the cluster, and it will automatically fetch the data from their current placement. In many distributed systems, often all queries go through a single node which can make it a bottleneck. With Elastic Clusters, however, when you connect via port 7432, the workload is automatically distributed across all nodes. What is sharding? Sharding is the act of splitting data into logical groups, that can then be moved into separate machines and retrieved individually. Sharding also supports queries that need data from two or more different shard groups living on different nodes. Such queries would be computed individually by each node and then rolled up by the initiating node before providing the result. You can also read more about sharding models here: Sharding models - elastic clusters Schema-based sharding Schema-based sharding is the most intuitive way to understand how this works. Every table in a distributed schema becomes its own shard. Shards can be co-located - moved around between nodes as a group. If your application creates a schema for every tenant in the system (or a microservice) all the data will live together on the same node and even if it moves around, the system will route queries to the correct node. The benefit is a pure architectural lift-and-shift for any application that already uses schema based sharding. Suddenly your legacy application can scale across many machines without a single line of code being changed. This is assuming that your application already uses schema based sharding but for database per tenant models the conversion to a schema-based model is easy. Follow the Tutorial: Design for microservices with elastic cluster to try schema-based sharding yourself! Row-based sharding Row-based sharding is different and requires some DDL modifications and choice of distribution key but with those few changes, it is a very scalable method when there is a need for high density packing of tenants. One way to think about row-based sharding is to paint parallels to table partitioning. With partitioning you decide what column can split the data into sensible chunks that can be queried in isolation – greatly reducing the amount of data you need to weed through to get results. A good example is partitioning by year, knowing that you rarely reach to past year. You then need to make sure that your queries all use the partition key in the WHERE clause, because without it the system doesn’t know how to filter out (prune) partitions that don’t hold the required data – neglecting the benefit of the partitioning. As with partitioning, row-based sharding required determining a distribution key. Selecting a good key can be a challenge itself – something with good cardinality (so data can be spread out), not skewed (so that data doesn’t fully land into one big shard) and logically splitting your tenants (avoiding cross tenant queries). Typical good distribution keys are tenant_id and device_id (to fan out writes) but this will heavily depend on your application. If you go back to the partitioning example, year was a great partition key, but it is a terrible distribution key. It has low cardinality, and only one node per year would be taking writes by not leveraging spreading out data to multiple shards. Instead of distributing on year, use a key with better cardinality in your system – you can still leverage partitioning on top of distribution as partitioned tables can be distributed. Having selected the key, the next step is easy. Just distribute the tables and this is as simple as calling: SELECT create_distributed_table('accounts', 'account_id'); SELECT create_distributed_table('campaigns', 'campaign_id'); There are many table types in elastic clusters, you can read more about them here: Table types - elastic clusters As with schema based sharding, each tenant is assigned into a shard but unlike row-based sharding a shard can be shared among tenants. Like with partitioning you can then think of a shard as a smaller PostgreSQL table and if you put all of them together – you would get the whole thing. In the same vein, as with partitioning if queries ran by the application miss the distribution key in their WHERE clause – they would start hitting all nodes in the cluster to find the required data. In some cases, such fan-out queries are desired and a form of implementing parallelization of query execution among all cluster nodes. In most cases, however, it is more efficient to have queries that filter down to one node. The biggest benefit of row-based sharding is the density of how tight tenants can be packed on the same machine. It’s as good as it gets, which is rows in the same table – the drawback, is the work up front required to decide on the schema and potential query changes in the application. Follow either tutorial: Design multitenant database with elastic cluster or Design a real-time dashboard with elastic cluster, to try out row-based sharding yourself! When things get hot There will be times where either the system needs room to grow (CPU, Storage, connections), or a specific tenant becomes very noisy (making other tenants experience degraded performance). Elastic clusters allow you to address both. In the case of capacity (either CPU, Storage, maximum connections) being reached, just grow your cluster by adding nodes and call trigger online rebalancer. Without blocking your existing workloads, data will be redistributed among the available nodes (including the new node) by disk size, shard count or any strategy that you decide to implement yourself. This is as easy as connecting to the cluster and issuing: SELECT citus_rebalance_start(); You may not need to rebalance depending on how you sharded your data. The new node will be immediately used for new inserts as soon as it becomes online and that may be just enough to redistribute the load on your cluster. When one tenant throws a party, you can decide to move them out to their own node though, sometimes it is easier to move the well behaving tenants if they happen to store less data than the noisy one. With elastic clusters this can be as easy as running: SELECT citus_schema_move('wideworldimporters','10.54.0.254', 7003); Using elastic clusters in Azure Database for PostgreSQL Flexible server Step 1: Create an elastic cluster Begin by setting up a new instance of elastic cluster with Azure Database for PostgreSQL Flexible server Azure portal Quickstart: Create elastic cluster Select your cluster size, up to 10 nodes can be deployed during preview. Review your settings after confirming the compute and storage configuration. Step 2: Choosing a sharding model Citus on Flexible server offers two sharding models: row-based sharding and schema-based sharding. You can learn more about sharding models here: Sharding models - elastic clusters Step 3: Follow one of the tutorials For row-based sharding: Design multitenant database with elastic cluster Design a real-time dashboard with elastic cluster For schema-based sharding: Design for microservices with elastic cluster Frequently Asked Questions In what regions are elastic clusters available? Elastic clusters are currently available in the following regions: East Asia East US North Europe UK South West US West US 3 Please see Limitations of elastic clusters for an up-to date list. Will all features of Flexible server be available on elastic clusters? We aim to feature parity, some cluster incompatible extensions like TimescaleDB may not be available. Other features may need a bit of work to become cluster aware. Are there any limitations I should be aware of? Please see Limitations of elastic clusters Ready to dive in? Get started for free with an Azure free account Azure Database for PostgreSQL Quickstart: Create elastic cluster with Azure portal Learn more Elastic clusters with PostgreSQL Flexible server Distributed Postgres. At any scale. - Citus Data citusdata/citus: Distributed PostgreSQL as an extension3.1KViews5likes3CommentsEnd-to-end workload observability with Query Store for primary and replicas
Query performance doesn’t stop at the primary Most PostgreSQL architectures don’t run on a single node anymore. Reads get offloaded. Replica chains grow. And when performance issues hit, the hardest part is often simple: where did the queries actually run? With the latest query store capabilities in Azure Database for PostgreSQL flexible server, you can now capture workload executed not just on the primary, but also on read replicas—including cascading read replicas—and export the captured runtime stats, wait stats, and query text into Azure Monitor Logs (Log Analytics workspace / LAWS). See the real hotspot: isolate which node (primary vs replica) is slow. Know why: break down time by waits (CPU, I/O, locks) per query. Connect the dots: correlate query IDs to query text, and inspect sampled parameters locally in azure_sys on the primary when you need input context (parameters aren’t exported to LAWS). Centralize analysis: query everything with KQL in LAWS, across servers. What you’ll build This post walks through a reproducible demo that provisions a primary server, a read replica, and a cascading read replica, then runs a TPC-H–based workload across all three to generate query store data you can analyze locally and in Log Analytics. Enable query store capture (including query text) and parameter sampling for parameterized queries. Enable wait sampling so query store can record wait statistics. Export runtime stats, wait stats, and SQL text to LAWS using resource-specific tables. Validate capture on read replicas and cascading read replicas (not just the primary). Prerequisites Azure CLI logged in (az login) and permission to create a resource group, Log Analytics workspace, and PostgreSQL flexible servers. psql and curl available on your machine. PostgreSQL flexible server on General Purpose or Memory Optimized tier (query store and replicas aren’t supported on Burstable). PostgreSQL 14+ to test out cascading replicas. Networking: the script opens firewall access broadly for demos—tighten for production. Architecture (primary + replica chain + LAWS) You’ll deploy four resources: Primary server: read/write node. Read replica (level 1): read-only node created from the primary. Cascading read replica (level 2): read-only node created from replica level 1. Log Analytics workspace (LAWS): central place to query Query Store telemetry across all nodes. If Diagnostic Settings is properly configured, each server streams query store telemetry to LAWS—but how it’s kept locally differs by role. On the primary, query store data is recorded in-memory, then persisted locally in the azure_sys database, and then exported to LAWS. On read replicas (including cascading replicas), query store data is recorded in-memory only and then exported to LAWS. Bottom line: use LAWS for fleet-wide visibility, and use the primary’s azure_sys when you need deep local inspection (like parameter samples). Deploy the demo environment The fastest way to reproduce the scenario is to run the end-to-end bash script which you can download from https://raw.githubusercontent.com/Azure-Samples/azure-postgresql-query-store/refs/heads/main/may2026/script/query_store_demo.sh Save the file to a local directory in your Linux shell, and name the file query_store_demo.sh. To invoke the script, at minimum, you must assign a string password for the administrator login of the instances of the flexible servers it creates, and invoke the script like this: ADMIN_PASSWORD=<Your_Strong_Password> ./query_store_demo.sh Optionally, you can also override default values for other environment variables used by the script: Variable Purpose Default SUBSCRIPTION_ID Azure subscription ID to use (current default subscription) BASE_NAME Base name for all resources (used in naming servers, resource groups, etc.) pgqswait{YYYYMMDDHHMMSS} RESOURCE_GROUP Azure resource group name rg-{BASE_NAME} LOCATION Azure region for resources southeastasia PRIMARY_SERVER Name of primary PostgreSQL server {BASE_NAME}-primary REPLICA_1 Name of first-level read replica {BASE_NAME}-readreplica REPLICA_2 Name of second-level cascading read replica {BASE_NAME}-cascadereadreplica LOG_ANALYTICS_WORKSPACE Log Analytics workspace name law-{BASE_NAME} LOG_ANALYTICS_LOCATION Azure region for Log Analytics workspace southeastasia ADMIN_USER PostgreSQL admin username pgadmin ADMIN_PASSWORD PostgreSQL admin password (REQUIRED) SKU_NAME PostgreSQL server SKU (compute tier) Standard_D4ds_v5 TIER PostgreSQL pricing tier GeneralPurpose STORAGE_SIZE Storage size in GB 64 VERSION PostgreSQL version (minimum 14 for cascading replicas) 17 PRIMARY_DATABASE Initial database name postgres SQL_BASE_URL Base URL for downloading SQL scripts https://raw.githubusercontent.com/Azure-Samples/azure-postgresql-query-store/refs/heads/main/may2026/script/query_store_demo.sh TPCH_DDL_URL URL for TPC-H schema DDL file {SQL_BASE_URL}/schema/tpch_ddl.sql WORKLOAD_REPETITIONS Number of times to execute each workload query (minimum 5) 10 AUTO_APPROVE Skip confirmation prompt and proceed automatically false If, for example, you want to not only pass the ADMIN_PASSWORD but also override the LOCATION, you could do it like this: ADMIN_PASSWORD=<Your_Strong_Password> LOCATION=canadacentral ./query_store_demo.sh In a bit over 1 hour, the script will do the following steps: Step 1 — Provision first part of the infrastructure The infrastructure provisioned in this phase consists of: A resource group in which all resources are deployed. An instance of Log Analytics workspace, where all flexible server instances will send their query store related logs. A primary (read-write) flexible server. Step 2 — Configure primary server Now it's time to configure one new server parameters on your primary server so that query store emits query text to LAWS, so that we can correlate quey IDs to something recognizable. Query IDs are great for aggregation—but you still need the SQL. Turn on query text emission so you can correlate runtime and waits back to the actual statement text. Do this by setting pg_qs.emit_query_text to on. Refer to our documentation to learn how to set the value of a server parameter. Step 3 — Provision second part of the infrastructure The infrastructure provisioned in this phase consists of: A read replica (read-only) whose source is the primary server. A cascade read replica (read-only), whose source is the previously created read replica. Notice that when read replicas are created, they inherit the server parameter values from their source server. Because we have configured query store related settings on the primary server already, the intermediate read replica inherits its server parameters from that primary, and the cascade read replica inherits them from the intermediate replica. Step 4 — Export query store to Log Analytics (LAWS) Now for the payoff, we want to stream the data to Log Analytics so you can query across nodes, build dashboards, and alert. The script configures diagnostic settings on the primary and both replicas to send logs to a Log Analytics workspace using resource-specific tables. This is the key to cross-node visibility: each server exports its own captured telemetry, and you can slice by resource in a single KQL query. Query store runtime stats: execution counts, elapsed time, and other performance counters. Query store wait stats: wait breakdown attributed to queries. Query store SQL text: query text to decode query IDs. Note: Query store parameter samples are not included in the Log Analytics export. Parameters are stored locally per server in azure_sys, and on read replicas azure_sys is read-only—so don’t depend on replicas for parameter inspection. LAWS receives runtime stats, wait stats, and query text. Diagnostics settings for an instance of flexible server can be configured via portal. In the resource menu, under Monitoring, select Diagnostic settings. Add a new diagnostic setting, select a destination Log Analytics workspace, and the individual log catergories which you want to stream to that LAWS, and save the changes. For Destination table it's highly recommended to use Resource specific (one table per signal with proper schema) over Azure diagnostics (legacy one table for everything). With Azure diagnostics, all logs from all resource types land into a single table (AzureDiagnostics). It's a wide table with many columns. New columns get added as services emit new fields. If the 500 column limit is hit, extra fields go into the AdditionalFields column (a dynamic JSON). Querying on attributes stored in that column might have huge performance and query cost impact. The schema is inconsistent and difficult to discover. You must always filter events in that table by ResourceType and Category. On the other hand, with Resource specific, logs are written to separate tables per resource type and category. Therefore, each table has a well-defined schema and columns are strongly typed. Tables are smaller and faster to query. Queries on these tables are simpler don't need filtering by ResourceType and Category. Performance-wise, they also support faster ingestion and faster querying. They also support selecting different table plans and retention settings for each table. And, more importantly, role-based access control (RBAC) permissions can be applied at table level, allowing you to control access to telemetry in a more granular way. Note: If you want to see any of the images in this article in better quality, click on them to see them in their original size. This can also be configured using Azure CLI command az monitor diagnostic-settings create. Make sure that the --export-to-resource-specific parameter is set to true, which is the equivalent of selecting Resource specific for Destination table in portal UI. Setting this parameter to false, would mean that you want to use AzureDiagnostics as the destination table, which we don't recommend using. Step 5 — Run some workload In this phase the script loads a TPC-H schema and executes workload SQL across different nodes so that you can prove replica capture. Query it in Log Analytics Once the workload completed and data was streamed to Log Analytics, you can open your Log Analytics workspace, and start querying the relevant tables. If you don't know how to start issuing queries in a Log Analytics workspace, refer to Get started with log queries in Azure Monitor Logs. In your Log Analytics workspace, when you select Logs in the resource menu, you can access the Queries hub. By default, it should open automatically unless you have configured it to not show, in which case you can open by selecting Queries hub on the top right corner of the Logs home screen. If you add a filter in the queries hub for Resource type equals Azure Database for PostgreSQL Flexible Server, you'll be able to access multiple examples of queries which might help you get started querying the log categories we support for our service. You can run any of them by selecting Run on the summarization card that describes the query or, if you hover the mouse over the card, you can select Load to editor so that the query is copied over to the active query window, and you can run it or modified it further. Following, there are a few more query examples which can be useful to analyze the workload executed in this experiment. Top queries by total time (across all nodes) To get the list of 10 queries with higher duration from the ones that ran on any of the three nodes. KQL PGSQLQueryStoreRuntime | summarize total_time_ms = sum(TotalExecDurationMs) by QueryId, LogicalServerName | top 10 by total_time_ms desc Results Important: Results might be slightly different on each execution of the experiment. Where queries wait on each node List the most frequent wait events observed on user initiated queries across all nodes. KQL PGSQLQueryStoreWaits | join kind=inner (PGSQLQueryStoreRuntime) on QueryId | summarize total_waits_sampled = sum(Calls) by Event, EventType, LogicalServerName | order by total_waits_sampled desc Results Important: Results might be slightly different on each execution of the experiment. Decode query IDs (join runtime stats with SQL text) Top 20 queries the most frequent wait events observed on user initiated queries across all nodes. KQL PGSQLQueryStoreRuntime | join kind=inner (PGSQLQueryStoreQueryText) on QueryId | where QueryType == 'select' | project LogicalServerName, QueryId, TotalExecDurationMs, QueryText | top 20 by TotalExecDurationMs desc Results Important: Results might be slightly different on each execution of the experiment. Compare primary vs replicas (workload distribution) Find total number of query executions and accumulated duration of all those executions for each node. KQL PGSQLQueryStoreRuntime | summarize execs = sum(Calls), total_time_ms = sum(TotalExecDurationMs) by LogicalServerName | order by total_time_ms desc Results Important: Results might be slightly different on each execution of the experiment. Replica-only hotspots (find what’s slow off the primary) Find top 10 queries executed by their aggregated duration, focusing on what was executed on read replicas only. KQL let Replicas = dynamic(["pgqswait20260505220501-readreplica", "pgqswait20260505220501-cascadereadreplica"]); PGSQLQueryStoreRuntime | where LogicalServerName in (Replicas) | summarize total_time_ms = sum(TotalExecDurationMs) by QueryId, LogicalServerName | top 10 by total_time_ms desc Results Important: Results might be slightly different on each execution of the experiment. QPI now supports query store stats collected on replicas You can now use Query Performance Insight workbooks to analyze query store information not only on your primary server, as you were used to, but you can also get that valuable information on your read replicas. Why replica workload capture is a big deal This is the unlock: you can now answer performance questions in replica-heavy architectures without stitching together partial signals. Per-node truth: see the slow queries on the node where they actually ran (primary vs replica vs cascading replica). Faster root cause: runtime + waits gives you “slow” and “why” in one place. Replica tuning that sticks: identify replica-specific bottlenecks (I/O saturation, lock waits, CPU pressure) and tune with evidence. Centralized observability: export to LAWS so you can build dashboards, alerts, and cross-server comparisons with KQL. Unlock query visibility: Access query text without database permissions. Fine grain control on who can view query text: Using resource specific tables in LAWS, you can decide which users can access the table in which text of the queries is kept. Parameter-aware debugging: sampled parameters can help reproduce issues and explain plan changes, but they’re stored locally in azure_sys and not exported to LAWS. In practice, rely on the primary for parameter inspection (replicas have read-only azure_sys). Operational notes (quick but important) Expect a delay: Query store stats and LAWS ingestion aren’t instant. Give it a few minutes after running workload. Mind retention: Query store retention and Log Analytics retention are separate knobs. Tune them to balance troubleshooting value and cost. Production hygiene: don’t use wide-open firewall rules outside of a demo. Clean up When you’re done, delete the resource group: az group delete --name <RESOURCE_GROUP> --yes --no-wait Bottom line Query store in Azure Database for PostgreSQL flexible server now matches how modern architectures run—across primary, read replicas, and cascading replicas—and LAWS gives you a single place to query, compare, and act.406Views13likes0CommentsWhen PostgreSQL v17 Chooses the Wrong Plan: A Deep Dive into CTEs with Data Skew
Common Table Expressions or CTEs provide temporary named result set that can be referenced within single complex SQL statements. CTE offer great flexibility in breaking down complex SQL queries into multiple readable parts and enables recursion. CTEs allows materialized result set which allows users to refer the result set multiple times. PostgreSQL 17 propagates column statistics from MATERIALIZED CTEs into the parent query, allowing the planner to estimate hash aggregation and join cardinalities more accurately. However, when queries combine highly skewed data distributions with runtime randomness (for example, ORDER BY random() LIMIT 1), this increased planner confidence can sometimes lead to execution plans that are correct in theory but catastrophic for specific runtime values. What is Data Skew and how it affects plans? Data skew is an uneven distribution of values in a column, where one or a few values occur far more often than the rest. To choose an efficient execution plan, PostgreSQL’s planner estimates row counts using statistics such as n_distinct, histograms, and most-common-values (MCV) frequencies stored in pg_statistic. If these estimates are off—especially if selectivity is overestimated—query performance can degrade significantly. Baseline Query and Observed Planner Behavior To gauge the performance impact of data skew with PGv17, we generated data into 3 tables relating to Accounts, Purchases and Audit. This baseline query selects a single random HIGH_RISK account and retrieves matching rows from a large audit_logs table via two materialized CTEs. Original Query with selected_account as materialized ( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1 ), audit_purchases as materialized ( select distinct p.account_id from purchases p join selected_account sa on p.account_id = sa.account_id ) select al.* from audit_logs al where exists ( select 1 from audit_purchases ap where al.account_id = ap.account_id); PostgreSQL 14 Execution Plan PostgreSQL 14 treats materialized CTEs as optimization fences and does not propagate detailed statistics into the parent query. As a result, the planner assumes relatively small CTE result sets and consistently favors nested loop joins with parameterized index scans. While these estimates are not strictly accurate, this conservative planning approach prevents full-table scans and avoids catastrophic performance under extreme data skew. In this workload, PostgreSQL 14 completes the query in approximately 7 ms. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ Nested Loop (cost=1412.94..212757485.16 rows=1500600064 width=27) (actual time=5.226..7.192 rows=30 loops=1) Buffers: shared hit=277 read=13 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.460..4.460 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.459..4.459 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.011..3.533 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> HashAggregate (cost=317.44..355.57 rows=3813 width=4) (actual time=4.735..4.741 rows=1 loops=1) Group Key: p.account_id Batches: 1 Memory Usage: 217kB Buffers: shared hit=274 read=1 -> Nested Loop (cost=0.44..292.52 rows=9966 width=4) (actual time=4.722..4.726 rows=10 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.461..4.461 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..192.84 rows=9966 width=4) (actual time=0.260..0.262 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> HashAggregate (cost=85.79..87.79 rows=200 width=4) (actual time=4.743..4.744 rows=1 loops=1) Group Key: ap.account_id Batches: 1 Memory Usage: 40kB Buffers: shared hit=274 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..76.26 rows=3813 width=4) (actual time=4.737..4.742 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..817780.34 rows=24600001 width=27) (actual time=0.480..2.438 rows=30 loops=1) Index Cond: (account_id = ap.account_id) Buffers: shared hit=3 read=12 Planning Time: 1.121 ms Execution Time: 7.253 ms (37 rows) PostgreSQL 17 Execution Plan PostgreSQL 17 correctly estimates that the audit_purchases CTE may contain thousands of rows and that joining it with audit_logs could produce tens of millions of matches. Based on these estimates, the planner selects a Hash Semi Join with a Sequential Scan on audit_logs, which is optimal when many rows are expected to match. However, at runtime the selected account produces only ~30 matching audit rows. Despite this, PostgreSQL must execute the chosen plan fully, scanning millions of rows from disk. This results in an execution time of 521,625 ms (~8.7 minutes). Importantly, this behavior is not a cost estimation bug—the estimates are statistically correct. The failure occurs because the PostgreSQL 17 planner estimates a large matching result set based on the cardinality of values produced by the audit_purchases CTE, and consequently chooses a plan optimized for that estimated volume rather than for the single runtime-selected account. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Hash Semi Join (cost=1691.60..93353890.16 rows=3001396992 width=26) (actual time=86577.858..521625.286 rows=30 loops=1) Hash Cond: (al.account_id = ap.account_id) Buffers: shared hit=28879 read=22040419 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.486..4.487 rows=1 loops=1) Buffers: shared hit=274 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.485..4.486 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=274 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.021..3.551 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> HashAggregate (cost=658.72..673.28 rows=1456 width=4) (actual time=4.907..4.912 rows=1 loops=1) Group Key: p.account_id Batches: 1 Memory Usage: 73kB Buffers: shared hit=277 read=1 -> Nested Loop (cost=0.44..606.86 rows=20742 width=4) (actual time=4.898..4.902 rows=10 loops=1) Buffers: shared hit=277 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.487..4.487 rows=1 loops=1) Buffers: shared hit=274 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=0.410..0.411 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> Seq Scan on audit_logs al (cost=0.00..52082989.92 rows=3001396992 width=26) (actual time=0.012..286755.571 rows=3001400050 loops=1) Buffers: shared hit=28602 read=22040418 -> Hash (cost=29.12..29.12 rows=1456 width=4) (actual time=4.919..4.920 rows=1 loops=1) Buckets: 2048 Batches: 1 Memory Usage: 17kB Buffers: shared hit=277 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..29.12 rows=1456 width=4) (actual time=4.908..4.911 rows=1 loops=1) Buffers: shared hit=277 read=1 Planning: Buffers: shared hit=197 Planning Time: 1.831 ms Execution Time: 521625.433 ms (38 rows) Forced nested loop Disabling hash joins at the session level forces PostgreSQL 17 to revert to a nested loop plan using index scans. This diagnostic step completes in ~8 ms, confirming that index-based execution is sufficient for the actual runtime workload. However, disabling planner features is not suitable for production use and serves only to validate the root cause. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=1677.62..3361930001.36 rows=3001396992 width=26) (actual time=5.513..8.194 rows=30 loops=1) Buffers: shared hit=277 read=12 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.524..4.524 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.523..4.523 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.014..3.608 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> HashAggregate (cost=658.72..673.28 rows=1456 width=4) (actual time=4.890..4.894 rows=1 loops=1) Group Key: p.account_id Batches: 1 Memory Usage: 73kB Buffers: shared hit=274 read=1 -> Nested Loop (cost=0.44..606.86 rows=20742 width=4) (actual time=4.882..4.885 rows=10 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.525..4.525 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=0.354..0.356 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> HashAggregate (cost=32.76..47.32 rows=1456 width=4) (actual time=4.896..4.899 rows=1 loops=1) Group Key: ap.account_id Batches: 1 Memory Usage: 73kB Buffers: shared hit=274 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..29.12 rows=1456 width=4) (actual time=4.892..4.895 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..1782455.82 rows=52656088 width=26) (actual time=0.615..3.283 rows=30 loops=1) Index Cond: (account_id = ap.account_id) Buffers: shared hit=3 read=11 Planning Time: 0.184 ms Execution Time: 8.252 ms (37 rows) Fix Strategies While the fix isn’t straightforward, we attempt to modify the query plan through query rewrite using following methods: Lateral Join with Offset 0 Limit on Purchases Using subqueries Lateral Join with Offset 0 Adding OFFSET 0 inside a LATERAL subquery introduces an optimizer barrier that prevents join reordering. This forces PostgreSQL to execute a parameterized nested loop, passing the runtime account ID into an index scan on audit_logs. With this rewrite, PostgreSQL 17 executes the query in 6.519 ms, a 99.9988% improvement over the original plan. with selected_account as materialized ( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1 ), audit_purchases as materialized ( select p.account_id from purchases p join selected_account sa on p.account_id = sa.account_id ) select al.* from audit_purchases ap join lateral ( select * from audit_logs al where al.account_id = ap.account_id offset 0) al on true; Execution plan QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=1578.44..42162640233.98 rows=1092192577296 width=26) (actual time=4.980..6.481 rows=300 loops=1) Buffers: shared hit=403 read=12 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.296..4.297 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.295..4.296 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.013..3.357 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> Nested Loop (cost=0.44..606.86 rows=20742 width=4) (actual time=4.585..4.588 rows=10 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.298..4.298 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=0.286..0.287 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..414.84 rows=20742 width=4) (actual time=4.587..4.591 rows=10 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..1506157.19 rows=52656088 width=26) (actual time=0.040..0.185 rows=30 loops=10) Index Cond: (account_id = ap.account_id) Buffers: shared hit=129 read=11 Planning: Buffers: shared hit=8 Planning Time: 0.238 ms Execution Time: 6.519 ms (31 rows) Limit on Purchases Applying LIMIT 1 to the audit_purchases CTE bounds its cardinality to a single row. This makes nested loops cheaper than hash joins during planning. PostgreSQL 17 selects an index‑driven execution plan and completes in 7.742 ms, yielding a 99.9985% improvement. with selected_account as materialized ( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1 ), audit_purchases as materialized ( select p.account_id from purchases p join selected_account sa on p.account_id = sa.account_id limit 1 ) select al.* from audit_logs al where exists (select 1 from audit_purchases ap where al.account_id=ap.account_id); Execution plan QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=972.07..933697.99 rows=24600001 width=27) (actual time=5.693..7.686 rows=30 loops=1) Buffers: shared hit=276 read=13 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.816..4.817 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.816..4.817 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.014..3.825 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> Limit (cost=0.44..0.47 rows=1 width=4) (actual time=5.133..5.134 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Nested Loop (cost=0.44..292.52 rows=9966 width=4) (actual time=5.132..5.132 rows=1 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.817..4.817 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..192.84 rows=9966 width=4) (actual time=0.313..0.313 rows=1 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> HashAggregate (cost=0.02..0.03 rows=1 width=4) (actual time=5.138..5.139 rows=1 loops=1) Group Key: ap.account_id Batches: 1 Memory Usage: 24kB Buffers: shared hit=274 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..0.02 rows=1 width=4) (actual time=5.135..5.135 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..686726.47 rows=24600001 width=27) (actual time=0.552..2.534 rows=30 loops=1) Index Cond: (account_id = ap.account_id) Buffers: shared hit=2 read=12 Planning Time: 0.216 ms Execution Time: 7.742 ms (35 rows) Using subqueries Rewriting the query using scalar subqueries converts the selected account ID into an InitPlan. The resulting value is then used as a runtime parameter in index scans against audit_logs. This eliminates join reordering opportunities and guarantees a parameterized access path. Both scalar subquery variants complete in approximately 7.3–7.6 ms, achieving over 99.9986% improvement. Subquery Example 1 select * from audit_logs al where al.account_id in (select p.account_id from purchases p where p.account_id=( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1)); Execution plan QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop Semi Join (cost=972.02..1302907.98 rows=24600001 width=27) (actual time=5.257..7.260 rows=30 loops=1) Buffers: shared hit=277 read=12 InitPlan 1 (returns $0) -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.612..4.613 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.611..4.611 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.012..3.674 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..686755.98 rows=24600001 width=27) (actual time=5.067..7.059 rows=30 loops=1) Index Cond: (account_id = $0) Buffers: shared hit=274 read=11 -> Materialize (cost=0.44..242.67 rows=9966 width=4) (actual time=0.006..0.006 rows=1 loops=30) Buffers: shared hit=3 read=1 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..192.84 rows=9966 width=4) (actual time=0.187..0.188 rows=1 loops=1) Index Cond: (account_id = $0) Heap Fetches: 0 Buffers: shared hit=3 read=1 Planning Time: 0.160 ms Execution Time: 7.286 ms (24 rows) Subquery Example 2 select * from audit_logs al where al.account_id=(select distinct p.account_id from purchases p where p.account_id=( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1)); Execution plan QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ Index Scan using idx_account_audit on audit_logs al (cost=972.04..1507128.65 rows=52656088 width=26) (actual time=5.510..7.637 rows=30 loops=1) Index Cond: (account_id = (InitPlan 2).col1) Buffers: shared hit=277 read=12 InitPlan 2 -> Limit (cost=971.44..971.46 rows=1 width=4) (actual time=4.901..4.902 rows=1 loops=1) Buffers: shared hit=274 read=1 InitPlan 1 -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.455..4.456 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.455..4.455 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.018..3.522 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=4.900..4.900 rows=1 loops=1) Index Cond: (account_id = (InitPlan 1).col1) Heap Fetches: 0 Buffers: shared hit=274 read=1 Planning: Buffers: shared hit=4 Planning Time: 0.154 ms Execution Time: 7.663 ms (25 rows) Query execution summary The table below summarizes execution time improvements across all tested rewrites relative to the original PostgreSQL 17 and 14 query execution plans. Setup / Rewrite Execution Time (ms) Speed‑up Factor (×) PG17 original (Hash Semi Join + Seq Scan) 521,625.433 1.0× PG14 original query shape 7.253 71,919× Forced nested loop (hashjoin off – diagnostic) 8.252 63,212× LATERAL JOIN + OFFSET 0 6.519 80,016× LIMIT on purchases 7.742 67,376× Subquery – IN with InitPlan (Example 1) 7.286 71,593× Subquery – scalar subquery (Example 2) 7.663 68,071× Key Takeaways While PostgreSQL 17’s planner improvements are correct, intentional, and beneficial for most workloads. However, queries that combine CTEs, runtime randomness, and heavily skewed data can cause the planner to select globally optimal plans that are locally inefficient. When runtime values drastically narrow result sets, query rewrites that force parameterized index scans—such as LATERAL joins, scalar subqueries, or bounded CTEs—provide predictable and stable performance.273Views0likes0CommentsCascading Read Replicas Now Generally Available!
We’re excited to announce the General Availability of cascading read replicas in Azure Database for PostgreSQL. This capability allows you to create read replicas for your Azure Database for PostgreSQL instance not only from a primary server, but also from existing read replicas, enabling multi‑level replication chains. Coordinating read‑heavy database workloads across multiple regions can be challenging, especially when you’re trying to deliver low‑latency read response experiences to users spread across different geographic locations. One effective way to address this is by placing read replicas closer to where your users are, allowing applications to serve read requests with significantly reduced latency and improved performance. What are cascading read replicas? With cascading read replicas, you can scale read‑intensive workloads more effectively, distribute read traffic efficiently, and support advanced deployment topologies such as globally distributed applications. Each read replica can act as a source for additional replicas, forming a tree‑like replication structure. For example, if your primary server is deployed in one region, you can create direct replicas in nearby regions and then cascade additional replicas to more distant locations. This approach helps spread read traffic evenly while minimizing latency for users around the world. We support up to 2 levels of replication with this feature. Level 1 will be all the read replicas and level 2 will be cascading read replicas. Why use cascading read replicas? Improved scalability Cascading read replicas support multi‑level replication, making it easier to handle high volumes of read traffic without overloading a single instance by scaling up to 30 read replicas. Geographic distribution By placing replicas closer to your global user base, you can significantly reduce read latency and deliver faster, more responsive application experiences. Efficient read traffic distribution Distributing read workloads across multiple replicas helps balance load, improving overall performance and reliability. Additionally, cascading read replicas offer operational flexibility. If you observe replication lag, you can perform a switchover operation between a cascading read replica with its source or intermediate replica, helping you maintain optimal performance and availability for your replicas. How does replication work with cascading read replicas? The primary server acts as a source for the read replica. Data is asynchronously replicated to these replicas. When we add cascading replicas, the previous replicas act as a data source for replication. In the diagram above, “primary-production-server” is the primary server with three read replicas. One of these replicas, “readreplica01”, serves as the source for another read replica, “readreplica11” which is a cascading read replica. With cascading read replicas, you can add up to five read replicas per source and replicate data across two levels, as shown in the diagram. This allows you to create up to 30 read replicas in total five read replicas directly from the primary server, and up to 25 additional replicas at the second level (each second-level replica can have up to five read replicas). If you notice replication lag between an intermediate read replica and a cascading read replica, you can use a switchover operation to swap “readreplica01” and “readreplica11”, helping reduce the impact of lag. To learn more about cascading read replicas, please refer to our documentation: Cascading read replicas Deploying cascading read replicas on Azure portal Navigate to the “Replication” tab and then click on “Create replica” highlighted in red as shown below: After creating a read replica as the below screenshot shows that you have 1 read replica that is attached to the primary instance. Click on the created replica and navigate to the replication tab, source server is “read-replica-01” and we will be creating a cascading read replica under this. Once cascading read replica is created you can see the role of “read-replica-01” has now changed to Source, Replica. You can perform site swap operation by clicking on the promote button for cascading read replica. Deploy cascading read replica with terraform: Before you start, make sure you have: An existing primary PostgreSQL Flexible Server At least one read replica already created from the primary AzureRM provider with latest version Proper permissions on the Azure subscription and resource group Configure the AzureRM Provider: Start by configuring the AzureRM provider in your Terraform project. terraform { required_providers { azurerm = { source = "hashicorp/azurerm" version = "~> 3.80" } } } provider "azurerm" { features {} } Reference the existing read replica server using the data block to reference the replica server. data "azurerm_postgresql_flexible_server" "source_replica" { name = "my-read-replica-1" resource_group_name = "my-resource-group" } Now create a new PostgreSQL Flexible Server and point it to the replica using create_source_server_id. resource "azurerm_postgresql_flexible_server" "cascading_replica" { name = "my-cascading-replica" resource_group_name = "my-resource-group" location = data.azurerm_postgresql_flexible_server.source_replica.location version = data.azurerm_postgresql_flexible_server.source_replica.version delegated_subnet_id = data.azurerm_postgresql_flexible_server.source_replica.delegated_subnet_id private_dns_zone_id = data.azurerm_postgresql_flexible_server.source_replica.private_dns_zone_id create_mode = "Replica" create_source_server_id = data.azurerm_postgresql_flexible_server.source_replica.id storage_mb = 32768 sku_name = "Standard_D4s_v3" depends_on = [ data.azurerm_postgresql_flexible_server.source_replica ] } Apply the Terraform Configuration terraform init terraform plan terraform apply Key Considerations Cascading read replicas allow for up to 5 read replicas and two levels of replication. Creating cascading read replicas is supported in PostgreSQL version 14 and above. Promote operation is not supported for intermediate read replicas with cascading read replicas. Conclusion Cascading read replicas in Azure Database for PostgreSQL offer a scalable way to distribute your read traffic across the same and different regions, reducing the read workload on primary database. For globally distributed applications, this can improve read latency as well as resilience and performance. This design supports horizontal scaling as your application demand grows, ensuring you can handle a high volume of read requests without compromising speed. Get started with this feature today and scale your read workloads.337Views1like0CommentsHandling Unique Constraint Conflicts in Logical Replication
Authors: Ashutosh Sharma, Senior Software Engineer, and Gauri Kasar, Product Manager Logical replication can keep your PostgreSQL environments in sync, helping replicate selected tables with minimal impact on the primary workload. But what happens when your subscriber hits a duplicate key error and replication grinds to a halt? If you’ve seen a unique‑constraint violation while replicating between Azure Database for PostgreSQL servers, you’re not alone. This blog covers common causes, prevention tips, and practical recovery options. In PostgreSQL logical replication, the subscriber can fail with a unique-constraint error when it tries to apply a change that would create a duplicate key. duplicate key value violates unique constraint Understanding why this happens? When an INSERT or UPDATE would create a value that already exists in a column (or set of columns) protected by a UNIQUE constraint (including a PRIMARY KEY). In logical replication, this most commonly occurs because of local writes on the subscriber or if the table is being subscribed from multiple publishers. These conflicts are resolved on the subscriber side. Local writes on the subscriber: a row with the same primary key/unique key is inserted on the subscriber before the apply worker processes the corresponding change from the publisher. Multi-origin / multi-master without conflict-free keys: two origins generate (or replicate) the same unique key. Initial data synchronization issues: the subscriber already contains data when the subscription is created with initial copy enabled, resulting in duplicate inserts during the initial table sync. How to avoid this? Avoid local writes on subscribed tables (treat the subscriber as read-only for replicated relations). Avoid subscribing to the same table from multiple publishers unless you have explicit conflict handling and a conflict-free key design. Enabling server logs can help you identify and troubleshoot unique‑constraint conflicts more effectively. Refer to the official documentation to configure and access PostgreSQL logs. How to handle conflicts (recovery options) Option 1: Delete the conflicting row on the subscriber Use the subscriber logs to identify the key (or row) causing the conflict, then delete the row on the subscriber with a DELETE statement. Resume apply and repeat if more conflicts appear. Option 2: Use conflict logs and skip the conflicting transaction (PostgreSQL 17+) Starting with PostgreSQL 17, logical replication provides detailed conflict logging on the subscriber, making it easier to understand why replication stopped and which transaction caused the failure. When a replicated INSERT would violate a non‑deferrable unique constraint on the subscriber for example, when a row with the same key already exists the apply worker detects this as an insert_exists conflict and stops replication. In this case, PostgreSQL logs the conflict along with the transaction’s finish LSN, which uniquely identifies the failing transaction. ERROR: conflict detected on relation "public.t2": conflict=insert_exists ... in transaction 754, finished at 0/034F4090 ALTER SUBSCRIPTION <subscription_name> SKIP (lsn = '0/034F4090'); Option 3: Rebuild (re-sync) the table Rebuilding (re‑syncing) a table is the safest and most deterministic way to resolve logical replication conflicts caused by pre‑existing data differences or local writes on the subscriber. This approach is especially useful when a table repeatedly fails with unique‑constraint violations and it is unclear which rows are out of sync. Step 1 (subscriber): Disable the subscription. ALTER SUBSCRIPTION <subscription_name> DISABLE; Step 2 (subscriber): Remove the local copy of the table so it can be re-copied. TRUNCATE TABLE <conflicting_table>; Step 3 (publisher): Ensure the publication will (re)send the table (one approach is to recreate the publication entry for that table). ALTER PUBLICATION <pub_with_conflicting_table> DROP TABLE <conflicting_table>; CREATE PUBLICATION <pub_with_conflicting_table_rebuild> FOR TABLE <conflicting_table>; Step 4 (subscriber): Create a new subscription (or refresh the existing one) to re-copy the table. CREATE SUBSCRIPTION <sub_rebuild> CONNECTION '<connection_string>' PUBLICATION <pub_with_conflicting_table_rebuild>; Step 5 (subscriber): Re-enable the original subscription (if applicable). ALTER SUBSCRIPTION <subscription_name> ENABLE; Conclusion In most cases, these conflicts occur due to local changes on the subscriber or differences in data that existed before logical replication was fully synchronized. It is recommended to avoid direct modifications on subscribed tables and ensure that the replication setup is properly planned, especially when working with tables that have unique constraints.207Views1like0Comments