postgres
108 TopicsAnnouncing new security, maintenance and analytics features for PostgreSQL at Microsoft Build 2026
At Microsoft Build 2026, we’re announcing a major wave of PostgreSQL innovation across Azure. Alongside the public preview of Azure HorizonDB, we’re delivering a broad set of enhancements for our fully managed open-source PostgreSQL service: Azure Database for PostgreSQL flexible server. These updates span performance, analytics, security, operations, resilience and migration - helping you build faster, operate with more control, secure your workloads, and modernize with confidence. Here’s a quick tour of the top flexible server announcements at Build 2026. Feature Highlights V6 SKU with local SSD storage (NVMe) pg_duckdb Extension pg_ivm Extension Defender Security assessments temporal_tables Extension Cross-tenant CMK Automatic Entra token refresh libraries New Powershell module: Az.PostgreSQLFlexibleServer More control over planned maintenance Pre-Upgrade validation checks New Built-in Grafana dashboards Chaos Studio supports Azure Database for PostgreSQL AI-assisted Oracle to PostgreSQL migration Migration Service for Azure Database for PostgreSQL improvements (EDB, AlloyDB) Performance, Scale & Analytics V6 SKU with local SSD storage (NVMe) Generally Available by the end of June V6 Compute SKUs are built to handle your largest workloads, delivering high performance, massive scale, and better price performance. Powered by 5th Gen Intel® Xeon® processor and AMD's fourth Generation EPYC™ 9004 processors you can scale up to 192 vCores and 1.8 TiB of memory. With NVMe-backed local SSD storage and support for high-performance storage options such as Premium SSD v2, you can achieve high IOPS and throughput for demanding, IO-intensive PostgreSQL workloads. The Intel & AMD v6 SKUs with local SSD (NVMe) will be Generally Available by the end of June. Learn more: Compute options in Azure Database for PostgreSQL. pg_duckdb Extension Generally Available The pg_duckdb extension enables you to accelerate high-performance analytics and data-intensive applications with DuckDB’s SQL engine running inside your Postgres server. We’re pleased to announce pg_duckdb is now generally available in Azure Database for PostgreSQL. The latest version builds on the preview with the latest DuckDB engine improvements and optimized performance. This version adds vectorized execution for faster analytical queries, delivering significant improvements in aggregation performance, along with new support for writing to Azure Blob Storage and querying Parquet data directly from PostgreSQL. These capabilities enable high-performance analytics on your external data and simplify data processing workflows. Learn more: pg_duckdb. pg_ivm Extension Generally Available Materialized views are a useful way to optimize performance for queries that run regularly, but if underlying data becomes stale the result set needs to be recomputed. With the pg_ivm extension you can automatically maintain materialized views as the underlying data changes. This is particularly valuable for large datasets with small incremental changes that need real-time freshness, like dashboards, catalog analytics and SaaS usage reporting. We are pleased to announce the pg_ivm extension is now generally available in Azure Database for PostgreSQL. Learn more: pg_ivm. Security, Auditing & Identity Defender security assessments Preview Microsoft Defender Security Assessments for Azure Database for PostgreSQL enables continuous evaluation of your database security posture, helping identify vulnerabilities and misconfigurations across server and database configurations. Previously limited to reactive threat detection, in the latest preview release, Defender now provides proactive, risk-based insights through assessments tailored to PostgreSQL-specific best practices, delivering more relevant and actionable guidance. This helps you strengthen your security baseline, prioritize remediation, and align with best practices and compliance requirements. Learn more: https://aka.ms/Defender-Assessments-for-PG-Preview temporal_tables Extension Generally Available We’ve had many customer requests to support the temporal_tables extension, which provides built-in support for tracking and querying historical changes to data over time. Temporal tables are now generally available in Azure Database for PostgreSQL. With this extension enabled you can easily perform time-based queries, audit data changes, and maintain historical records without building custom tracking logic, simplifying application development and compliance scenarios. Learn more: temporal_tables Cross-tenant CMK Preview Azure Database for PostgreSQL now supports cross-tenant customer-managed keys (CMK) in public preview, allowing you to encrypt your data at rest using an Azure Key Vault key that resides in a separate Microsoft Entra tenant from the database service. This feature is designed for SaaS providers and enterprises that need to maintain strict separation of duties and ownership of encryption keys, enabling you to retain full control over key lifecycle management while PostgreSQL runs in a service provider’s tenant. Learn more: Data encryption at rest in Azure Database for PostgreSQL Automatic Entra token refresh libraries Preview We’re making it easier to use Entra ID authentication with Azure Database for PostgreSQL throughout the application stack by introducing new token refresh libraries for .NET, JavaScript, and Python. With Entra ID, access tokens are short-lived which can make managing their lifecycle complex in real-world applications. Developers need to be aware of token refresh and build additional handling around token expiration, connection retry, and session continuity. These new libraries remove that friction. By handling Entra token refresh seamlessly in the background, they allow applications to stay connected without interruption and with no custom logic required. The result is a simpler development experience and more resilient applications, especially for long-running or connection-heavy workloads. Across languages, the libraries provide a consistent and streamlined way to adopt secure, passwordless authentication, helping teams focus more on building their applications and less on managing authentication. Learn more: .NET, JavaScript, and Python. Operations, Maintenance & Monitoring New Powershell module: Az.PostgreSQLFlexibleServer Generally Available We’re excited to introduce the newly renamed Az.PostgreSQLFlexibleServer PowerShell module, delivering a streamlined experience for managing Azure Database for PostgreSQL with PowerShell. Building on the capabilities of the previous Az.PostgreSql module, the updated module aligns with the new features in the 2026-01-01 preview REST API. This module brings support for PostgreSQL 18, elastic clusters for scalable workloads and a range of enhancements designed to simplify management and improve performance. Whether you're provisioning new deployments or managing complex environments, this module ensures you can take full advantage of the latest platform capabilities directly from PowerShell. To learn more, visit our official documentation on PowerShell: Az.PostgreSql Module | Microsoft Learn More control over planned maintenance Generally Available We’ve seen many requests to provide more control when a maintenance update is applied to Azure Database for PostgreSQL. Sometimes when a critical workload is running you want to apply the maintenance when you’re ready. Announcing general availability this week, we’re building on the existing System and Custom maintenance window options and adding new self-service maintenance capabilities to the Azure portal. You can now reschedule upcoming maintenance updates for up to two weeks and apply maintenance on demand at a time that suits you. You can also view scheduled maintenance and review your server’s maintenance history after updates are complete. These options help you better align maintenance with your business schedules, reduce disruption during critical workload periods, and minimize the need for support-driven deferral requests. CLI and API support are coming soon. Learn more: https://aka.ms/azure-postgres-reschedule-maintenance Pre-Upgrade validation checks Preview Major version upgrades are critical for staying current with PostgreSQL features, security updates, and performance improvements, but you often discover blockers only after starting the upgrade workflow. Pre-Upgrade Validation Checks lets you validate upgrade readiness before initiating the actual upgrade by running Azure-specific upgrade checks and PostgreSQL pg_upgrade --check validations independently. The shift is simple: you can identify and fix upgrade blockers before the upgrade window begins. The feature surfaces actionable issues across configurations, extensions, dependencies, replication slots, event triggers, and other upgrade-sensitive objects. You can fix blockers, re-run validation until all checks pass, and proceed with the upgrade with greater predictability. Learn more: https://aka.ms/pg-flex-upgrade-checks New Built-in Grafana dashboards Generally Available boards — no setup, no extra cost, and no separate service to manage. Grafana dashboards are now built directly into the Azure portal for Azure Database for PostgreSQL - no setup, no extra cost, and no separate service to manage. You can open your PostgreSQL resource in the portal and immediately access prebuilt dashboards for key health and performance signals such as CPU, memory, storage, IOPS, connections, transactions, and availability. The key value is metrics + logs in one place. You can quickly correlate performance spikes with PostgreSQL logs, understand what changed, and troubleshoot faster using the familiar Grafana experience. Dashboards can also be customized, saved to your subscription, and shared across teams for ongoing operations. Learn more: https://aka.ms/azure-postgres-dashboards-grafana Resilience & Business Continuity Chaos Studio supports Azure Database for PostgreSQL Preview No matter how much you prepare, you only really know how good your database disaster recovery plan is when something breaks. With Chaos Studio support for Azure Database for PostgreSQL, you can simulate zone-down scenarios on PostgreSQL HA-enabled instances and validate the resilience of your mission-critical workloads. With Chaos Studio integration, you can proactively test failover behavior and gain confidence in how your applications respond to real-world zonal failures. This feature is currently available through a gated private preview. To get started, submit your subscription details using the form. Once reviewed, our team will enable the feature for your subscription, with guidance to help you begin testing. Getting started is simple: Create a Chaos Studio workspace via the Chaos Studio portal and configure your subscription, resource group, and region. Define the scope and assign the required managed identity and permissions. Review and verify your workspace setup. Browse available scenarios and select the PostgreSQL zone-down scenario. Configure the test (name, duration), then run it from My Library to begin validating failover behavior. With just a few steps, you’ll be able to simulate real-world failure conditions and gain confidence in your application’s resilience. To get started, please submit your details using this link: Private Preview Support for Chaos Studio Migration & Modernization AI-assisted Oracle to PostgreSQL migration Generally Available AI-assisted migration tooling has dramatically lowered the bar for moving between different databases and is changing the way people look at the return on investment for migration. The VS Code PostgreSQL extension comes with AI-Assisted migration tooling which converts Oracle schema and application code to Azure Database for PostgreSQL. This tooling uses GitHub Copilot, Microsoft Foundry, and custom Language Model tools to convert Oracle schema, database code and client applications into the PostgreSQL equivalents, and validates every change against a running flexible server instance. Learn more: Schema conversion, App conversion. Migration Service for Azure Database for PostgreSQL improvements (EDB, AlloyDB) Generally Available We’ve added AlloyDB and EDB Extended Server as new sources for migrating to PostgreSQL in the Azure Database for PostgreSQL Migration Service, with support for both online and offline migration support. Learn more: Migrate from AlloyDB, Migrate from EDB. Looking ahead That wraps up the Build 2026 announcements for Azure Database for PostgreSQL flexible server. There are also many great PostgreSQL technical sessions at Build this week, covering cloud-native app and AI development. To find out more, here's a link to the Build session catalog for PostgreSQL sessions: https://aka.ms/Postgres-on-Azure_Build-2026. We'll continue to build out our roadmap over the coming months to deliver on your asks to improve the performance, security and stability of your PostgreSQL workloads. Check the Microsoft Blog for PostgreSQL for a regular monthly recap where we share the latest enhancements and product updates.378Views2likes0CommentsSELECT * FROM build2026_sessions WHERE postgres = true;
Microsoft Build 2026 is around the corner, and this year it’s shaping up to be a big one for PostgreSQL experts and enthusiasts. If you’re a developer working with Postgres, or just love exploring new database technology, there's plenty to get excited about. Microsoft’s new cloud-first evolution of PostgreSQL, Azure HorizonDB, alongside sessions featuring Azure Database for PostgreSQL, will highlight how Postgres is powering the next wave of AI-driven applications. A new horizon in Postgres Build 2026 arrives at a time when the role of databases in modern apps is evolving rapidly. From enabling AI model integration to scaling seamlessly across the cloud, PostgreSQL developers today are dealing with more complex demands than ever. That’s why Azure HorizonDB – Microsoft’s new cloud-native PostgreSQL service – is generating so much buzz ahead of Build. What is Azure HorizonDB? In short, it’s a reimagined version of PostgreSQL designed for cloud-scale performance and AI-era workloads. Azure HorizonDB, introduces a distributed architecture that decouples compute and storage, delivering sub-millisecond latencies and three times the throughput of self-managed Postgres at massive scale. It aims to preserve Postgres’s beloved features and SQL ecosystem while adding next-generation capabilities: built-in vector indexing for high-speed AI/ML retrieval, the ability to run AI models and vector operations directly in the database, and multi-zone replication for resilience. For Postgres developers, this means less time stitching together external data stores or machine learning services – and more time building powerful apps on a unified platform that’s both familiar and built for the future. The bottom line: Microsoft Build 2026 is an ideal opportunity for developers to see Azure HorizonDB in action, learn best practices for modern PostgreSQL architectures, and understand how to leverage Postgres in new scenarios like generative AI and multi-agent applications. Read on for a rundown of sessions covering these topics, complete with what you’ll learn from each one. Top sessions for PostgreSQL databases on Azure Below are key sessions tailored for PostgreSQL users and those interested in Azure HorizonDB, with session types and highlights of what you’ll gain by attending. 🎤 Breakout: From Rows to Reasoning: Designing Databases for AI Apps and Agents (BRK223, 45 min, in-person and digital options) Discover how to architect databases that can power tomorrow’s intelligent applications. This technical breakout will show how AI-ready databases can move beyond plain transactions. You’ll see live demos of integrating transactional, analytical, and vector data in one unified platform, with Azure’s new database capabilities, including Azure HorizonDB. Learn how to simplify your stack by eliminating separate analytics engines or vector stores. The session will highlight patterns that reduce data movement and latency so your apps can efficiently reason over live data with minimal complexity. 🧪 Hands-on lab: Create Advanced Postgres-Powered Agentic Apps with Azure HorizonDB (LAB511, 75 min, in person and digital options) Roll up your sleeves and get hands-on building a real multi-agent AI application with Postgres. In this advanced lab, you’ll create a production-ready AI agent powered by Azure HorizonDB as an all-in-one data, search, and intelligence layer. Experiment with retrieval-augmented generation (RAG) by combining semantic vector search (DiskANN) with traditional SQL queries right inside the database. Implement hybrid search and agent workflows without resorting to external vector databases or glue code – thanks to HorizonDB’s built-in vector indexing and in-database AI model capabilities. This lab is perfect for developers who want to experience how HorizonDB can simplify your stack and boost performance for AI-driven apps. Multiple hands-on labs are offered to suite your schedule. 💻 Demo: Simplify App Dev with Cloud-Native PostgreSQL in Azure HorizonDB (DEM364, 25 min, in-person and digital options) See how to cut your development time and complexity with built-in AI and search features in Postgres. This fast-paced demo shows how Azure HorizonDB helps eliminate the need for separate search engines and AI services by pulling those capabilities straight into PostgreSQL. Expect to learn how you can run hybrid vector + keyword queries using SQL, integrate AI models directly from within the database, and apply full-text search (BM25) and semantic ranking to get smarter results. If you’re eager to deliver intelligent apps faster, with fewer moving parts, this session will show how HorizonDB simplifies your architecture without sacrificing performance. ⚡Lightning Talk: Cloud-Native PostgreSQL, Rebuilt for Scale: Azure HorizonDB (LTG413, 15 min, in-person only) Get a rapid-fire introduction to the architecture behind HorizonDB’s eye-popping performance. This short talk dives into how HorizonDB re-architects core PostgreSQL to deliver effortless scale out and blazing speed. Learn how decoupled compute and storage, predictive caching, and multi-region replication combine to achieve sub-millisecond query latencies and 3× higher throughput than standard Postgres. If you care about performance tuning and high-scale database design, don’t miss this quick primer on the tech under HorizonDB’s hood. 👥 Interactive Table Talk: Scaling PostgreSQL for AI Apps: Patterns and Tradeoffs (TT622, 45 min, in-person only) Bring your questions and ideas to this collaborative discussion. In this open round-table session with community and Microsoft experts, you’ll explore architecture patterns for scaling PostgreSQL to meet the demands of agent-based and AI-driven applications. Discuss real-world strategies for handling vector embeddings in Postgres, unifying relational and document data, integrating with AI models, and more. Compare the trade-offs between different scaling approaches – from monolithic to microservices, sharding strategies, and new technologies like HorizonDB – and learn where each design shines or struggles in production. Come ready to share your experiences and learn from others in the room. ▶️ On-Demand: Smarter PostgreSQL Migrations to Power Modern, Intelligent Apps (OD822, 30 min, digital only) Planning to migrate to Postgres or move your databases to Azure? Start here. This on-demand session focuses on new tools and proven strategies to migrate large-scale databases to Azure Database for PostgreSQL quickly and safely. You’ll see AI-assisted migration tools in action that minimize downtime and risk when moving terabytes of data. Just as importantly, you’ll learn how migrating to Azure unlocks advanced capabilities – from boosted performance and enhanced security to AI-ready features – helping you turn your newly migrated data into intelligent apps and services. On-demand session will be available to stream on the first day of Build. Meet the team: PostgreSQL expert meetups If you’re attending Build in person, stop by the Expert Meetup (EMU) area and head to the relational cloud databases booth. This is your chance to talk directly with the engineers and product teams behind PostgreSQL on Azure. Bring your questions about architecture decisions, scaling patterns, migrations, AI workloads, or anything else on your mind. Whether you want to sanity-check a design, dig deeper into something you saw in a session, or give direct feedback, the EMU space is designed for exactly that convo. How to get the most out of Build (and what to do next) With so much great content lined up, how do you decide where to start? It really depends on what you’re most excited about: Curious about AI and agentic apps: Start with From Rows to Reasoning, then go deeper with the Simplify App Dev with HorizonDB demo or get hands-on at the Azure HorizonDB labs to see how these ideas work in practice. Performance and scale are your focus: The short Lightning Talk on HorizonDB’s cloud-native architecture and the Table Talk on scaling Postgres will both provide unique insights and pro tips for running Postgres at massive scale. Planning a migration to PostgreSQL on Azure: Watch the Smarter PostgreSQL Migrations on-demand session to learn how to migrate large workloads with minimal downtime, and the benefits you can unlock after moving to Azure. Looking for real answers to your specific questions: Make time for the PostgreSQL Expert Meetup area to connect directly with the team. No matter which sessions you choose, Build 2026 promises to be an exciting event for the PostgreSQL developer community. Browse the session catalog, save the sessions that match your interests, and we’ll see you at Build.482Views2likes0CommentsAzure HorizonDB: Enterprise-Ready Postgres, Engineered for the AI Era
Affan Dar, Vice President of Engineering, PostgreSQL at Microsoft Charles Feddersen, Partner Director of Program Management, PostgreSQL at Microsoft Today at Microsoft Build, we’re pleased to announce the public preview of Azure HorizonDB, a new enterprise-ready Postgres-compatible database service designed to meet the needs of modern AI applications, alongside a set of enhancements to our PostgreSQL tooling in Visual Studio Code to further streamline the developer experience. Postgres is rapidly solidifying its role as a foundational layer in modern data architectures, with accelerating adoption across industries. For developers, it has become the preferred platform for new application development, driven by its extensible architecture, mature extension ecosystem, and adherence to open standards and APIs. At the same time, enterprises are choosing Postgres to re-platform and modernize existing systems, taking advantage of its ability to support a broad range of operational workloads while enabling advanced capabilities such as vector-based data access all within a single, interoperable platform. A Postgres Platform Grounded in Security, Resilience, Scale, and Performance Azure HorizonDB is purpose-built to meet these demands, combining the flexibility developers expect from Postgres with the operational rigor enterprises require. It extends the core Postgres engine with cloud-native capabilities such as integrated identity, fine-grained network and security controls, and seamless lifecycle management, while preserving full compatibility with the open ecosystem of extensions and tools. At the same time, HorizonDB introduces advanced, natively integrated capabilities like vector data support and AI model management, enabling new classes of intelligent applications without sacrificing transactional integrity or developer productivity. These capabilities are backed by a platform designed for enterprise performance and scale. HorizonDB supports databases up to 128 TB, scales out with up to 15 read replicas for high-throughput workloads, and delivers sub-millisecond commit latency across availability zones for low-latency transactions and high availability. This combination is critical for modern applications that require consistent performance under load, including high-concurrency transactional systems, real-time AI-driven interactions, and globally distributed services. The result is a unified platform that scales from the first line of code to globally distributed, mission-critical systems. Enterprise adoption ultimately depends on trust in the platform itself. Azure HorizonDB delivers this with native integration into Microsoft Entra ID for centralized identity and access control, private endpoints for network isolation, and built-in encryption to protect data at rest and in transit. These capabilities are essential for meeting compliance requirements and enabling organizations to run mission-critical workloads with confidence, without added complexity. This foundation is critical for any application, but it becomes indispensable for AI, where secure access to data and controlled model interaction underpin every intelligent experience. Building on this, HorizonDB introduces a set of integrated AI capabilities designed to bring intelligence directly into the database. Run Fast, Memory-Efficient Vector Search with DiskANN HorizonDB brings high-performance vector search directly into Postgres through DiskANN with spherical quantization. This enables efficient, low-latency similarity search at scale while significantly reducing memory and storage overhead. Spherical quantization works by normalizing vectors and encoding them into compact representations that preserve angular distance, allowing the system to compare vectors efficiently with minimal loss in accuracy. The result is the ability to index and query large embedding datasets within the transactional engine itself, making vector search a first-class capability rather than an external dependency. "HorizonDB is compelling because it brings a PostgreSQL-compatible foundation, AI-native capabilities and enterprise-grade controls closer to the operational data layer." Jennings Balavari, Founder, Opsen AI Build Smarter Apps with Hybrid Search in Postgres HorizonDB supports hybrid search by combining vector similarity through pgvector with full-text search enabled via the pg_textsearch extension, allowing applications to match both semantic meaning and precise keyword relevance in a single query. This enables more accurate, context-aware results, such as blending intent-driven retrieval with exact term matching for search, recommendations, or RAG scenarios. By unifying these capabilities within Postgres, HorizonDB improves result quality while simplifying application design without the need for external search systems. Operationalize AI with Built-In AI Model Management Working with vectors requires models to generate, interpret, and evolve embeddings, making model lifecycle a core part of the application stack. HorizonDB introduces integrated AI model management to simplify how models are registered, versioned, and governed alongside data, including built-in support for generative GPT models and ranking models. For example, GPT models can be used to generate summaries, responses, or structured outputs directly from application data, while ranking models enable relevance scoring for search results or recommendations over vector results. By managing these models alongside the data they operate on, HorizonDB ensures consistency, traceability, and control, creating a unified environment where models and data evolve together. “As we build a multi-tenant, AI-driven commerce platform, HorizonDB has been particularly compelling in two areas: scale and how close AI capabilities are to the data itself. Running vector search, filtering, and model-driven workflows directly inside the database removes a lot of the complexity we’d normally manage across separate services." James Frawley, CIAO, ReFiBuy Bring AI into SQL with AI Functions With models managed in place, AI Functions provide a direct way to invoke them from within SQL and application logic. These functions are implemented through the azure_ai extension, which brings model invocation directly into the Postgres engine. This allows developers to embed inference into queries and transactions, eliminating the need for external orchestration. By bringing model execution closer to the data, AI Functions reduce latency, simplify application design, and make intelligent behavior a natural extension of existing Postgres workloads. "What stood out with HorizonDB is that it aligns closely with how we already think about the problem. Instead of stitching together multiple components, it brings transactional data, vector search, and AI capabilities into a single platform, which simplifies the architecture without forcing a complete rethink." Mohsin Shafqat, Director Software Engineering, Nasdaq Run Reliable, Event-Driven Workflows with AI Pipelines Finally, AI Pipelines operationalize these capabilities through reliable, event-driven workflows for model execution and data processing. Pipelines execute on data changes, enabling real-time asynchronous reactions without external orchestration and ensuring consistent, repeatable behavior as data evolves. Combined with model management and AI Functions, they turn embedded intelligence into something that can be run, scaled, and trusted in production, while inheriting the database’s high availability and failover characteristics for resilience. Pipelines can also be visualized and observed in real time through the Visual Studio Code extension for PostgreSQL, giving developers and operators immediate visibility into execution flow, state, and outcomes Modern Unified Experience for Data, AI, and Operations in VS Code As intelligence becomes a core part of the data platform, the developer and operator experience becomes equally critical. HorizonDB extends seamlessly into Visual Studio Code with enhanced PostgreSQL tooling that works across any Postgres deployment, not just HorizonDB. Features like AI-assisted query plans and integrated monitoring enable faster debugging and optimization, helping teams understand both database performance and AI-driven behaviors. At the same time, for Azure-based deployments, the experience is deeply integrated with platform capabilities, enabling management of networking configuration, server parameters, and server logs directly from the development environment, streamlining operations across application and infrastructure layers. Azure HorizonDB brings together enterprise-grade security, deep Postgres compatibility, and a modern AI-native data platform, all engineered for developers. It scales efficiently across workloads, from transactional systems to intelligent applications, while delivering a world-class, Azure-integrated experience in Visual Studio Code for both developers and operators. Ready to get started with Azure HorizonDB? Azure HorizonDB is now available in public preview in Australia East, Central US, Sweden Central, West US 2, and West US 3 regions. Additionally, East US, Canada Central, Indonesia Central, Italy North, Japan East, Korea Central, and Poland Central will be available in the coming weeks. You can get started today by creating a new HorizonDB instance using the Azure portal, API’s, or the Visual Studio Code extension for PostgreSQL to begin exploring these capabilities firsthand. To learn more, dive deeper into our documentation and sign-up today to try AI model management in a limited preview.2.3KViews7likes0CommentsDemystifying LWLock: MultiXact SLRU Wait Events in Azure Database for PostgreSQL
What Is the MultiXact Subsystem? Before diving into the wait events, it helps to understand what PostgreSQL is protecting. When a single transaction modifies a row, PostgreSQL stores that transaction's ID (XID) directly in the row header's xmax field. Simple and fast. But what happens when two or more transactions need to hold locks on the same row simultaneously? For example: Two sessions run SELECT ... FOR KEY SHARE on the same row, OR A SELECT ... FOR UPDATE session encounters a row already locked by a foreign key check (FOR KEY SHARE), OR Multiple sub-transactions created by SAVEPOINT all reference the same row In these cases, PostgreSQL cannot store two XIDs in a single xmax field. Instead, it creates a MultiXact ID (MXID) — a composite identifier that points to a list of all involved transaction IDs. These mappings are stored in two on-disk structures: pg_multixact/offsets/ — maps each MXID to its position in the member's file pg_multixact/members/ — stores the actual list of member XIDs To avoid reading these files from disk on every row visibility check, PostgreSQL caches them in memory using a SLRU (Simple Least Recently Used) cache — a small, fixed-size ring buffer with just 8 buffer slots by default (64 KB) for offsets and 16 slots (128 KB) for members. Each of those buffer slots is protected by an LWLock (Lightweight Lock). And that LWLock is exactly where the bottleneck hides. The Core Insight: Cache Hits and Lock Contention are not mutually exclusive This is the most important concept in this entire post: pg_stat_slru measures I/O misses. pg_stat_activity wait events measure lock contention. Here is what actually happens at the microsecond level: Session A: Needs MXID 5,000,023 → Page is in SLRU cache (blks_hit++) → LWLock is needed on buffer slot 3 → Session B already holds that LWLock → Session A WAITS → wait_event = 'MultiXactOffsetSLRU' recorded The cache is warm, the page is in memory, but what happens when 50 concurrent sessions suddenly all need that same buffer slot's LWLock? Only one request can hold the LWLock at a time. Every session that queues behind the lock holder generates a wait event record, all while no disk I/O is occurring. What Triggers MultiXact SLRU Wait Events? Below are the scenario's which could trigger MultiXact SLRU Wait Events. Thundering Herd — Mass Concurrent Row Locking When dozens or hundreds of sessions simultaneously attempt to lock the same small set of rows, PostgreSQL must create and resolve MXIDs at high frequency. Each resolution requires an SLRU lookup. When all lookups target the same few SLRU pages, the buffer lock becomes a serialization point. A classic reconnect avalanche triggered when the connection pool detected server slowness and all clients retried simultaneously. Foreign Key FOR KEY SHARE vs. FOR UPDATE Conflicts This is one of the most common hidden triggers. Every time a child row is inserted or updated, PostgreSQL implicitly acquires FOR KEY SHARE on the referenced parent row to prevent the parent from being deleted. If another session simultaneously holds FOR UPDATE on that same parent row, PostgreSQL must combine both lock modes into a new MXID: -- Session A (updater): SELECT * FROM orders WHERE id = 1 FOR UPDATE; -- Session B (child inserter — FK check): INSERT INTO order_items (order_id, amount) VALUES (1, 99.00); -- Implicitly: SELECT * FROM orders WHERE id = 1 FOR KEY SHARE; -- PostgreSQL creates MXID combining both transactions → SLRU write SAVEPOINT and Sub-transaction Churn Every SAVEPOINT creates a sub-transaction XID. ORM frameworks and JDBC drivers with autosave=always silently wrap every statement in a SAVEPOINT, generating sub-XIDs that accumulate rapidly. When these sub-XIDs reference the same rows across concurrent sessions, MXID creation accelerates. Each exception handler internally creates a SAVEPOINT, even when no explicit savepoint is declared. Idle Sessions Holding MXID References Open Long-running idle-in-transaction sessions hold their MXID references open, preventing PostgreSQL's autovacuum from cleaning up old multixact entries. As the MXID space grows without cleanup, the SLRU cache (fixed at 8 pages) cannot hold all active pages simultaneously. New MXID lookups begin evicting cached pages, and under concurrency the eviction/reload cycle itself creates LWLock contention. VACUUM FREEZE During High Concurrency When autovacuum runs VACUUM FREEZE to prevent MXID wraparound, it reads and resolves every MXID on every row in the table. During this process it holds SLRU buffer locks for each lookup. Concurrent application sessions needing the same SLRU pages must wait; generating MultiXactOffsetSLRU wait events even when the underlying cause is a routine maintenance operation. How to Diagnose This in Real Time Step 1: Confirm Active Wait Events Start by querying pg_stat_activity to confirm which sessions are actively waiting on MultiXact SLRU events. Use \watch 1 to refresh every second since SLRU LWLocks are held for microseconds and brief spikes can be missed. Pay close attention to the wait_event column, MultiXactOffsetSLRU and MultiXactMemberSLRU indicate pure lock contention on cached pages, while MultiXactOffsetBuffer indicates the page was evicted from cache and is being reloaded from disk. SELECT pid, wait_event_type, wait_event, state , now() - query_start AS duration, left(query, 100) AS query FROM pg_stat_activity WHERE wait_event IN ( 'MultiXactOffsetSLRU', 'MultiXactMemberSLRU', 'MultiXactOffsetBuffer' ) ORDER BY duration DESC; Step 2: Distinguish Lock Contention From Disk I/O Once wait events are confirmed, determine whether the root cause is cache under sizing (disk I/O problem) or LWLock queue depth (lock contention problem). These look identical in pg_stat_activity but require completely different fixes. Query pg_stat_slru and focus on the blks_read and blks_zeroed columns. A blks_read = 0 alongside active wait events is the definitive signature of pure lock contention: the cache is warm but the LWLock protecting each buffer slot is the bottleneck. SELECT name, blks_hit, blks_read, blks_zeroed, flushes , ROUND(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS hit_pct FROM pg_stat_slru WHERE name LIKE 'MultiXact%'; Interpretation Meaning blks_read = 0 AND wait events are present Pure lock contention — the cache is warm, the lock is the bottleneck blks_read > 0 AND wait events are present Cache miss + lock contention — the SLRU cache is undersized blks_zeroed are increasing rapidly High MXID creation rate — inspect application pattern issues flushes are high (2,000+) Frequent checkpoint SLRU flushes holding an exclusive lock Step 3: Check MXID Age and Identify Hot Tables A high MXID age means autovacuum is not keeping pace with MXID creation. This causes the MXID space to span far more SLRU pages than the 8-slot cache can hold, directly worsening lock contention. Check MXID age at both the database and table level. Any database approaching 1.5 billion should be treated as a critical incident. At the table level, focus on tables with the highest mxid_age combined with high row counts and stale last_autovacuum timestamps — these are your primary VACUUM FREEZE targets. -- Database-level MXID age (watch for values > 1 billion) SELECT datname, mxid_age(datminmxid) AS mxid_age FROM pg_database ORDER BY mxid_age DESC; -- Table-level — find the worst offenders SELECT relname, mxid_age(relminmxid) AS mxid_age , pg_size_pretty(pg_total_relation_size(oid)) AS table_size, n_live_tup FROM pg_stat_user_tables t JOIN pg_class c USING (relname) WHERE c.relkind = 'r' ORDER BY mxid_age DESC LIMIT 10; Step 4: Find Queries Driving MXID Creation Once the hot tables are identified, use pg_stat_statements to pinpoint the specific queries generating MXIDs at a high rate. Focus on queries using FOR UPDATE, FOR KEY SHARE, or FOR SHARE, as these are the direct triggers for concurrent row locking. Also watch for INSERT-heavy workloads on child tables with FK references, as these implicitly acquire FOR KEY SHARE on parent rows and are a common hidden driver of MXID creation. SELECT left(query, 200) AS query, calls, mean_exec_time::numeric(10,2) AS avg_ms FROM pg_stat_statements WHERE query ILIKE '%for update%' OR query ILIKE '%for key share%' OR query ILIKE '%for share%' ORDER BY calls DESC LIMIT 10; Step 5: Catch Long-Running Idle Transactions Idle-in-transaction sessions are one of the most damaging contributors to SLRU contention. They hold row-level locks while doing nothing, preventing autovacuum from cleaning up MXIDs and forcing other sessions to queue on the same SLRU pages indefinitely. Query pg_stat_activity filtered to idle-in-transaction states and focus on sessions idle for more than 2 minutes; these are almost always stuck client tool connections (DBeaver, pgAdmin) or misbehaving application threads that opened a transaction and never committed. SELECT pid, usename, application_name, state, now() - xact_start AS txn_age , now() - state_change AS idle_duration, left(query, 100) AS last_query FROM pg_stat_activity WHERE state IN ('idle in transaction', 'idle in transaction (aborted)') AND now() - state_change > interval '2 minutes' ORDER BY idle_duration DESC; Fixes and Mitigation Strategies Terminate long-running idle-in-transaction sessions The fastest way to break the contention cycle during an active incident is to terminate sessions holding locks while idle. This immediately releases row-level locks, unblocks autovacuum, and allows MXID cleanup to resume. Always preview before terminating to avoid disrupting legitimate long-running batch jobs. -- Preview candidates first SELECT pid, usename, application_name, now() - state_change AS idle_duration FROM pg_stat_activity WHERE state = 'idle in transaction' AND now() - state_change > interval '5 minutes' ORDER BY idle_duration DESC; --then run the terminate SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND now() - state_change > interval '5 minutes'; Set timeouts to prevent recurrence After immediate relief, set timeouts to automatically kill idle-in-transaction sessions before they accumulate again. Both parameters take effect immediately via pg_reload_conf() with no server restart required. Choose values appropriate to your workload, use more aggressive for high-concurrency OLTP, more lenient for batch or reporting workloads. ALTER SYSTEM SET idle_in_transaction_session_timeout = 'XXmin'; ALTER SYSTEM SET statement_timeout = 'XXmin'; SELECT pg_reload_conf(); VACUUM FREEZE the most affected table Running VACUUM FREEZE on the tables identified in Step 3 immediately clears old MXIDs from row headers, replacing them with frozen values that never require SLRU lookups. This directly reduces SLRU page access and LWLock contention. For very large tables, use INDEX_CLEANUP FALSE on the first pass for speed, then run a full vacuum during off-peak hours. VACUUM FREEZE VERBOSE your_hot_table; Connection pooling Connection pooling is the most impactful long-term fix for preventing thundering herd connection storms. PgBouncer in transaction mode absorbs connection spikes at the pooler layer, ensuring PostgreSQL never sees more simultaneous sessions than the pool size, and directly reducing the number of concurrent MXID lookups and SLRU LWLock contention. On Azure PostgreSQL Flexible Server, PgBouncer is available as a built-in feature. Simply enable it via Server Parameters (pgbouncer.enabled = ON) and connect on port 6432. No separate installation is required. # Key PgBouncer settings for OLTP workloads pool_mode = transaction # Release connection after each transaction default_pool_size = 100 # Real PG connections per database/user pair max_client_conn = 5000 # Total clients PgBouncer will accept Conclusion MultiXact SLRU wait events are deceptive. A server can show 100% cache hit ratio, nominal disk I/O, and moderate CPU, all while simultaneously stalling hundreds of sessions on a lock held for microseconds. The cache appears healthy, and the problem is invisible... until sessions start piling up. The single most important diagnostic insight is this: pg_stat_slru measures I/O efficiency. pg_stat_activity measures lock contention. A 100% hit ratio does not mean there are no problems, it means the bottleneck is the lock protecting the cached page, not the cache itself. Once you know this, the path forward is clear. Use pg_stat_slru and pg_stat_activity together, never in isolation. Track mxid_age proactively, and do not wait for autovacuum to raise the alarm. Eliminate idle-in-transaction sessions before they become lock holders. Use the weakest lock mode that satisfies your application's consistency requirements. Use FOR NO KEY UPDATE instead of FOR UPDATE wherever possible. And deploy PgBouncer (available as a built-in feature on Azure Database for PostgreSQL Flexible Server) to absorb connection storms before they reach the SLRU layer. MultiXact contention is not inevitable. It is the compounded result of concurrent locking patterns, under-tuned autovacuum, and unbounded connection growth, all of which are fully controllable. The earlier you instrument for it, the less likely it becomes an incident.128Views0likes0CommentsReal-World Success Stories with PostgreSQL on Azure
Organizations rarely leap into cloud migrations or AI-powered systems overnight. They progress in deliberate stages, establishing a reliable data foundation, optimizing for performance, and then accelerating innovation. Across healthcare, financial services, and AI startups, companies are navigating this journey on Azure Database for PostgreSQL: a fully managed, enterprise-ready PostgreSQL environment with 58% lower total cost of ownership (TCO) compared to on-premises deployments. This post walks through real customer stories that span the full arc, from lift-and-shift migration to production-grade AI agent development, illustrating how Azure Database for PostgreSQL supports scalability, performance, security, and AI-readiness at every stage. Migrating with Confidence: Apollo Hospitals & August AI Apollo Hospitals operates a network of more than 74 hospitals and needed to move beyond a legacy on-premises Oracle system that had become difficult to manage and couldn't keep pace with growing data volumes. IT teams were spending their time on maintenance rather than innovation. Apollo migrated its core hospital information system backend to Azure Database for PostgreSQL. Working with partner Quadrant Technologies, the team lifted and shifted critical applications while using Azure DevOps to orchestrate CI/CD pipelines and Azure Application Insights for telemetry and observability. The results: 99.95% availability across hospital systems Database transactions executing within 5 seconds 40% reduction in deployment times via modern CI/CD pipelines Decreased operational overhead, freeing IT staff for higher-value work With a stable, scalable PostgreSQL backend in place, Apollo is now exploring real-time analytics and AI-enabled tools like Microsoft 365 Copilot to advance patient care. "We saw Azure Database for PostgreSQL as the right foundation for the future. It's open, cost-effective, and capable of supporting the hospital information system we built in-house." — Shankar Krishna A., General Manager of IT, Apollo Hospitals Apollo's experience is not unique. August AI, a healthcare-tech startup offering an AI-driven medical companion, migrated its entire stack to Azure—with Azure Database for PostgreSQL storing mission-critical patient data while meeting strict compliance requirements such as HIPAA. The result: scaling from roughly 500,000 users to 3.5 million+ users worldwide, with zero downtime during the cutover, completed in just three months. As Founder and CEO Anuruddh Mishra noted: "We receive a log of queries that are not performing optimally, and within a couple of minutes we can optimize that query with PostgreSQL on Azure and move on". Modernizing at Scale: Nasdaq Migration is often the first step. Nasdaq demonstrates what becomes possible when organizations modernize their architecture on a scalable data foundation. To improve its Nasdaq Boardvantage platform—used by corporate boards to collaborate on governance documents—Nasdaq re-architected on Azure. The team containerized services with Azure Kubernetes Service (AKS) and adopted Azure Database for PostgreSQL alongside Azure Database for MySQL as persistent data stores for governance workloads. This architecture provided the flexibility, performance, and security required for a multitenant platform handling sensitive board materials. With the data layer in place, Nasdaq integrated Microsoft Foundry and Azure OpenAI to deliver AI-powered summarization and workflow automation. The measurable outcomes: 60% reduction in reading time through AI-powered document summarization 25% decrease in administrative preparation time across board workflows Up to 97% accuracy in AI-generated summaries and meeting minutes A reusable AI framework established for future extensibility "Both Azure Database for PostgreSQL and Azure Database for MySQL gave us the right balance of performance, security, and control. The governance workloads we handle are unique, so we needed something that could meet those isolation and encryption requirements." — Scott Ellison, Vice President of Technology, Nasdaq Building Intelligent Applications: SubgenAI and OpenAI Azure Database for PostgreSQL now supports native vector search via pgvector, high-performance DiskANN indexing, semantic operators and AI model management, and integrated graph capabilities for relationship reasoning—making it a production-ready foundation for intelligent applications. SubgenAI, a European generative AI company, built its flagship platform Serenity Star on Azure Database for PostgreSQL and Microsoft Foundry to transform AI agent development from a code-heavy, fragmented process into a streamlined, no-code experience. A core technical requirement: the platform's retrieval-augmented generation (RAG) system needs efficient vector search against embedded content while maintaining enterprise-grade reliability. After evaluating several database options, SubgenAI chose Azure Database for PostgreSQL with pgvector for its accurate and scalable vector similarity search. Serenity Star customers can now: Launch AI agents in as little as 15 minutes Cut coding and development time by 50% Resolve most AI agent queries in under 60 seconds [ "With Microsoft and Azure Database for PostgreSQL we have total control and an environment that is truly dynamic and can adapt to the evolution we're looking for." — Julia Schröder Langhaeuser, VP of Product Serenity Star, SubgenAI At the extreme end of scale, OpenAI runs PostgreSQL on Azure to support production systems behind ChatGPT. As write scalability limits emerged on an initially unsharded single primary instance, OpenAI offloaded write-heavy operations to other systems and optimized read workloads using PgBouncer for connection pooling. The Azure Database for PostgreSQL team responded by developing the elastic clusters feature, enabling horizontal scaling through row-based and schema-based sharding. The team reduced connection latency from approximately 50 ms to under 5 ms, scaled reads horizontally with multiple replicas, and improved reliability by prioritizing critical requests—all achieved by a small team making systematic optimizations on open-source PostgreSQL. "After all the optimization we did, we are super happy with Postgres right now for our read-heavy workloads. It's really scalable and reliable." — Bohan Zhang, Member of the Technical Staff, OpenAI Meeting You Where You Are Beyond these stories, organizations like BMW Group (cloud-native applications at global scale), Ahold Delhaize (highly available retail applications), Mott MacDonald (an AI agent accelerating onboarding and spreading best practices across 220,000 employees), and Multitude (scaling responsibly in regulated environments) all run on Azure Database for PostgreSQL. The service offers 99.99% availability with automatic failover and SLA, independent compute and storage scaling, and intelligent performance recommendations, available across 60+ Azure regions. Developer tooling including the PostgreSQL extension for Visual Studio Code with GitHub Copilot further accelerates productivity. Whether you are planning your first migration or building production AI agents, these stories share a clear signal: Azure Database for PostgreSQL delivers a scalable, secure, AI-ready data foundation at every stage of growth. Explore full customer stories in depth in the eBook: Customer Success Stories with Azure Database for PostgreSQL.129Views1like0CommentsUltimate Guide to POSETTE: An Event for Postgres, 2026 edition
POSETTE: An Event for Postgres 2026 is back for its 5th year: free, virtual, and unapologetically all about Postgres. No travel budget required and no jet lag involved. Just your laptop, a decent internet connection, and curiosity. This year the POSETTE 2026 schedule has 4 livestreams (16-18 June) with 44 talks at ~25 minutes each—covering everything from query performance and partitioning to Postgres 19 features, extensions, and use cases. Which is awesome but also a bit of work to figure out which talks are for you. Hence this ultimate guide post. Every talk will land on YouTube afterward (un-gated, of course) so if you miss anything you care about, you can watch it later. But if you can catch a livestream in June, do it. That’s when the “virtual hallway track” happens on Discord—where you can ask the POSETTE speakers questions and compare notes with other attendees. Meeting other attendees who have the same weird Postgres problems you do can be reassuring somehow. And yes, there will be swag. This guide is your cheat sheet: I’ve categorized and tagged all 44 talks so you don’t have to read 44 abstracts back-to-back. In this post you'll get: “By the numbers” summary Map of the 44 talks 2 Keynote sessions 23 Postgres core talks 11 Postgres ecosystem talks 8 Azure Database talks Why participate on the virtual hallway track on Discord A big thank you to our amazing speakers Join us for POSETTE 2026 & mark your calendars Official POSETTE 2026 Trailer “By the numbers” summary for POSETTE 2026 Here’s a quick snapshot of what you need to know about POSETTE this year: 3 days 16-18 June 2026 4 livestreams In Americas & EMEA time zones but of course you can watch from anywhere 44 talks All free, all virtual 2 invited keynotes Driving Postgres forward at Microsoft (Livestream 1), and Postgres 19 Hackers Panel: What’s In, What’s Out, & What’s Next (Livestream 2) 25 minutes Average length per talk ~1100 minutes Total minutes in POSETTE 2026 talks 50 speakers POSETTE 2026 speakers include PostgreSQL hackers and contributors, users, application developers, PG community members, Azure engineers, & Azure customers 6 keynote speakers Affan Dar & Charles Feddersen (Livestream 1); and Álvaro Herrera, Heikki Linnakangas, Melanie Plageman, & Thomas Munro (Livestream 2) 19 countries Speakers reside in 19 different countries 23 companies Speakers hail from 23 different companies 17.6% CFP acceptance rate 42 talks selected from 238 submisssions 75% general Postgres talks 33 talks are not cloud-specific at all, they’re about the Postgres technology & ecosystem 25% Azure-related talks 11 of 44 talks feature Azure Database for PostgreSQL or Azure HorizonDB 1 organizing company Organized by the Postgres team at Microsoft, in partnership with AMD 17 languages Published talk videos will have captions available in 17 languages, including English, Czech, Dutch, French, German, Hindi, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish, Turkish, Ukrainian, and Chinese Simplified & Chinese Traditional Map of the 44 talks To help you quickly navigate all 44 talks, here’s a map of the high-level categories and detailed topics. : A map of the POSETTE 2026 talks—high-level categories and detailed tags to help you find what you care about 2 Keynote sessions Affan Dar and Charles Feddersen lead the PostgreSQL engineering and product teams at Microsoft, In this keynote, they’ll walk through how Microsoft is contributing to Postgres, both upstream in the open source project and in the cloud database service they build on top of it. Driving Postgres forward at Microsoft, by Affan Dar & Charles Feddersen (Azure Database for PostgreSQL, Azure HorizonDB, VS Code, Dev tools, community, Postgres hacking, open source, PosetteConf, livestream-1) Want to understand how Postgres features get decided? This keynote panel with 4 PostgreSQL committers & hackers will peel back the curtain. You’ll hear what made it into Postgres 19, what didn’t (and why), and get a sneak peek into a few of the things in the oven for Postgres 20. Postgres 19 Hackers Panel: What’s In, What’s Out, & What’s Next, by Álvaro Herrera, Heikki Linnakangas, Melanie Plageman, & Thomas Munro (Postgres 19, Postgres hacking, panel, open source, collaboration, multithreading, livestream-2) 23 Postgres core talks Data Modeling JSON in PostgreSQL - evil data type or just needs to be tamed?, by Boriss Mejias (JSON, performance, data modeling, livestream-1) PostgreSQL Design Patterns, by Chris Ellis (data modeling, SQL, PG use cases, livestream-1) Graph Data Exploring property graphs with SQL/PGQ in PostgreSQL, by Ashutosh Bapat (SQL/PGQ, graph data, data modeling, Postgres 19, livestream-4) LISTEN/NOTIFY LISTEN Carefully: How NOTIFY Can Trip Up Your Database, by Jimmy Angelakos (LISTEN/NOTIFY, PG use cases, triggers, livestream-4) Performance Maintaining Large Tables in PostgreSQL, by Sarat Balijepalli (WAL, performance, scaling Postgres, vacuum, autovacuum, statistics, partitioning, monitoring, livestream-3) My Postgres partitioning cookbook, by Derk van Veen (partitioning, PG use cases, data modeling, performance, livestream-4) PostgreSQL 17 vs 18: Side‑by‑Side Performance Wins in Real‑World Queries, by Divya Bhargov (performance, PG use cases, livestream-3) Vacuuming Enhancements in PostgreSQL 18: Faster, Smarter, More Predictable, by Shashikant Shakya (vacuum, async IO, monitoring, performance, livestream-4) PG Internals Linux and PostgreSQL in the Multiverse of Connections, by Josef Machytka (Linux, PG internals, connection pooling, livestream-2) pg_stats: How Postgres Internal Stats Work, by Richard Yen (statistics, pg_stats, PG internals, query planner, livestream-2) Postgres isn’t slow, your storage is, by Sai Srirampur (storage, IO, performance, livestream-3) PostgreSQL queues done right with PgQ, by Alexander Kukushkin (queues, PG internals, extensions, livestream-2) random_page_cost in Postgres - why the default is 4.0 and should you lower it?, by Tomas Vondra (PG internals, IO, performance, livestream-1) The Wonderful World of WAL, by Bruce Momjian (WAL, PG internals, replication, livestream-3) What's new with constraints in Postgres 18, by Gülçin Yıldırım Jelínek (constraints, data modeling, livestream-2) Postgres Hacking Fuzzing PostgreSQL, by Adam Wolk (PG internals, testing, Dev tools, libpq, security, livestream-1) Journey of developing a performance optimization feature in PostgreSQL, by Rahila Syed (Postgres hacking, PG internals, performance, WAL, replication, livestream-4) The Hitchhiker’s Guide to PostgreSQL Hacking: Don’t Panic, Just Start Small, by Xuneng Zhou (Postgres hacking, PG internals, community, livestream-2) Replication Past, Present, and Future: Logical Decoding and Replication in PostgreSQL, by Hari Kiran (replication, logical decoding, PG internals, livestream-4) Where Does My INSERT Go? A Logical Replication Story, by Hamid Akhtar (replication, PG internals, WAL, livestream-4) Security From Dev to Prod: Securing Postgres the Right Way, by Sakshi Nasha (security, roles, PG use cases, extensions, monitoring, livestream-4) From trust to Tokens: A Short History of PostgreSQL Authentication, by Murat Tuncer (authentication, security, livestream-2) PostgreSQL vs. SQL Server: Security Model Differences, by Taiob Ali (security, authentication, SQL Server, roles, livestream-1) 11 Postgres ecosystem talks Analytics pg_lake: Postgres as a lakehouse, by Marco Slot (pg_lake, extensions, OLAP, data warehouse, Iceberg, DuckDB, analytics, livestream-2) Apache AGE Querying & Visualizing Graphs in Postgres with Apache AGE, by Christian Miles (Apache AGE, graph data, data visualization, SQL/PGQ, Azure HorizonDB, livestream-1) Autotuning Building safety tooling for risk-free AI tuning of Postgres: Fast cars need fast brakes, by Mohsin Ejaz (autotuning, AI, performance, monitoring, livestream-2) Change Data Capture Building Event-Driven Systems with PostgreSQL Logical Replication and Drasi, by Diaa Radwan (Drasi, replication, WAL, CDC, livestream-3) Citus Move Less, Move Faster: Speeding Up Citus Cluster Scaling, by Muhammad Usama (Citus, extensions, performance, scaling Postgres, livestream-4) Dev Tools An MCP for your Postgres DB, by Pamela Fox (MCP, AI, Python, Dev tools, livestream-1) pgcov: Bringing Real Test Coverage to PostgreSQL Code, by Pavlo Golub (testing, Postgres hacking, Dev tools, extensions, CI/CD, livestream-3) PostgreSQL Tooling Across AI Editors and Agents, by Matt McFarland (Dev tools, VS Code, Cursor, AI, data visualization, Apache AGE, graph data, Azure, MCP, Copilot, livestream-1) Django PostgreSQL Generated Columns by Example, by Paolo Melchiorre (app dev, Django, generated columns, livestream-2) Kubernetes Quorum-Based Consistency for Cluster Changes with CloudNativePG Operator, by Jeremy Schneider & Leonardo Cecchi (CloudNativePG, Kubernetes, PG use cases, livestream-3) Performance Modelling Postgres Performance Degradation on Burstable Cloud Instances, by Chun Lin Goh (performance, burstable, compute, QA, livestream-4) 8 Azure Database for PostgreSQL & Azure HorizonDB talks AI-related talks From Queries to Agents: The Next Era of Data Retrieval on PostgreSQL, by Abe Omorogbe (AI, MCP, Azure Database for PostgreSQL, graph data, Apache AGE, Azure HorizonDB, livestream-3) Production RAG at Scale with Azure Database for PostgreSQL, by Julia Schröder Langhaeuser & Paula Santamaría (Azure Database for PostgreSQL, AI, RAG, PG use cases, livestream-3) AMD Choose the Right Azure Infrastructure to Improve Postgres Performance by Over 60%, by Andrew Ruffin (AMD, performance, Azure, compute, Azure Database for PostgreSQL, livestream-1) Azure HorizonDB Why we built Azure HorizonDB for PostgreSQL, by Dingding Lu (Azure HorizonDB, scaling Postgres, livestream-3) Flexible Server pg_duckdb in Action: Accelerating Analytics on Azure Database for PostgreSQL, by Nitin Jadhav (DuckDB, Azure Database for PostgreSQL, extensions, OLAP, analytics, performance, livestream-4) The Rise of PostgreSQL as the Everything Database, by Varun Dhawan (Postgres history, extensions, graph data, Apache AGE, Azure Database for PostgreSQL, DuckDB, Citus, livestream-3) What I’ve Learned Teaching Postgres to 200+ field engineers at Microsoft, by Paula Berenguel (training, Azure, Postgres skilling, livestream-1) Oracle to Postgres Migrating VLDBs from Oracle to Azure Database for PostgreSQL, by Adithya Kumaranchath (migration, Azure Database for PostgreSQL, Oracle to Postgres, livestream-2) Why participate in the virtual hallway track on Discord If you’ve checked out the schedule and plan to watch some of the talks, you might still be wondering: why join live—and why bother with the virtual hallway track on Discord? Here’s how a few of last year’s attendees described the experience: “Very impressed by all the speakers and content I am absolutely shattered as there was so much great content in all the talks over the past 3 days but I have probably learnt more in these sessions than I could have in months of reading up.” “Want to let y’all know how much I got from this onine conference, the speakers were excellent, well-prepared and well-presented. The hosts were informative, engaging, & amusing. The discord hallway channel made me feel connected. I learned a lot and found some new inspiration. I’ll be back next year!” “I have no idea how I’m going to summarise all the interesting stuff for coworkers.” The common thread: the live, shared experience—being able to ask questions, compare notes, and learn alongside other people in real time. How to join the virtual hallway track Head to the #posetteconf channel on Discord (on the Microsoft Open Source Discord) That’s where speakers and attendees hang out during the livestreams—it’s where you can ask questions, share reactions, and just say hi Big thank you to our amazing speakers Every great event starts with great talks—and great talks start with great speakers. Want to learn more about the people behind these talks? Visit the POSETTE 2026 Speaker page Click a speaker’s bio to see their written interview (if available) If a speaker has been a guest on the Talking Postgres podcast in the past, then you’ll find a link to their episode there, too Join us for POSETTE 2026! Mark your calendars I hope you join us for POSETTE 2026. Consider yourself officially invited. As part of the talk selection team, I’m definitely biased—but I truly believe these speakers and talks are worth your time. I’ll be hosting Livestream 1 and you’ll find me in the #posetteconf Discord chat. I hope to see you there. And please: tell your Postgres friends, so they don’t miss out! 🗓️ Add the livestreams to your calendar Livestream 1: Tue 16 June, 8am–2pm PDT (UTC-7) [ register for updates ] and/or [ add to calendar ] Livestream 2: Wed 17 June, 8am–2pm CEST (UTC+2) [ register for updates ] and/or [ add to calendar ] Livestream 3: Wed 17 June, 8am–2pm PDT (UTC-7) [ register for updates ] and/or [ add to calendar ] Livestream 4: Thu 18 June, 8am–2pm CEST (UTC+2) [ register for updates ] and/or [ add to calendar ] Watch last year’s POSETTE 2025 talks in advance: And if you want to get ready, you can watch talks from the POSETTE 2025 playlist on YouTube anytime, anywhere. Lots of solid, useful, and evergreen Postgres talks in there. “Official Trailer” for POSETTE 2026 is on YouTube To help more developers, community members, and Postgres users discover POSETTE 2026, our team created this short video trailer. Take a peek and share it with friends as an invitation of sorts. We’re trying to make sure that people don’t miss their opportunity to be part of the livestreams and ask questions on the discord during the conference (as well as watch the talks on YouTube after the event is over.) Watch and share the trailer: Official Trailer for POSETTE: An Event for Postgres 2026 Acknowledgements & Gratitude I’ve already thanked the 50 amazing speakers above. In addition, thanks go to Silvano Coriani, Cornelia Biacsics, Aaron Wislang, and My Nguyen for reviewing parts of this post before publication. I also want to thank the team at AMD for their partnership and support of POSETTE this year! And of course, big thank you to the POSETTE 2026 organizing team and POSETTE talk selection team—without you, there would be no POSETTE! Figure 3: Visual invitation to join the virtual hallway track for POSETTE 2026 on the Microsoft Open Source Discord, so you can chat with the speakers & others in the Postgres community525Views3likes0CommentsEnd-to-end workload observability with Query Store for primary and replicas
Query performance doesn’t stop at the primary Most PostgreSQL architectures don’t run on a single node anymore. Reads get offloaded. Replica chains grow. And when performance issues hit, the hardest part is often simple: where did the queries actually run? With the latest query store capabilities in Azure Database for PostgreSQL flexible server, you can now capture workload executed not just on the primary, but also on read replicas—including cascading read replicas—and export the captured runtime stats, wait stats, and query text into Azure Monitor Logs (Log Analytics workspace / LAWS). See the real hotspot: isolate which node (primary vs replica) is slow. Know why: break down time by waits (CPU, I/O, locks) per query. Connect the dots: correlate query IDs to query text, and inspect sampled parameters locally in azure_sys on the primary when you need input context (parameters aren’t exported to LAWS). Centralize analysis: query everything with KQL in LAWS, across servers. What you’ll build This post walks through a reproducible demo that provisions a primary server, a read replica, and a cascading read replica, then runs a TPC-H–based workload across all three to generate query store data you can analyze locally and in Log Analytics. Enable query store capture (including query text) and parameter sampling for parameterized queries. Enable wait sampling so query store can record wait statistics. Export runtime stats, wait stats, and SQL text to LAWS using resource-specific tables. Validate capture on read replicas and cascading read replicas (not just the primary). Prerequisites Azure CLI logged in (az login) and permission to create a resource group, Log Analytics workspace, and PostgreSQL flexible servers. psql and curl available on your machine. PostgreSQL flexible server on General Purpose or Memory Optimized tier (query store and replicas aren’t supported on Burstable). PostgreSQL 14+ to test out cascading replicas. Networking: the script opens firewall access broadly for demos—tighten for production. Architecture (primary + replica chain + LAWS) You’ll deploy four resources: Primary server: read/write node. Read replica (level 1): read-only node created from the primary. Cascading read replica (level 2): read-only node created from replica level 1. Log Analytics workspace (LAWS): central place to query Query Store telemetry across all nodes. If Diagnostic Settings is properly configured, each server streams query store telemetry to LAWS—but how it’s kept locally differs by role. On the primary, query store data is recorded in-memory, then persisted locally in the azure_sys database, and then exported to LAWS. On read replicas (including cascading replicas), query store data is recorded in-memory only and then exported to LAWS. Bottom line: use LAWS for fleet-wide visibility, and use the primary’s azure_sys when you need deep local inspection (like parameter samples). Deploy the demo environment The fastest way to reproduce the scenario is to run the end-to-end bash script which you can download from https://raw.githubusercontent.com/Azure-Samples/azure-postgresql-query-store/refs/heads/main/may2026/script/query_store_demo.sh Save the file to a local directory in your Linux shell, and name the file query_store_demo.sh. To invoke the script, at minimum, you must assign a string password for the administrator login of the instances of the flexible servers it creates, and invoke the script like this: ADMIN_PASSWORD=<Your_Strong_Password> ./query_store_demo.sh Optionally, you can also override default values for other environment variables used by the script: Variable Purpose Default SUBSCRIPTION_ID Azure subscription ID to use (current default subscription) BASE_NAME Base name for all resources (used in naming servers, resource groups, etc.) pgqswait{YYYYMMDDHHMMSS} RESOURCE_GROUP Azure resource group name rg-{BASE_NAME} LOCATION Azure region for resources southeastasia PRIMARY_SERVER Name of primary PostgreSQL server {BASE_NAME}-primary REPLICA_1 Name of first-level read replica {BASE_NAME}-readreplica REPLICA_2 Name of second-level cascading read replica {BASE_NAME}-cascadereadreplica LOG_ANALYTICS_WORKSPACE Log Analytics workspace name law-{BASE_NAME} LOG_ANALYTICS_LOCATION Azure region for Log Analytics workspace southeastasia ADMIN_USER PostgreSQL admin username pgadmin ADMIN_PASSWORD PostgreSQL admin password (REQUIRED) SKU_NAME PostgreSQL server SKU (compute tier) Standard_D4ds_v5 TIER PostgreSQL pricing tier GeneralPurpose STORAGE_SIZE Storage size in GB 64 VERSION PostgreSQL version (minimum 14 for cascading replicas) 17 PRIMARY_DATABASE Initial database name postgres SQL_BASE_URL Base URL for downloading SQL scripts https://raw.githubusercontent.com/Azure-Samples/azure-postgresql-query-store/refs/heads/main/may2026/script/query_store_demo.sh TPCH_DDL_URL URL for TPC-H schema DDL file {SQL_BASE_URL}/schema/tpch_ddl.sql WORKLOAD_REPETITIONS Number of times to execute each workload query (minimum 5) 10 AUTO_APPROVE Skip confirmation prompt and proceed automatically false If, for example, you want to not only pass the ADMIN_PASSWORD but also override the LOCATION, you could do it like this: ADMIN_PASSWORD=<Your_Strong_Password> LOCATION=canadacentral ./query_store_demo.sh In a bit over 1 hour, the script will do the following steps: Step 1 — Provision first part of the infrastructure The infrastructure provisioned in this phase consists of: A resource group in which all resources are deployed. An instance of Log Analytics workspace, where all flexible server instances will send their query store related logs. A primary (read-write) flexible server. Step 2 — Configure primary server Now it's time to configure one new server parameters on your primary server so that query store emits query text to LAWS, so that we can correlate quey IDs to something recognizable. Query IDs are great for aggregation—but you still need the SQL. Turn on query text emission so you can correlate runtime and waits back to the actual statement text. Do this by setting pg_qs.emit_query_text to on. Refer to our documentation to learn how to set the value of a server parameter. Step 3 — Provision second part of the infrastructure The infrastructure provisioned in this phase consists of: A read replica (read-only) whose source is the primary server. A cascade read replica (read-only), whose source is the previously created read replica. Notice that when read replicas are created, they inherit the server parameter values from their source server. Because we have configured query store related settings on the primary server already, the intermediate read replica inherits its server parameters from that primary, and the cascade read replica inherits them from the intermediate replica. Step 4 — Export query store to Log Analytics (LAWS) Now for the payoff, we want to stream the data to Log Analytics so you can query across nodes, build dashboards, and alert. The script configures diagnostic settings on the primary and both replicas to send logs to a Log Analytics workspace using resource-specific tables. This is the key to cross-node visibility: each server exports its own captured telemetry, and you can slice by resource in a single KQL query. Query store runtime stats: execution counts, elapsed time, and other performance counters. Query store wait stats: wait breakdown attributed to queries. Query store SQL text: query text to decode query IDs. Note: Query store parameter samples are not included in the Log Analytics export. Parameters are stored locally per server in azure_sys, and on read replicas azure_sys is read-only—so don’t depend on replicas for parameter inspection. LAWS receives runtime stats, wait stats, and query text. Diagnostics settings for an instance of flexible server can be configured via portal. In the resource menu, under Monitoring, select Diagnostic settings. Add a new diagnostic setting, select a destination Log Analytics workspace, and the individual log catergories which you want to stream to that LAWS, and save the changes. For Destination table it's highly recommended to use Resource specific (one table per signal with proper schema) over Azure diagnostics (legacy one table for everything). With Azure diagnostics, all logs from all resource types land into a single table (AzureDiagnostics). It's a wide table with many columns. New columns get added as services emit new fields. If the 500 column limit is hit, extra fields go into the AdditionalFields column (a dynamic JSON). Querying on attributes stored in that column might have huge performance and query cost impact. The schema is inconsistent and difficult to discover. You must always filter events in that table by ResourceType and Category. On the other hand, with Resource specific, logs are written to separate tables per resource type and category. Therefore, each table has a well-defined schema and columns are strongly typed. Tables are smaller and faster to query. Queries on these tables are simpler don't need filtering by ResourceType and Category. Performance-wise, they also support faster ingestion and faster querying. They also support selecting different table plans and retention settings for each table. And, more importantly, role-based access control (RBAC) permissions can be applied at table level, allowing you to control access to telemetry in a more granular way. Note: If you want to see any of the images in this article in better quality, click on them to see them in their original size. This can also be configured using Azure CLI command az monitor diagnostic-settings create. Make sure that the --export-to-resource-specific parameter is set to true, which is the equivalent of selecting Resource specific for Destination table in portal UI. Setting this parameter to false, would mean that you want to use AzureDiagnostics as the destination table, which we don't recommend using. Step 5 — Run some workload In this phase the script loads a TPC-H schema and executes workload SQL across different nodes so that you can prove replica capture. Query it in Log Analytics Once the workload completed and data was streamed to Log Analytics, you can open your Log Analytics workspace, and start querying the relevant tables. If you don't know how to start issuing queries in a Log Analytics workspace, refer to Get started with log queries in Azure Monitor Logs. In your Log Analytics workspace, when you select Logs in the resource menu, you can access the Queries hub. By default, it should open automatically unless you have configured it to not show, in which case you can open by selecting Queries hub on the top right corner of the Logs home screen. If you add a filter in the queries hub for Resource type equals Azure Database for PostgreSQL Flexible Server, you'll be able to access multiple examples of queries which might help you get started querying the log categories we support for our service. You can run any of them by selecting Run on the summarization card that describes the query or, if you hover the mouse over the card, you can select Load to editor so that the query is copied over to the active query window, and you can run it or modified it further. Following, there are a few more query examples which can be useful to analyze the workload executed in this experiment. Top queries by total time (across all nodes) To get the list of 10 queries with higher duration from the ones that ran on any of the three nodes. KQL PGSQLQueryStoreRuntime | summarize total_time_ms = sum(TotalExecDurationMs) by QueryId, LogicalServerName | top 10 by total_time_ms desc Results Important: Results might be slightly different on each execution of the experiment. Where queries wait on each node List the most frequent wait events observed on user initiated queries across all nodes. KQL PGSQLQueryStoreWaits | join kind=inner (PGSQLQueryStoreRuntime) on QueryId | summarize total_waits_sampled = sum(Calls) by Event, EventType, LogicalServerName | order by total_waits_sampled desc Results Important: Results might be slightly different on each execution of the experiment. Decode query IDs (join runtime stats with SQL text) Top 20 queries the most frequent wait events observed on user initiated queries across all nodes. KQL PGSQLQueryStoreRuntime | join kind=inner (PGSQLQueryStoreQueryText) on QueryId | where QueryType == 'select' | project LogicalServerName, QueryId, TotalExecDurationMs, QueryText | top 20 by TotalExecDurationMs desc Results Important: Results might be slightly different on each execution of the experiment. Compare primary vs replicas (workload distribution) Find total number of query executions and accumulated duration of all those executions for each node. KQL PGSQLQueryStoreRuntime | summarize execs = sum(Calls), total_time_ms = sum(TotalExecDurationMs) by LogicalServerName | order by total_time_ms desc Results Important: Results might be slightly different on each execution of the experiment. Replica-only hotspots (find what’s slow off the primary) Find top 10 queries executed by their aggregated duration, focusing on what was executed on read replicas only. KQL let Replicas = dynamic(["pgqswait20260505220501-readreplica", "pgqswait20260505220501-cascadereadreplica"]); PGSQLQueryStoreRuntime | where LogicalServerName in (Replicas) | summarize total_time_ms = sum(TotalExecDurationMs) by QueryId, LogicalServerName | top 10 by total_time_ms desc Results Important: Results might be slightly different on each execution of the experiment. QPI now supports query store stats collected on replicas You can now use Query Performance Insight workbooks to analyze query store information not only on your primary server, as you were used to, but you can also get that valuable information on your read replicas. Why replica workload capture is a big deal This is the unlock: you can now answer performance questions in replica-heavy architectures without stitching together partial signals. Per-node truth: see the slow queries on the node where they actually ran (primary vs replica vs cascading replica). Faster root cause: runtime + waits gives you “slow” and “why” in one place. Replica tuning that sticks: identify replica-specific bottlenecks (I/O saturation, lock waits, CPU pressure) and tune with evidence. Centralized observability: export to LAWS so you can build dashboards, alerts, and cross-server comparisons with KQL. Unlock query visibility: Access query text without database permissions. Fine grain control on who can view query text: Using resource specific tables in LAWS, you can decide which users can access the table in which text of the queries is kept. Parameter-aware debugging: sampled parameters can help reproduce issues and explain plan changes, but they’re stored locally in azure_sys and not exported to LAWS. In practice, rely on the primary for parameter inspection (replicas have read-only azure_sys). Operational notes (quick but important) Expect a delay: Query store stats and LAWS ingestion aren’t instant. Give it a few minutes after running workload. Mind retention: Query store retention and Log Analytics retention are separate knobs. Tune them to balance troubleshooting value and cost. Production hygiene: don’t use wide-open firewall rules outside of a demo. Clean up When you’re done, delete the resource group: az group delete --name <RESOURCE_GROUP> --yes --no-wait Bottom line Query store in Azure Database for PostgreSQL flexible server now matches how modern architectures run—across primary, read replicas, and cascading replicas—and LAWS gives you a single place to query, compare, and act.444Views13likes0CommentsWhen PostgreSQL v17 Chooses the Wrong Plan: A Deep Dive into CTEs with Data Skew
Common Table Expressions or CTEs provide temporary named result set that can be referenced within single complex SQL statements. CTE offer great flexibility in breaking down complex SQL queries into multiple readable parts and enables recursion. CTEs allows materialized result set which allows users to refer the result set multiple times. PostgreSQL 17 propagates column statistics from MATERIALIZED CTEs into the parent query, allowing the planner to estimate hash aggregation and join cardinalities more accurately. However, when queries combine highly skewed data distributions with runtime randomness (for example, ORDER BY random() LIMIT 1), this increased planner confidence can sometimes lead to execution plans that are correct in theory but catastrophic for specific runtime values. What is Data Skew and how it affects plans? Data skew is an uneven distribution of values in a column, where one or a few values occur far more often than the rest. To choose an efficient execution plan, PostgreSQL’s planner estimates row counts using statistics such as n_distinct, histograms, and most-common-values (MCV) frequencies stored in pg_statistic. If these estimates are off—especially if selectivity is overestimated—query performance can degrade significantly. Baseline Query and Observed Planner Behavior To gauge the performance impact of data skew with PGv17, we generated data into 3 tables relating to Accounts, Purchases and Audit. This baseline query selects a single random HIGH_RISK account and retrieves matching rows from a large audit_logs table via two materialized CTEs. Original Query with selected_account as materialized ( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1 ), audit_purchases as materialized ( select distinct p.account_id from purchases p join selected_account sa on p.account_id = sa.account_id ) select al.* from audit_logs al where exists ( select 1 from audit_purchases ap where al.account_id = ap.account_id); PostgreSQL 14 Execution Plan PostgreSQL 14 treats materialized CTEs as optimization fences and does not propagate detailed statistics into the parent query. As a result, the planner assumes relatively small CTE result sets and consistently favors nested loop joins with parameterized index scans. While these estimates are not strictly accurate, this conservative planning approach prevents full-table scans and avoids catastrophic performance under extreme data skew. In this workload, PostgreSQL 14 completes the query in approximately 7 ms. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ Nested Loop (cost=1412.94..212757485.16 rows=1500600064 width=27) (actual time=5.226..7.192 rows=30 loops=1) Buffers: shared hit=277 read=13 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.460..4.460 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.459..4.459 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.011..3.533 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> HashAggregate (cost=317.44..355.57 rows=3813 width=4) (actual time=4.735..4.741 rows=1 loops=1) Group Key: p.account_id Batches: 1 Memory Usage: 217kB Buffers: shared hit=274 read=1 -> Nested Loop (cost=0.44..292.52 rows=9966 width=4) (actual time=4.722..4.726 rows=10 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.461..4.461 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..192.84 rows=9966 width=4) (actual time=0.260..0.262 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> HashAggregate (cost=85.79..87.79 rows=200 width=4) (actual time=4.743..4.744 rows=1 loops=1) Group Key: ap.account_id Batches: 1 Memory Usage: 40kB Buffers: shared hit=274 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..76.26 rows=3813 width=4) (actual time=4.737..4.742 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..817780.34 rows=24600001 width=27) (actual time=0.480..2.438 rows=30 loops=1) Index Cond: (account_id = ap.account_id) Buffers: shared hit=3 read=12 Planning Time: 1.121 ms Execution Time: 7.253 ms (37 rows) PostgreSQL 17 Execution Plan PostgreSQL 17 correctly estimates that the audit_purchases CTE may contain thousands of rows and that joining it with audit_logs could produce tens of millions of matches. Based on these estimates, the planner selects a Hash Semi Join with a Sequential Scan on audit_logs, which is optimal when many rows are expected to match. However, at runtime the selected account produces only ~30 matching audit rows. Despite this, PostgreSQL must execute the chosen plan fully, scanning millions of rows from disk. This results in an execution time of 521,625 ms (~8.7 minutes). Importantly, this behavior is not a cost estimation bug—the estimates are statistically correct. The failure occurs because the PostgreSQL 17 planner estimates a large matching result set based on the cardinality of values produced by the audit_purchases CTE, and consequently chooses a plan optimized for that estimated volume rather than for the single runtime-selected account. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Hash Semi Join (cost=1691.60..93353890.16 rows=3001396992 width=26) (actual time=86577.858..521625.286 rows=30 loops=1) Hash Cond: (al.account_id = ap.account_id) Buffers: shared hit=28879 read=22040419 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.486..4.487 rows=1 loops=1) Buffers: shared hit=274 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.485..4.486 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=274 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.021..3.551 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> HashAggregate (cost=658.72..673.28 rows=1456 width=4) (actual time=4.907..4.912 rows=1 loops=1) Group Key: p.account_id Batches: 1 Memory Usage: 73kB Buffers: shared hit=277 read=1 -> Nested Loop (cost=0.44..606.86 rows=20742 width=4) (actual time=4.898..4.902 rows=10 loops=1) Buffers: shared hit=277 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.487..4.487 rows=1 loops=1) Buffers: shared hit=274 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=0.410..0.411 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> Seq Scan on audit_logs al (cost=0.00..52082989.92 rows=3001396992 width=26) (actual time=0.012..286755.571 rows=3001400050 loops=1) Buffers: shared hit=28602 read=22040418 -> Hash (cost=29.12..29.12 rows=1456 width=4) (actual time=4.919..4.920 rows=1 loops=1) Buckets: 2048 Batches: 1 Memory Usage: 17kB Buffers: shared hit=277 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..29.12 rows=1456 width=4) (actual time=4.908..4.911 rows=1 loops=1) Buffers: shared hit=277 read=1 Planning: Buffers: shared hit=197 Planning Time: 1.831 ms Execution Time: 521625.433 ms (38 rows) Forced nested loop Disabling hash joins at the session level forces PostgreSQL 17 to revert to a nested loop plan using index scans. This diagnostic step completes in ~8 ms, confirming that index-based execution is sufficient for the actual runtime workload. However, disabling planner features is not suitable for production use and serves only to validate the root cause. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=1677.62..3361930001.36 rows=3001396992 width=26) (actual time=5.513..8.194 rows=30 loops=1) Buffers: shared hit=277 read=12 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.524..4.524 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.523..4.523 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.014..3.608 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> HashAggregate (cost=658.72..673.28 rows=1456 width=4) (actual time=4.890..4.894 rows=1 loops=1) Group Key: p.account_id Batches: 1 Memory Usage: 73kB Buffers: shared hit=274 read=1 -> Nested Loop (cost=0.44..606.86 rows=20742 width=4) (actual time=4.882..4.885 rows=10 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.525..4.525 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=0.354..0.356 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> HashAggregate (cost=32.76..47.32 rows=1456 width=4) (actual time=4.896..4.899 rows=1 loops=1) Group Key: ap.account_id Batches: 1 Memory Usage: 73kB Buffers: shared hit=274 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..29.12 rows=1456 width=4) (actual time=4.892..4.895 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..1782455.82 rows=52656088 width=26) (actual time=0.615..3.283 rows=30 loops=1) Index Cond: (account_id = ap.account_id) Buffers: shared hit=3 read=11 Planning Time: 0.184 ms Execution Time: 8.252 ms (37 rows) Fix Strategies While the fix isn’t straightforward, we attempt to modify the query plan through query rewrite using following methods: Lateral Join with Offset 0 Limit on Purchases Using subqueries Lateral Join with Offset 0 Adding OFFSET 0 inside a LATERAL subquery introduces an optimizer barrier that prevents join reordering. This forces PostgreSQL to execute a parameterized nested loop, passing the runtime account ID into an index scan on audit_logs. With this rewrite, PostgreSQL 17 executes the query in 6.519 ms, a 99.9988% improvement over the original plan. with selected_account as materialized ( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1 ), audit_purchases as materialized ( select p.account_id from purchases p join selected_account sa on p.account_id = sa.account_id ) select al.* from audit_purchases ap join lateral ( select * from audit_logs al where al.account_id = ap.account_id offset 0) al on true; Execution plan QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=1578.44..42162640233.98 rows=1092192577296 width=26) (actual time=4.980..6.481 rows=300 loops=1) Buffers: shared hit=403 read=12 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.296..4.297 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.295..4.296 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.013..3.357 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> Nested Loop (cost=0.44..606.86 rows=20742 width=4) (actual time=4.585..4.588 rows=10 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.298..4.298 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=0.286..0.287 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..414.84 rows=20742 width=4) (actual time=4.587..4.591 rows=10 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..1506157.19 rows=52656088 width=26) (actual time=0.040..0.185 rows=30 loops=10) Index Cond: (account_id = ap.account_id) Buffers: shared hit=129 read=11 Planning: Buffers: shared hit=8 Planning Time: 0.238 ms Execution Time: 6.519 ms (31 rows) Limit on Purchases Applying LIMIT 1 to the audit_purchases CTE bounds its cardinality to a single row. This makes nested loops cheaper than hash joins during planning. PostgreSQL 17 selects an index‑driven execution plan and completes in 7.742 ms, yielding a 99.9985% improvement. with selected_account as materialized ( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1 ), audit_purchases as materialized ( select p.account_id from purchases p join selected_account sa on p.account_id = sa.account_id limit 1 ) select al.* from audit_logs al where exists (select 1 from audit_purchases ap where al.account_id=ap.account_id); Execution plan QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=972.07..933697.99 rows=24600001 width=27) (actual time=5.693..7.686 rows=30 loops=1) Buffers: shared hit=276 read=13 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.816..4.817 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.816..4.817 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.014..3.825 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> Limit (cost=0.44..0.47 rows=1 width=4) (actual time=5.133..5.134 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Nested Loop (cost=0.44..292.52 rows=9966 width=4) (actual time=5.132..5.132 rows=1 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.817..4.817 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..192.84 rows=9966 width=4) (actual time=0.313..0.313 rows=1 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> HashAggregate (cost=0.02..0.03 rows=1 width=4) (actual time=5.138..5.139 rows=1 loops=1) Group Key: ap.account_id Batches: 1 Memory Usage: 24kB Buffers: shared hit=274 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..0.02 rows=1 width=4) (actual time=5.135..5.135 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..686726.47 rows=24600001 width=27) (actual time=0.552..2.534 rows=30 loops=1) Index Cond: (account_id = ap.account_id) Buffers: shared hit=2 read=12 Planning Time: 0.216 ms Execution Time: 7.742 ms (35 rows) Using subqueries Rewriting the query using scalar subqueries converts the selected account ID into an InitPlan. The resulting value is then used as a runtime parameter in index scans against audit_logs. This eliminates join reordering opportunities and guarantees a parameterized access path. Both scalar subquery variants complete in approximately 7.3–7.6 ms, achieving over 99.9986% improvement. Subquery Example 1 select * from audit_logs al where al.account_id in (select p.account_id from purchases p where p.account_id=( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1)); Execution plan QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop Semi Join (cost=972.02..1302907.98 rows=24600001 width=27) (actual time=5.257..7.260 rows=30 loops=1) Buffers: shared hit=277 read=12 InitPlan 1 (returns $0) -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.612..4.613 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.611..4.611 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.012..3.674 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..686755.98 rows=24600001 width=27) (actual time=5.067..7.059 rows=30 loops=1) Index Cond: (account_id = $0) Buffers: shared hit=274 read=11 -> Materialize (cost=0.44..242.67 rows=9966 width=4) (actual time=0.006..0.006 rows=1 loops=30) Buffers: shared hit=3 read=1 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..192.84 rows=9966 width=4) (actual time=0.187..0.188 rows=1 loops=1) Index Cond: (account_id = $0) Heap Fetches: 0 Buffers: shared hit=3 read=1 Planning Time: 0.160 ms Execution Time: 7.286 ms (24 rows) Subquery Example 2 select * from audit_logs al where al.account_id=(select distinct p.account_id from purchases p where p.account_id=( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1)); Execution plan QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ Index Scan using idx_account_audit on audit_logs al (cost=972.04..1507128.65 rows=52656088 width=26) (actual time=5.510..7.637 rows=30 loops=1) Index Cond: (account_id = (InitPlan 2).col1) Buffers: shared hit=277 read=12 InitPlan 2 -> Limit (cost=971.44..971.46 rows=1 width=4) (actual time=4.901..4.902 rows=1 loops=1) Buffers: shared hit=274 read=1 InitPlan 1 -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.455..4.456 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.455..4.455 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.018..3.522 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=4.900..4.900 rows=1 loops=1) Index Cond: (account_id = (InitPlan 1).col1) Heap Fetches: 0 Buffers: shared hit=274 read=1 Planning: Buffers: shared hit=4 Planning Time: 0.154 ms Execution Time: 7.663 ms (25 rows) Query execution summary The table below summarizes execution time improvements across all tested rewrites relative to the original PostgreSQL 17 and 14 query execution plans. Setup / Rewrite Execution Time (ms) Speed‑up Factor (×) PG17 original (Hash Semi Join + Seq Scan) 521,625.433 1.0× PG14 original query shape 7.253 71,919× Forced nested loop (hashjoin off – diagnostic) 8.252 63,212× LATERAL JOIN + OFFSET 0 6.519 80,016× LIMIT on purchases 7.742 67,376× Subquery – IN with InitPlan (Example 1) 7.286 71,593× Subquery – scalar subquery (Example 2) 7.663 68,071× Key Takeaways While PostgreSQL 17’s planner improvements are correct, intentional, and beneficial for most workloads. However, queries that combine CTEs, runtime randomness, and heavily skewed data can cause the planner to select globally optimal plans that are locally inefficient. When runtime values drastically narrow result sets, query rewrites that force parameterized index scans—such as LATERAL joins, scalar subqueries, or bounded CTEs—provide predictable and stable performance.281Views0likes0CommentsPgBouncer Best Practices in Azure Database for PostgreSQL – Part 1
Introduction Connection pooling is critical for scaling PostgreSQL workloads efficiently, especially in managed environments like Azure Database for PostgreSQL. PgBouncer, a lightweight connection pooler, helps manage thousands of client connections without overwhelming the database. Connection pooling is very important when managing multiple concurrent database requests, as PostgreSQL uses a process-per-connection model, which means too many active connections can: Increase context switching overhead Consume excessive CPU/memory Degrade performance under load PgBouncer addresses this by limiting active server connections and queuing the additional client requests. However, misconfiguring key settings such as default_pool_size can still lead to CPU/memory pressure, connection bottlenecks, and degraded performance. Careful planning and tuning are essential to avoid these pitfalls. Understanding connection pools Before diving into tuning, it’s important to understand how PgBouncer organizes connections: PgBouncer creates a separate pool for each unique (database, user) combination. For example: If you have 2 application roles/users connecting to 2 databases. In this scenario, PgBouncer will allocate 4 pools. Each pool maintains its own number of connections, determined by default_pool_size. So, the total number of potential server connections is: number_of_pools × default_pool_size This is why sizing default_pool_size correctly is critical. Azure PgBouncer defaults Azure Database for PostgreSQL comes with preconfigured PgBouncer settings optimized for most workloads. Understanding these defaults is essential before making any tuning changes: pool_mode: TRANSACTION (default in Azure; best for most workloads) default_pool_size: 50 (range: 1–4950) max_client_conn: 5000 (range: 1–50000) Transaction mode support for prepared statements PgBouncer now enables support for PostgreSQL PREPARED STATEMENTS when combined together with TRANSACTION mode pooling. Previously, in transaction mode cached plans were difficult to manage, as there was no way for PgBouncer to confirm whether a new connection allocated from the pool would benefit from any cached plans generated from prior PREPARED STATEMENT operations. To work around this scenario, PgBouncer now provides a parameter which controls how many globally cached plan statements remain in memory for any pooled connection to leverage. max_prepared_statements: 200 (range: 0-5000) PostgreSQL connection limits For large tiers (e.g., 96 vCores), the default max_connections is 5000, with 15 reserved for system use. That means 4985 user connections are available. For more details, see maximum connection. Sizing best practices Proper sizing ensures optimal performance and resource utilization. Here’s how to approach it: 1. Use transaction pooling Start by confirming that pool_mode = TRANSACTION is enabled. This is already the Azure default and provides the best pooling efficiency for most web applications. If your application is using prepared statements, ensure you configure max_prepared_statements accordingly. 2. Determine your maximum active concurrent database operations (max_concurrent_ops) Next, you need to estimate how many total concurrent active PostgreSQL backends your instance can maintain: For CPU-bound OLTP workloads: keep max_concurrent_ops near 1.5x -2x the number of CPU vCores. For I/O-heavy workloads: stay slightly higher than vCore count. Rule of thumb for 96 vCores: max_concurrent_ops ≈ 144–192. 3. Divide across pools Once you’ve estimated your max_concurrent_ops value, the next step is to distribute your capacity across all connection pools. default_pool_size ≈ max_concurrent_ops / number_of_pools Example: max_concurrent_ops = 144 number_of_pools = 4 default_pool_size = 144 / 4 = 36 Sample configuration To illustrate how these calculations translate into real-world settings, here’s a sample PgBouncer configuration tuned for a scenario with four pools and an Active_Backend_Target of 144. pool_mode = transaction default_pool_size = 36 ; tuned for 4 pools max_client_conn = 5000 Quick reference table For quick planning, the following table provides starting recommendations based on common Azure Database for PostgreSQL SKU sizes. Use these as a baseline and adjust according to your workload metrics. SKU Size Memory Default max_connections Pools Suggested max_concurrent_ops Starting default_pool_size 8 vCores 32 GiB 3437 2 12–16 6–12 16 vCores 64 GiB 5000 2 24–32 12–20 32 vCores 128 GiB 5000 2 48–64 30–40 48 vCores 192 GiB 5000 2 72–92 40–60 64 vCores 256 GiB 5000 2 96–128 50–70 96 vCores 384–672 GiB 5000 2 144–192 60–80 For all tiers ≥16 vCores, max_connections is capped at 5000 (with 15 reserved for system use). Notes: default_pool_size = max_concurrent_ops / number_of_pools These values are starting recommendations. You should validate them against actual workload metrics and adjust gradually. Always ensure: (number_of_pools × default_pool_size) < max_connections − 15 (reserved system slots) Monitoring and tuning After applying your configuration, continuous monitoring is key. Here’s how: Use PgBouncer metrics in Azure Monitor to track active, idle, and waiting connections. Run SHOW POOLS; for real-time stats; watch cl_waiting vs sv_idle. For detailed monitoring and management, visit the Admin Console. Recommended Alerts: Alert if waiting client connections > 0 while idle server connections = 0 (indicates pool exhaustion—consider increasing default_pool_size). Alert if active server connections approach the configured default_pool_size (may indicate need for tuning). Alert if max_client_conn utilization exceeds 80% (risk of client-side connection errors). Tip: If waiting client connections grow while idle server connections are zero, increase default_pool_size cautiously. Review performance regularly and adjust gradually. Common pitfalls Avoid these mistakes when configuring PgBouncer: Changing pool mode to SESSION by default: transaction pooling is better for most apps. Session mode will not release connections until the session is ended. Ignoring pool count: multiplying a large default_pool_size by many pools can exhaust connections. Confusing max_client_conn with Postgres capacity: PgBouncer can accept many more clients than the server concurrent processes can support, any client connections not being processed will be waiting for resources. Tuning without data: always review metrics before changes. Conclusion Choosing the right default_pool_size in Azure Database for PostgreSQL with PgBouncer is about balancing performance and resource efficiency. With built-in PgBouncer in Flexible Server, you can enable connection pooling with a single parameter making it easy to get started quickly. The default settings are optimized for most workloads, and as your requirements grow, you can further tune parameters like default_pool_size and max_connections to suit your needs. By understanding your workload, estimating an active concurrent operations, dividing it across pools while respecting PostgreSQL limits, and continuously monitoring and adjusting based on real data, you can achieve a stable, scalable, and cost-effective environment. Further reading For more in-depth guidance and real-world scenarios on PgBouncer configuration and tuning in Azure Database for PostgreSQL, explore the following resources: Leverage built-in PgBouncer in Flexible Server Monitoring PgBouncer in Azure PostgreSQL Flexible Server Identify and solve connection performance in Azure Postgres Not all Postgres connection pooling is equal Connection handling best practice with PostgreSQLPotential Consequences of Using Postgres as a Job Queue
Introduction At small scale, using Postgres as a job queue is totally fine, and I’d even say it’s the right call. Fewer moving parts, one less system to manage, ACID guarantees on your jobs. What’s not to love? The problem is that “small scale” has a ceiling, and the ceiling is lower than most people expect. When you’ve got thousands of concurrent workers hammering a jobs table with SELECT ... FOR UPDATE SKIP LOCKED , things start to behave in ways that aren’t obvious from the application layer. CPU usage creeps up. Also vacuum sometimes can’t keep up. Finally, in the wait event stats, you start seeing ominous entries like LWLock:MultiXactSLRU stacking up across many backends. This pattern has tripped up teams more than a few times, and it usually plays out the same way: everything works fine in dev and staging, then goes off a cliff in production once the concurrency gets real. So let’s dig into why this happens, and what the alternatives look like. The Typical Pattern When using Postgres as a job queue, the standard approach looks something like this: CREATE TABLE job_queue ( id bigserial PRIMARY KEY, status text NOT NULL DEFAULT 'pending', payload jsonb NOT NULL, created_at timestamptz NOT NULL DEFAULT now(), locked_by text, locked_at timestamptz ); CREATE INDEX idx_job_queue_status ON job_queue (status) WHERE status = 'pending'; Workers grab jobs with: UPDATE job_queue SET status = 'processing', locked_by = 'worker-42', locked_at = now() WHERE id = ( SELECT id FROM job_queue WHERE status = 'pending' ORDER BY created_at LIMIT 1 FOR UPDATE SKIP LOCKED ) RETURNING *; And then mark them done: UPDATE job_queue SET status = 'completed' WHERE id = $1; Some users may DELETE the row entirely. Either way, the lifecycle is: insert, lock-and-update, update-or-delete. Repeated thousands of times per second. At low concurrency, this works very smoothly. SKIP LOCKED means workers don’t block each other waiting for the same row. Postgres handles the locking, visibility, and ordering. It’s elegant. So where does it break? The MultiXact SLRU Problem When multiple transactions hold locks on the same row, Postgres stores the set of lockers as a MultiXact ID – a pointer into a side structure under pg_multixact/ . With SELECT ... FOR UPDATE SKIP LOCKED , users might think MultiXacts aren’t involved – after all, SKIP LOCKED is supposed to avoid contention. But in practice, with many concurrent workers all racing to lock rows, there are brief windows where multiple transactions reference the same row before one of them “wins” and the others skip. If you combine this with any FOR SHARE or FOR KEY SHARE locks (which are commonly created implicitly by foreign key checks), MultiXact IDs start accumulating quickly. The MultiXact data lives in SLRU buffers (Simple Least Recently Used) – a small, fixed-size shared memory cache. When backends need to read or write MultiXact data, they acquire LWLocks to access these buffers. Under high concurrency, this becomes a bottleneck: wait_event_type | wait_event -----------------+------------------- LWLock | MultiXactMemberSLRU LWLock | MultiXactOffsetSLRU You’ll see dozens or hundreds of backends piled up on these waits. The SLRU cache is small (by design – it’s a fixed number of pages in shared memory), and when the working set of MultiXact lookups exceeds what fits in the cache, you get constant eviction and re-reads from disk. Every lock acquisition and release on a job row potentially triggers a MultiXact SLRU lookup, and at thousands of concurrent sessions, those lookups serialize on LWLocks. The result: CPU gets pegged, throughput collapses, and latency spikes – not because the queries are expensive, but because the locking infrastructure itself is overwhelmed. Bloat: The Silent Killer The other side of this coin is table and index bloat. Every job row goes through multiple updates (and possibly a delete), and each of those operations creates a new tuple version in the heap. The old versions stick around until VACUUM cleans them up. On a busy job queue table: Dead tuples accumulate faster than autovacuum can clean them. By the time autovacuum finishes one pass, tens of thousands of new dead tuples have appeared. The table grows and grows. Index bloat compounds the problem. Every index on the table also accumulates dead entries. The partial index on status = 'pending' gets thrashed especially hard, since rows constantly enter and leave that condition. Sequential scans get slower. As the table bloats, even index scans start doing more I/O because the heap pages are sparsely populated. Vacuum reclaims space at the end of the table, but can’t reclaim space in the middle (unless the pages are completely empty). Job queue tables can grow to tens of gigabytes when the actual “live” data was only a few megabytes. It makes everything slower: scans, vacuum, even pg_dump. You can mitigate this by running vacuum more aggressively (lower autovacuum_vacuum_scale_factor , higher autovacuum_vacuum_cost_limit ), or by partitioning the table and dropping old partitions. But at some point, you’re fighting the fundamental mismatch between MVCC’s design goals and the write pattern of a job queue. CPU and Lock Overhead Beyond the SLRU contention and bloat, there’s just the raw overhead of using Postgres’s full transactional machinery for what is essentially a FIFO dispatch operation: Every lock/unlock is a full WAL-logged transaction. Grabbing a job writes WAL. Marking it complete writes WAL. Deleting it writes WAL. On a system processing thousands of jobs per second, the WAL volume from the job queue alone can saturate your wal_writer and checkpoint processes. SKIP LOCKED still touches rows. The name suggests rows are skipped, but Postgres still has to find them, check their lock status, and move on. With high concurrency, many workers end up scanning past the same locked rows before finding one they can claim. This is wasted CPU. Snapshot management overhead also becomes an issue. Each transaction needs a consistent snapshot, and with thousands of concurrent transactions, the ProcArray (the structure that tracks active transactions) becomes a contention point itself. You might see LWLock:ProcArrayLock waits alongside the MultiXact ones. Vacuum contention. While vacuum is cleaning up dead tuples, it needs locks too. On a table under constant write pressure, vacuum can interfere with the workers and vice versa. I’ve seen systems where disabling autovacuum on the job queue table improved throughput in the short term. Better Alternatives So what should you use instead? It depends on your requirements, but there are several options that handle high-throughput job dispatch more gracefully than a Postgres table. Advisory Locks (Staying in Postgres) If you want to stay within Postgres and avoid adding infrastructure, advisory locks are worth considering for certain queue patterns. Instead of locking rows, you lock on an abstract numeric key: -- Worker tries to acquire a lock on the job ID SELECT pg_try_advisory_lock(id) FROM job_queue WHERE status = 'pending' ORDER BY created_at LIMIT 1; Advisory locks are lightweight – they don’t touch the heap, don’t create MultiXact entries, and don’t generate dead tuples. They live entirely in shared memory. The trade-off is that you lose the atomicity of FOR UPDATE SKIP LOCKED : you need to handle the case where a lock is acquired but the job processing fails, and you need to release the lock explicitly (or rely on session-end cleanup). This approach works well when the queue depth is manageable and you want to avoid the MVCC overhead. But it’s still Postgres, so you’re still subject to connection limits, ProcArray overhead, and general resource contention at very high session counts. pgq (Skytools) pgq is purpose-built for exactly this problem. It’s a queue implementation that sits inside Postgres but uses a batching model that avoids most of the row-level locking and MVCC pitfalls. Events are written to a queue table, but consumers read them in batches and the queue maintenance is done via a ticker process that manages rotation. The key advantages: No row-level contention. Consumers don’t lock individual rows. Built-in batch processing. Events are consumed in chunks, reducing transaction overhead. Efficient cleanup. Old events are rotated out rather than vacuumed row-by-row. The downside is that pgq is not as actively maintained as it once was, and it adds operational complexity (the ticker daemon, consumer registration, etc.). But for teams already deep in the Postgres ecosystem, it’s a battle-tested option. Redis For many teams, Redis is the natural choice for job queues. Using Redis lists (BRPOPLPUSH or the Streams API), you get: Sub-millisecond dispatch latency. No disk I/O, no MVCC, no vacuum. Atomic pop operations. Workers grab jobs without any locking protocol. Simple scaling. Redis handles thousands of concurrent consumers trivially. The trade-off is durability. Redis can persist to disk, but it’s not ACID. If Redis crashes between a pop and the job completing, you might lose or duplicate work (though Redis Streams with consumer groups mitigate this significantly). For most job queue use cases, at-least-once delivery is acceptable, and Redis does that well. Kafka For truly high-throughput, distributed workloads, Apache Kafka is the heavyweight option. Kafka partitions give you parallel consumption with ordering guarantees per partition, durable storage, and replay capability. It’s the right tool when: You need to process thousands of events per second Multiple consumers need to read the same events You want event replay or audit trails Your architecture is already event-driven The operational overhead is nontrivial – ZooKeeper (or KRaft), brokers, topic management, consumer group coordination. But for teams already running Kafka for other reasons, adding a job queue topic is practically free. Choosing the Right Tool Here’s a rough decision guide: Under 100 concurrent workers, simple jobs, Postgres with SKIP LOCKED is fine Moderate concurrency, want to stay in Postgres, Advisory locks or pgq High throughput, low-latency dispatch, Redis (Lists or Streams) Massive scale, distributed, event replay, Kafka Many teams that start with Postgres (reasonably) hit scaling problems and then try to fix Postgres rather than recognizing that the workload has outgrown the tool. They throw more autovacuum workers at it, increase max_connections , add connection poolers – all of which help at the margins, but don’t address the fundamental issue: Postgres’s MVCC and locking machinery wasn’t designed for this access pattern at high concurrency. Conclusion Postgres is great, but it can’t be the best tool for every job. Using it as a job queue is a perfectly valid choice when your scale is modest. But when you’re running thousands of concurrent workers, the combination of MultiXact SLRU contention, heap bloat, vacuum pressure, and raw locking overhead will eventually push you toward a purpose-built solution. The good news is that you don’t have to rip out everything. Advisory locks can buy you headroom without adding infrastructure. Redis can handle dispatch while Postgres keeps owning the data. And if you’re already using Kafka, a job topic is a natural fit. Take your pick – there are many queueing options out there!369Views4likes0Comments