performance
561 TopicsApril 2026 Recap: Azure Database for PostgreSQL
April brought several updates for Azure Database for PostgreSQL, focused on improving developer productivity, strengthening security and connectivity, and helping customers scale and optimize their PostgreSQL workloads. From new Entra ID token refresh libraries across .NET, JavaScript, and Python to simplify authentication, to guidance on migrating from VNet to Private Endpoint capable configurations, we continue to make it easier to build and manage secure applications. We also introduced enhancements to the PostgreSQL VS Code extension and published deep dives on query performance, data modeling, and real-world scaling patterns. We also published a blog on how PostgreSQL enters its AI era, which explores ways with which developers can adapt PostgreSQL to meet the needs of AI-driven and rapidly growing applications, with practical guidance on running and scaling PostgreSQL more effectively in these evolving workloads. POSETTE 2026 Before we dive deeper into the feature updates, POSETTE: An Event for Postgres 2026 is just around the corner, PostgreSQL’s free, virtual conference bringing together the global community. Taking place from June 16–18, the event will feature four livestream tracks with a strong lineup of content, including 44 sessions, 2 keynotes, and 50 speakers. It’s a great opportunity to hear from PostgreSQL experts, learn about the latest trends, and discover real-world best practices across a wide range of topics. Register today for updates and be part of three days of learning, insights, and community-driven discussions across a wide range of PostgreSQL topics. Features Entra-ID token refresh libraries for .NET, JavaScript, and Python: Preview Migrating from VNet to Private Endpoint: Preview New enhancements in the PostgreSQL VS Code Extension Improving Query Performance and Modeling in PostgreSQL Scaling PostgreSQL for Real-World Application Workloads Learning Bytes: Preventing accidental server deletion Entra-ID Token refresh libraries: .NET, JavaScript and Python We’ve introduced Entra ID token refresh libraries for .NET, JavaScript, and Python to simplify how applications authenticate with Azure Database for PostgreSQL using Entra ID. When using Entra ID–based authentication, access tokens are short-lived and need to be refreshed periodically. This often requires additional logic in the application to handle expiration, retries, and reconnection scenarios. These new libraries take care of that complexity by automatically refreshing tokens behind the scenes, so applications can maintain uninterrupted database connections without custom token management. With built-in support for token renewal, these libraries help: Reduce the need for manual token refresh logic in your application code Improve reliability for long-running or connection-pooled workloads Simplify adoption of Entra ID authentication across different language stacks Whether you're building new applications or migrating existing ones to use Entra ID, these libraries make it easier to integrate secure, passwordless authentication while keeping connection handling straightforward. Migrating from VNet to Private Endpoint Azure Database for PostgreSQL flexible server can now be migrated from a VNet‑integrated deployment to a network configuration that supports Private Endpoint connectivity. Servers originally deployed inside a VNet may require greater flexibility in networking management. Private Endpoints provide a simpler and more scalable model. Following migration, private access to the server continues over Azure’s backbone network, dependency on delegated subnets is reduced, and database networking can be better aligned with evolving architectural or organizational standards. The migration can be initiated through Azure CLI, API, or SDK and is designed to be straightforward. Although the operation involves a period of downtime, it enables adoption of Private Endpoint connectivity without recreating the server or manually moving data. After migration, Private Endpoints or firewall rules can be configured based on the desired access model, and infrastructure-as-code templates can be updated accordingly. Read more here: Migrate from VNet to a Private Endpoint Capable Network Configuration | Microsoft Learn New enhancements in the PostgreSQL VS Code Extension The latest release (v1.21) of the PostgreSQL VS Code extension delivers enhancements to query authoring and analysis workflows, improved cross-extension interoperability, reliability improvements across Object Explorer and connection management, and a set of targeted bug fixes. Schema-Aware Query Creation: You can now open a new query directly from a schema in Object Explorer, automatically setting the appropriate search_path so unqualified object names resolve correctly without additional setup. Query Plan Visualization Enhancements: The query plan visualizer now uses PostgreSQL-specific node icons across all views, making it easier to identify scan, join, and aggregate operations during performance analysis. Improved Multi-Extension Compatibility: The extension now coordinates editor ownership with the MSSQL extension when both are installed, reducing duplicate UI actions and avoiding conflicts in query execution workflows. Object Explorer Reliability Improvements: The Object Explorer has been refactored for more consistent refresh, expansion, and reconnection behavior, especially in long-running sessions and databases with many schemas. Enhanced IntelliSense Behavior: IntelliSense now respects the configured search_path, improving the relevance of suggestions and helping you work more efficiently across schemas. Bug Fixes: This release includes fixes across object scripting (including partitioned tables), connection profile handling, Docker container creation, and initial extension setup for improved reliability and stability. Improving Query Performance and Modeling in PostgreSQL This month, we also shared a set of technical blogs highlighting advanced PostgreSQL scenarios and practical guidance for real-world workloads: Guide on workload observability with Query store: This blog dives into how Query Store can be used to gain end-to-end visibility into query performance across both primary and replica nodes. It highlights the importance of understanding query behavior in distributed setups and how bottlenecks can surface differently across nodes. The post also shares practical guidance on using these insights to troubleshoot issues and optimize workload performance effectively. Guide on Common Table Expressions(CTEs) with Data Skew: This deep dive unpacks a complex query planning scenario in PostgreSQL v17, where data skew can lead to unexpected and suboptimal execution plans involving CTEs. It explains why the optimizer may choose inefficient plans and how this impacts real-world workloads. The blog also outlines strategies to diagnose and mitigate these issues, helping users better predict and tune query performance. Guide on PostgreSQL as a Graph Database: This blog explains how PostgreSQL can be leveraged to model and query graph-like relationships, making it highly relevant for AI-driven applications. It demonstrates how relational capabilities can be extended to support graph workloads without introducing additional systems. The post also highlights practical patterns and use cases that enable developers to build more connected, intelligent applications using PostgreSQL as a unified data platform. Scaling PostgreSQL for Real-World Application Workloads Alongside performance tuning and data modeling topics, we also explored how PostgreSQL behaves under real-world application patterns especially in scenarios involving high concurrency, background job processing, and connection-heavy workloads. These blogs focus on common architectural choices developers make and the trade-offs to consider when scaling reliably. Guide on using Postgres as a Job Queue: Thisblog takes a deeper look at the implications of using PostgreSQL as a job queue, a pattern commonly adopted for simplicity and tighter integration. It walks through how queue-like workloads can introduce contention due to frequent updates, row locking, and long-running transactions. The post highlights how these patterns can impact throughput, vacuum efficiency, and overall database health as scale increases. It also discusses when this approach is appropriate, and when teams should consider dedicated queuing systems to avoid performance bottlenecks. Guide on Connection Scaling with Elastic Clusters: This blog dives into the challenges of handling large volumes of concurrent connections, which is a common bottleneck for modern, microservices-based applications. It explains how Elastic Clusters help distribute connections and workload across multiple nodes, improving scalability and resilience under heavy load. The post also touches on connection management patterns, including pooling strategies, and how they work in conjunction with Elastic Clusters to prevent resource exhaustion and ensure consistent performance at scale. Azure Postgres Learning Bytes 🎓 Preventing accidental server deletion In production environments, accidental deletions can lead to significant downtime and data loss. To safeguard critical resources like Azure Database for PostgreSQL servers, Azure provides resource locks that add an extra layer of protection beyond standard role-based access control (RBAC). A commonly used option is the CanNotDelete (Delete Lock), which ensures that a resource cannot be deleted even by users with elevated permissions until the lock is explicitly removed. You can apply a delete lock easily using the Azure CLI by targeting the specific resource: az lock create --name PreventDelete --lock-type CanNotDelete --resource-group <rg-name> --resource-type Microsoft.DBforPostgreSQL/flexibleServers --resource-name <resource-name></resource-name></rg-name> Once applied, any delete operation on the resource will be blocked, helping prevent accidental or unintended deletions during maintenance, deployments, or testing. Locks can be applied at different levels subscription, resource group, or individual resources allowing flexibility based on your protection needs. For more details and step-by-step guidance, read our blog on Preventing accidental deletion of an Azure PostgreSQL Instance.185Views1like0CommentsEnd-to-end workload observability with Query Store for primary and replicas
Query performance doesn’t stop at the primary Most PostgreSQL architectures don’t run on a single node anymore. Reads get offloaded. Replica chains grow. And when performance issues hit, the hardest part is often simple: where did the queries actually run? With the latest query store capabilities in Azure Database for PostgreSQL flexible server, you can now capture workload executed not just on the primary, but also on read replicas—including cascading read replicas—and export the captured runtime stats, wait stats, and query text into Azure Monitor Logs (Log Analytics workspace / LAWS). See the real hotspot: isolate which node (primary vs replica) is slow. Know why: break down time by waits (CPU, I/O, locks) per query. Connect the dots: correlate query IDs to query text, and inspect sampled parameters locally in azure_sys on the primary when you need input context (parameters aren’t exported to LAWS). Centralize analysis: query everything with KQL in LAWS, across servers. What you’ll build This post walks through a reproducible demo that provisions a primary server, a read replica, and a cascading read replica, then runs a TPC-H–based workload across all three to generate query store data you can analyze locally and in Log Analytics. Enable query store capture (including query text) and parameter sampling for parameterized queries. Enable wait sampling so query store can record wait statistics. Export runtime stats, wait stats, and SQL text to LAWS using resource-specific tables. Validate capture on read replicas and cascading read replicas (not just the primary). Prerequisites Azure CLI logged in (az login) and permission to create a resource group, Log Analytics workspace, and PostgreSQL flexible servers. psql and curl available on your machine. PostgreSQL flexible server on General Purpose or Memory Optimized tier (query store and replicas aren’t supported on Burstable). PostgreSQL 14+ to test out cascading replicas. Networking: the script opens firewall access broadly for demos—tighten for production. Architecture (primary + replica chain + LAWS) You’ll deploy four resources: Primary server: read/write node. Read replica (level 1): read-only node created from the primary. Cascading read replica (level 2): read-only node created from replica level 1. Log Analytics workspace (LAWS): central place to query Query Store telemetry across all nodes. If Diagnostic Settings is properly configured, each server streams query store telemetry to LAWS—but how it’s kept locally differs by role. On the primary, query store data is recorded in-memory, then persisted locally in the azure_sys database, and then exported to LAWS. On read replicas (including cascading replicas), query store data is recorded in-memory only and then exported to LAWS. Bottom line: use LAWS for fleet-wide visibility, and use the primary’s azure_sys when you need deep local inspection (like parameter samples). Deploy the demo environment The fastest way to reproduce the scenario is to run the end-to-end bash script which you can download from https://raw.githubusercontent.com/Azure-Samples/azure-postgresql-query-store/refs/heads/main/may2026/script/query_store_demo.sh Save the file to a local directory in your Linux shell, and name the file query_store_demo.sh. To invoke the script, at minimum, you must assign a string password for the administrator login of the instances of the flexible servers it creates, and invoke the script like this: ADMIN_PASSWORD=<Your_Strong_Password> ./query_store_demo.sh Optionally, you can also override default values for other environment variables used by the script: Variable Purpose Default SUBSCRIPTION_ID Azure subscription ID to use (current default subscription) BASE_NAME Base name for all resources (used in naming servers, resource groups, etc.) pgqswait{YYYYMMDDHHMMSS} RESOURCE_GROUP Azure resource group name rg-{BASE_NAME} LOCATION Azure region for resources southeastasia PRIMARY_SERVER Name of primary PostgreSQL server {BASE_NAME}-primary REPLICA_1 Name of first-level read replica {BASE_NAME}-readreplica REPLICA_2 Name of second-level cascading read replica {BASE_NAME}-cascadereadreplica LOG_ANALYTICS_WORKSPACE Log Analytics workspace name law-{BASE_NAME} LOG_ANALYTICS_LOCATION Azure region for Log Analytics workspace southeastasia ADMIN_USER PostgreSQL admin username pgadmin ADMIN_PASSWORD PostgreSQL admin password (REQUIRED) SKU_NAME PostgreSQL server SKU (compute tier) Standard_D4ds_v5 TIER PostgreSQL pricing tier GeneralPurpose STORAGE_SIZE Storage size in GB 64 VERSION PostgreSQL version (minimum 14 for cascading replicas) 17 PRIMARY_DATABASE Initial database name postgres SQL_BASE_URL Base URL for downloading SQL scripts https://raw.githubusercontent.com/Azure-Samples/azure-postgresql-query-store/refs/heads/main/may2026/script/query_store_demo.sh TPCH_DDL_URL URL for TPC-H schema DDL file {SQL_BASE_URL}/schema/tpch_ddl.sql WORKLOAD_REPETITIONS Number of times to execute each workload query (minimum 5) 10 AUTO_APPROVE Skip confirmation prompt and proceed automatically false If, for example, you want to not only pass the ADMIN_PASSWORD but also override the LOCATION, you could do it like this: ADMIN_PASSWORD=<Your_Strong_Password> LOCATION=canadacentral ./query_store_demo.sh In a bit over 1 hour, the script will do the following steps: Step 1 — Provision first part of the infrastructure The infrastructure provisioned in this phase consists of: A resource group in which all resources are deployed. An instance of Log Analytics workspace, where all flexible server instances will send their query store related logs. A primary (read-write) flexible server. Step 2 — Configure primary server Now it's time to configure one new server parameters on your primary server so that query store emits query text to LAWS, so that we can correlate quey IDs to something recognizable. Query IDs are great for aggregation—but you still need the SQL. Turn on query text emission so you can correlate runtime and waits back to the actual statement text. Do this by setting pg_qs.emit_query_text to on. Refer to our documentation to learn how to set the value of a server parameter. Step 3 — Provision second part of the infrastructure The infrastructure provisioned in this phase consists of: A read replica (read-only) whose source is the primary server. A cascade read replica (read-only), whose source is the previously created read replica. Notice that when read replicas are created, they inherit the server parameter values from their source server. Because we have configured query store related settings on the primary server already, the intermediate read replica inherits its server parameters from that primary, and the cascade read replica inherits them from the intermediate replica. Step 4 — Export query store to Log Analytics (LAWS) Now for the payoff, we want to stream the data to Log Analytics so you can query across nodes, build dashboards, and alert. The script configures diagnostic settings on the primary and both replicas to send logs to a Log Analytics workspace using resource-specific tables. This is the key to cross-node visibility: each server exports its own captured telemetry, and you can slice by resource in a single KQL query. Query store runtime stats: execution counts, elapsed time, and other performance counters. Query store wait stats: wait breakdown attributed to queries. Query store SQL text: query text to decode query IDs. Note: Query store parameter samples are not included in the Log Analytics export. Parameters are stored locally per server in azure_sys, and on read replicas azure_sys is read-only—so don’t depend on replicas for parameter inspection. LAWS receives runtime stats, wait stats, and query text. Diagnostics settings for an instance of flexible server can be configured via portal. In the resource menu, under Monitoring, select Diagnostic settings. Add a new diagnostic setting, select a destination Log Analytics workspace, and the individual log catergories which you want to stream to that LAWS, and save the changes. For Destination table it's highly recommended to use Resource specific (one table per signal with proper schema) over Azure diagnostics (legacy one table for everything). With Azure diagnostics, all logs from all resource types land into a single table (AzureDiagnostics). It's a wide table with many columns. New columns get added as services emit new fields. If the 500 column limit is hit, extra fields go into the AdditionalFields column (a dynamic JSON). Querying on attributes stored in that column might have huge performance and query cost impact. The schema is inconsistent and difficult to discover. You must always filter events in that table by ResourceType and Category. On the other hand, with Resource specific, logs are written to separate tables per resource type and category. Therefore, each table has a well-defined schema and columns are strongly typed. Tables are smaller and faster to query. Queries on these tables are simpler don't need filtering by ResourceType and Category. Performance-wise, they also support faster ingestion and faster querying. They also support selecting different table plans and retention settings for each table. And, more importantly, role-based access control (RBAC) permissions can be applied at table level, allowing you to control access to telemetry in a more granular way. Note: If you want to see any of the images in this article in better quality, click on them to see them in their original size. This can also be configured using Azure CLI command az monitor diagnostic-settings create. Make sure that the --export-to-resource-specific parameter is set to true, which is the equivalent of selecting Resource specific for Destination table in portal UI. Setting this parameter to false, would mean that you want to use AzureDiagnostics as the destination table, which we don't recommend using. Step 5 — Run some workload In this phase the script loads a TPC-H schema and executes workload SQL across different nodes so that you can prove replica capture. Query it in Log Analytics Once the workload completed and data was streamed to Log Analytics, you can open your Log Analytics workspace, and start querying the relevant tables. If you don't know how to start issuing queries in a Log Analytics workspace, refer to Get started with log queries in Azure Monitor Logs. In your Log Analytics workspace, when you select Logs in the resource menu, you can access the Queries hub. By default, it should open automatically unless you have configured it to not show, in which case you can open by selecting Queries hub on the top right corner of the Logs home screen. If you add a filter in the queries hub for Resource type equals Azure Database for PostgreSQL Flexible Server, you'll be able to access multiple examples of queries which might help you get started querying the log categories we support for our service. You can run any of them by selecting Run on the summarization card that describes the query or, if you hover the mouse over the card, you can select Load to editor so that the query is copied over to the active query window, and you can run it or modified it further. Following, there are a few more query examples which can be useful to analyze the workload executed in this experiment. Top queries by total time (across all nodes) To get the list of 10 queries with higher duration from the ones that ran on any of the three nodes. KQL PGSQLQueryStoreRuntime | summarize total_time_ms = sum(TotalExecDurationMs) by QueryId, LogicalServerName | top 10 by total_time_ms desc Results Important: Results might be slightly different on each execution of the experiment. Where queries wait on each node List the most frequent wait events observed on user initiated queries across all nodes. KQL PGSQLQueryStoreWaits | join kind=inner (PGSQLQueryStoreRuntime) on QueryId | summarize total_waits_sampled = sum(Calls) by Event, EventType, LogicalServerName | order by total_waits_sampled desc Results Important: Results might be slightly different on each execution of the experiment. Decode query IDs (join runtime stats with SQL text) Top 20 queries the most frequent wait events observed on user initiated queries across all nodes. KQL PGSQLQueryStoreRuntime | join kind=inner (PGSQLQueryStoreQueryText) on QueryId | where QueryType == 'select' | project LogicalServerName, QueryId, TotalExecDurationMs, QueryText | top 20 by TotalExecDurationMs desc Results Important: Results might be slightly different on each execution of the experiment. Compare primary vs replicas (workload distribution) Find total number of query executions and accumulated duration of all those executions for each node. KQL PGSQLQueryStoreRuntime | summarize execs = sum(Calls), total_time_ms = sum(TotalExecDurationMs) by LogicalServerName | order by total_time_ms desc Results Important: Results might be slightly different on each execution of the experiment. Replica-only hotspots (find what’s slow off the primary) Find top 10 queries executed by their aggregated duration, focusing on what was executed on read replicas only. KQL let Replicas = dynamic(["pgqswait20260505220501-readreplica", "pgqswait20260505220501-cascadereadreplica"]); PGSQLQueryStoreRuntime | where LogicalServerName in (Replicas) | summarize total_time_ms = sum(TotalExecDurationMs) by QueryId, LogicalServerName | top 10 by total_time_ms desc Results Important: Results might be slightly different on each execution of the experiment. QPI now supports query store stats collected on replicas You can now use Query Performance Insight workbooks to analyze query store information not only on your primary server, as you were used to, but you can also get that valuable information on your read replicas. Why replica workload capture is a big deal This is the unlock: you can now answer performance questions in replica-heavy architectures without stitching together partial signals. Per-node truth: see the slow queries on the node where they actually ran (primary vs replica vs cascading replica). Faster root cause: runtime + waits gives you “slow” and “why” in one place. Replica tuning that sticks: identify replica-specific bottlenecks (I/O saturation, lock waits, CPU pressure) and tune with evidence. Centralized observability: export to LAWS so you can build dashboards, alerts, and cross-server comparisons with KQL. Unlock query visibility: Access query text without database permissions. Fine grain control on who can view query text: Using resource specific tables in LAWS, you can decide which users can access the table in which text of the queries is kept. Parameter-aware debugging: sampled parameters can help reproduce issues and explain plan changes, but they’re stored locally in azure_sys and not exported to LAWS. In practice, rely on the primary for parameter inspection (replicas have read-only azure_sys). Operational notes (quick but important) Expect a delay: Query store stats and LAWS ingestion aren’t instant. Give it a few minutes after running workload. Mind retention: Query store retention and Log Analytics retention are separate knobs. Tune them to balance troubleshooting value and cost. Production hygiene: don’t use wide-open firewall rules outside of a demo. Clean up When you’re done, delete the resource group: az group delete --name <RESOURCE_GROUP> --yes --no-wait Bottom line Query store in Azure Database for PostgreSQL flexible server now matches how modern architectures run—across primary, read replicas, and cascading replicas—and LAWS gives you a single place to query, compare, and act.269Views12likes0CommentsWhen PostgreSQL v17 Chooses the Wrong Plan: A Deep Dive into CTEs with Data Skew
Common Table Expressions or CTEs provide temporary named result set that can be referenced within single complex SQL statements. CTE offer great flexibility in breaking down complex SQL queries into multiple readable parts and enables recursion. CTEs allows materialized result set which allows users to refer the result set multiple times. PostgreSQL 17 propagates column statistics from MATERIALIZED CTEs into the parent query, allowing the planner to estimate hash aggregation and join cardinalities more accurately. However, when queries combine highly skewed data distributions with runtime randomness (for example, ORDER BY random() LIMIT 1), this increased planner confidence can sometimes lead to execution plans that are correct in theory but catastrophic for specific runtime values. What is Data Skew and how it affects plans? Data skew is an uneven distribution of values in a column, where one or a few values occur far more often than the rest. To choose an efficient execution plan, PostgreSQL’s planner estimates row counts using statistics such as n_distinct, histograms, and most-common-values (MCV) frequencies stored in pg_statistic. If these estimates are off—especially if selectivity is overestimated—query performance can degrade significantly. Baseline Query and Observed Planner Behavior To gauge the performance impact of data skew with PGv17, we generated data into 3 tables relating to Accounts, Purchases and Audit. This baseline query selects a single random HIGH_RISK account and retrieves matching rows from a large audit_logs table via two materialized CTEs. Original Query with selected_account as materialized ( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1 ), audit_purchases as materialized ( select distinct p.account_id from purchases p join selected_account sa on p.account_id = sa.account_id ) select al.* from audit_logs al where exists ( select 1 from audit_purchases ap where al.account_id = ap.account_id); PostgreSQL 14 Execution Plan PostgreSQL 14 treats materialized CTEs as optimization fences and does not propagate detailed statistics into the parent query. As a result, the planner assumes relatively small CTE result sets and consistently favors nested loop joins with parameterized index scans. While these estimates are not strictly accurate, this conservative planning approach prevents full-table scans and avoids catastrophic performance under extreme data skew. In this workload, PostgreSQL 14 completes the query in approximately 7 ms. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ Nested Loop (cost=1412.94..212757485.16 rows=1500600064 width=27) (actual time=5.226..7.192 rows=30 loops=1) Buffers: shared hit=277 read=13 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.460..4.460 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.459..4.459 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.011..3.533 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> HashAggregate (cost=317.44..355.57 rows=3813 width=4) (actual time=4.735..4.741 rows=1 loops=1) Group Key: p.account_id Batches: 1 Memory Usage: 217kB Buffers: shared hit=274 read=1 -> Nested Loop (cost=0.44..292.52 rows=9966 width=4) (actual time=4.722..4.726 rows=10 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.461..4.461 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..192.84 rows=9966 width=4) (actual time=0.260..0.262 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> HashAggregate (cost=85.79..87.79 rows=200 width=4) (actual time=4.743..4.744 rows=1 loops=1) Group Key: ap.account_id Batches: 1 Memory Usage: 40kB Buffers: shared hit=274 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..76.26 rows=3813 width=4) (actual time=4.737..4.742 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..817780.34 rows=24600001 width=27) (actual time=0.480..2.438 rows=30 loops=1) Index Cond: (account_id = ap.account_id) Buffers: shared hit=3 read=12 Planning Time: 1.121 ms Execution Time: 7.253 ms (37 rows) PostgreSQL 17 Execution Plan PostgreSQL 17 correctly estimates that the audit_purchases CTE may contain thousands of rows and that joining it with audit_logs could produce tens of millions of matches. Based on these estimates, the planner selects a Hash Semi Join with a Sequential Scan on audit_logs, which is optimal when many rows are expected to match. However, at runtime the selected account produces only ~30 matching audit rows. Despite this, PostgreSQL must execute the chosen plan fully, scanning millions of rows from disk. This results in an execution time of 521,625 ms (~8.7 minutes). Importantly, this behavior is not a cost estimation bug—the estimates are statistically correct. The failure occurs because the PostgreSQL 17 planner estimates a large matching result set based on the cardinality of values produced by the audit_purchases CTE, and consequently chooses a plan optimized for that estimated volume rather than for the single runtime-selected account. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Hash Semi Join (cost=1691.60..93353890.16 rows=3001396992 width=26) (actual time=86577.858..521625.286 rows=30 loops=1) Hash Cond: (al.account_id = ap.account_id) Buffers: shared hit=28879 read=22040419 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.486..4.487 rows=1 loops=1) Buffers: shared hit=274 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.485..4.486 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=274 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.021..3.551 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> HashAggregate (cost=658.72..673.28 rows=1456 width=4) (actual time=4.907..4.912 rows=1 loops=1) Group Key: p.account_id Batches: 1 Memory Usage: 73kB Buffers: shared hit=277 read=1 -> Nested Loop (cost=0.44..606.86 rows=20742 width=4) (actual time=4.898..4.902 rows=10 loops=1) Buffers: shared hit=277 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.487..4.487 rows=1 loops=1) Buffers: shared hit=274 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=0.410..0.411 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> Seq Scan on audit_logs al (cost=0.00..52082989.92 rows=3001396992 width=26) (actual time=0.012..286755.571 rows=3001400050 loops=1) Buffers: shared hit=28602 read=22040418 -> Hash (cost=29.12..29.12 rows=1456 width=4) (actual time=4.919..4.920 rows=1 loops=1) Buckets: 2048 Batches: 1 Memory Usage: 17kB Buffers: shared hit=277 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..29.12 rows=1456 width=4) (actual time=4.908..4.911 rows=1 loops=1) Buffers: shared hit=277 read=1 Planning: Buffers: shared hit=197 Planning Time: 1.831 ms Execution Time: 521625.433 ms (38 rows) Forced nested loop Disabling hash joins at the session level forces PostgreSQL 17 to revert to a nested loop plan using index scans. This diagnostic step completes in ~8 ms, confirming that index-based execution is sufficient for the actual runtime workload. However, disabling planner features is not suitable for production use and serves only to validate the root cause. QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=1677.62..3361930001.36 rows=3001396992 width=26) (actual time=5.513..8.194 rows=30 loops=1) Buffers: shared hit=277 read=12 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.524..4.524 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.523..4.523 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.014..3.608 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> HashAggregate (cost=658.72..673.28 rows=1456 width=4) (actual time=4.890..4.894 rows=1 loops=1) Group Key: p.account_id Batches: 1 Memory Usage: 73kB Buffers: shared hit=274 read=1 -> Nested Loop (cost=0.44..606.86 rows=20742 width=4) (actual time=4.882..4.885 rows=10 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.525..4.525 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=0.354..0.356 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> HashAggregate (cost=32.76..47.32 rows=1456 width=4) (actual time=4.896..4.899 rows=1 loops=1) Group Key: ap.account_id Batches: 1 Memory Usage: 73kB Buffers: shared hit=274 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..29.12 rows=1456 width=4) (actual time=4.892..4.895 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..1782455.82 rows=52656088 width=26) (actual time=0.615..3.283 rows=30 loops=1) Index Cond: (account_id = ap.account_id) Buffers: shared hit=3 read=11 Planning Time: 0.184 ms Execution Time: 8.252 ms (37 rows) Fix Strategies While the fix isn’t straightforward, we attempt to modify the query plan through query rewrite using following methods: Lateral Join with Offset 0 Limit on Purchases Using subqueries Lateral Join with Offset 0 Adding OFFSET 0 inside a LATERAL subquery introduces an optimizer barrier that prevents join reordering. This forces PostgreSQL to execute a parameterized nested loop, passing the runtime account ID into an index scan on audit_logs. With this rewrite, PostgreSQL 17 executes the query in 6.519 ms, a 99.9988% improvement over the original plan. with selected_account as materialized ( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1 ), audit_purchases as materialized ( select p.account_id from purchases p join selected_account sa on p.account_id = sa.account_id ) select al.* from audit_purchases ap join lateral ( select * from audit_logs al where al.account_id = ap.account_id offset 0) al on true; Execution plan QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=1578.44..42162640233.98 rows=1092192577296 width=26) (actual time=4.980..6.481 rows=300 loops=1) Buffers: shared hit=403 read=12 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.296..4.297 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.295..4.296 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.013..3.357 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> Nested Loop (cost=0.44..606.86 rows=20742 width=4) (actual time=4.585..4.588 rows=10 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.298..4.298 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=0.286..0.287 rows=10 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..414.84 rows=20742 width=4) (actual time=4.587..4.591 rows=10 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..1506157.19 rows=52656088 width=26) (actual time=0.040..0.185 rows=30 loops=10) Index Cond: (account_id = ap.account_id) Buffers: shared hit=129 read=11 Planning: Buffers: shared hit=8 Planning Time: 0.238 ms Execution Time: 6.519 ms (31 rows) Limit on Purchases Applying LIMIT 1 to the audit_purchases CTE bounds its cardinality to a single row. This makes nested loops cheaper than hash joins during planning. PostgreSQL 17 selects an index‑driven execution plan and completes in 7.742 ms, yielding a 99.9985% improvement. with selected_account as materialized ( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1 ), audit_purchases as materialized ( select p.account_id from purchases p join selected_account sa on p.account_id = sa.account_id limit 1 ) select al.* from audit_logs al where exists (select 1 from audit_purchases ap where al.account_id=ap.account_id); Execution plan QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=972.07..933697.99 rows=24600001 width=27) (actual time=5.693..7.686 rows=30 loops=1) Buffers: shared hit=276 read=13 CTE selected_account -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.816..4.817 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.816..4.817 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.014..3.825 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 CTE audit_purchases -> Limit (cost=0.44..0.47 rows=1 width=4) (actual time=5.133..5.134 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Nested Loop (cost=0.44..292.52 rows=9966 width=4) (actual time=5.132..5.132 rows=1 loops=1) Buffers: shared hit=274 read=1 -> CTE Scan on selected_account sa (cost=0.00..0.02 rows=1 width=4) (actual time=4.817..4.817 rows=1 loops=1) Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..192.84 rows=9966 width=4) (actual time=0.313..0.313 rows=1 loops=1) Index Cond: (account_id = sa.account_id) Heap Fetches: 0 Buffers: shared hit=3 read=1 -> HashAggregate (cost=0.02..0.03 rows=1 width=4) (actual time=5.138..5.139 rows=1 loops=1) Group Key: ap.account_id Batches: 1 Memory Usage: 24kB Buffers: shared hit=274 read=1 -> CTE Scan on audit_purchases ap (cost=0.00..0.02 rows=1 width=4) (actual time=5.135..5.135 rows=1 loops=1) Buffers: shared hit=274 read=1 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..686726.47 rows=24600001 width=27) (actual time=0.552..2.534 rows=30 loops=1) Index Cond: (account_id = ap.account_id) Buffers: shared hit=2 read=12 Planning Time: 0.216 ms Execution Time: 7.742 ms (35 rows) Using subqueries Rewriting the query using scalar subqueries converts the selected account ID into an InitPlan. The resulting value is then used as a runtime parameter in index scans against audit_logs. This eliminates join reordering opportunities and guarantees a parameterized access path. Both scalar subquery variants complete in approximately 7.3–7.6 ms, achieving over 99.9986% improvement. Subquery Example 1 select * from audit_logs al where al.account_id in (select p.account_id from purchases p where p.account_id=( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1)); Execution plan QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------- Nested Loop Semi Join (cost=972.02..1302907.98 rows=24600001 width=27) (actual time=5.257..7.260 rows=30 loops=1) Buffers: shared hit=277 read=12 InitPlan 1 (returns $0) -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.612..4.613 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.611..4.611 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.012..3.674 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 -> Index Scan using idx_account_audit on audit_logs al (cost=0.58..686755.98 rows=24600001 width=27) (actual time=5.067..7.059 rows=30 loops=1) Index Cond: (account_id = $0) Buffers: shared hit=274 read=11 -> Materialize (cost=0.44..242.67 rows=9966 width=4) (actual time=0.006..0.006 rows=1 loops=30) Buffers: shared hit=3 read=1 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..192.84 rows=9966 width=4) (actual time=0.187..0.188 rows=1 loops=1) Index Cond: (account_id = $0) Heap Fetches: 0 Buffers: shared hit=3 read=1 Planning Time: 0.160 ms Execution Time: 7.286 ms (24 rows) Subquery Example 2 select * from audit_logs al where al.account_id=(select distinct p.account_id from purchases p where p.account_id=( select account_id from accounts where account_type = 'HIGH_RISK' order by random() limit 1)); Execution plan QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------ Index Scan using idx_account_audit on audit_logs al (cost=972.04..1507128.65 rows=52656088 width=26) (actual time=5.510..7.637 rows=30 loops=1) Index Cond: (account_id = (InitPlan 2).col1) Buffers: shared hit=277 read=12 InitPlan 2 -> Limit (cost=971.44..971.46 rows=1 width=4) (actual time=4.901..4.902 rows=1 loops=1) Buffers: shared hit=274 read=1 InitPlan 1 -> Limit (cost=971.00..971.00 rows=1 width=12) (actual time=4.455..4.456 rows=1 loops=1) Buffers: shared hit=271 -> Sort (cost=971.00..996.00 rows=10000 width=12) (actual time=4.455..4.455 rows=1 loops=1) Sort Key: (random()) Sort Method: top-N heapsort Memory: 25kB Buffers: shared hit=271 -> Seq Scan on accounts (cost=0.00..921.00 rows=10000 width=12) (actual time=0.018..3.522 rows=10000 loops=1) Filter: (account_type = 'HIGH_RISK'::text) Rows Removed by Filter: 40000 Buffers: shared hit=271 -> Index Only Scan using idx_purchases_account on purchases p (cost=0.44..399.42 rows=20742 width=4) (actual time=4.900..4.900 rows=1 loops=1) Index Cond: (account_id = (InitPlan 1).col1) Heap Fetches: 0 Buffers: shared hit=274 read=1 Planning: Buffers: shared hit=4 Planning Time: 0.154 ms Execution Time: 7.663 ms (25 rows) Query execution summary The table below summarizes execution time improvements across all tested rewrites relative to the original PostgreSQL 17 and 14 query execution plans. Setup / Rewrite Execution Time (ms) Speed‑up Factor (×) PG17 original (Hash Semi Join + Seq Scan) 521,625.433 1.0× PG14 original query shape 7.253 71,919× Forced nested loop (hashjoin off – diagnostic) 8.252 63,212× LATERAL JOIN + OFFSET 0 6.519 80,016× LIMIT on purchases 7.742 67,376× Subquery – IN with InitPlan (Example 1) 7.286 71,593× Subquery – scalar subquery (Example 2) 7.663 68,071× Key Takeaways While PostgreSQL 17’s planner improvements are correct, intentional, and beneficial for most workloads. However, queries that combine CTEs, runtime randomness, and heavily skewed data can cause the planner to select globally optimal plans that are locally inefficient. When runtime values drastically narrow result sets, query rewrites that force parameterized index scans—such as LATERAL joins, scalar subqueries, or bounded CTEs—provide predictable and stable performance.241Views0likes0CommentsPotential Consequences of Using Postgres as a Job Queue
Introduction At small scale, using Postgres as a job queue is totally fine, and I’d even say it’s the right call. Fewer moving parts, one less system to manage, ACID guarantees on your jobs. What’s not to love? The problem is that “small scale” has a ceiling, and the ceiling is lower than most people expect. When you’ve got thousands of concurrent workers hammering a jobs table with SELECT ... FOR UPDATE SKIP LOCKED , things start to behave in ways that aren’t obvious from the application layer. CPU usage creeps up. Also vacuum sometimes can’t keep up. Finally, in the wait event stats, you start seeing ominous entries like LWLock:MultiXactSLRU stacking up across many backends. This pattern has tripped up teams more than a few times, and it usually plays out the same way: everything works fine in dev and staging, then goes off a cliff in production once the concurrency gets real. So let’s dig into why this happens, and what the alternatives look like. The Typical Pattern When using Postgres as a job queue, the standard approach looks something like this: CREATE TABLE job_queue ( id bigserial PRIMARY KEY, status text NOT NULL DEFAULT 'pending', payload jsonb NOT NULL, created_at timestamptz NOT NULL DEFAULT now(), locked_by text, locked_at timestamptz ); CREATE INDEX idx_job_queue_status ON job_queue (status) WHERE status = 'pending'; Workers grab jobs with: UPDATE job_queue SET status = 'processing', locked_by = 'worker-42', locked_at = now() WHERE id = ( SELECT id FROM job_queue WHERE status = 'pending' ORDER BY created_at LIMIT 1 FOR UPDATE SKIP LOCKED ) RETURNING *; And then mark them done: UPDATE job_queue SET status = 'completed' WHERE id = $1; Some users may DELETE the row entirely. Either way, the lifecycle is: insert, lock-and-update, update-or-delete. Repeated thousands of times per second. At low concurrency, this works very smoothly. SKIP LOCKED means workers don’t block each other waiting for the same row. Postgres handles the locking, visibility, and ordering. It’s elegant. So where does it break? The MultiXact SLRU Problem When multiple transactions hold locks on the same row, Postgres stores the set of lockers as a MultiXact ID – a pointer into a side structure under pg_multixact/ . With SELECT ... FOR UPDATE SKIP LOCKED , users might think MultiXacts aren’t involved – after all, SKIP LOCKED is supposed to avoid contention. But in practice, with many concurrent workers all racing to lock rows, there are brief windows where multiple transactions reference the same row before one of them “wins” and the others skip. If you combine this with any FOR SHARE or FOR KEY SHARE locks (which are commonly created implicitly by foreign key checks), MultiXact IDs start accumulating quickly. The MultiXact data lives in SLRU buffers (Simple Least Recently Used) – a small, fixed-size shared memory cache. When backends need to read or write MultiXact data, they acquire LWLocks to access these buffers. Under high concurrency, this becomes a bottleneck: wait_event_type | wait_event -----------------+------------------- LWLock | MultiXactMemberSLRU LWLock | MultiXactOffsetSLRU You’ll see dozens or hundreds of backends piled up on these waits. The SLRU cache is small (by design – it’s a fixed number of pages in shared memory), and when the working set of MultiXact lookups exceeds what fits in the cache, you get constant eviction and re-reads from disk. Every lock acquisition and release on a job row potentially triggers a MultiXact SLRU lookup, and at thousands of concurrent sessions, those lookups serialize on LWLocks. The result: CPU gets pegged, throughput collapses, and latency spikes – not because the queries are expensive, but because the locking infrastructure itself is overwhelmed. Bloat: The Silent Killer The other side of this coin is table and index bloat. Every job row goes through multiple updates (and possibly a delete), and each of those operations creates a new tuple version in the heap. The old versions stick around until VACUUM cleans them up. On a busy job queue table: Dead tuples accumulate faster than autovacuum can clean them. By the time autovacuum finishes one pass, tens of thousands of new dead tuples have appeared. The table grows and grows. Index bloat compounds the problem. Every index on the table also accumulates dead entries. The partial index on status = 'pending' gets thrashed especially hard, since rows constantly enter and leave that condition. Sequential scans get slower. As the table bloats, even index scans start doing more I/O because the heap pages are sparsely populated. Vacuum reclaims space at the end of the table, but can’t reclaim space in the middle (unless the pages are completely empty). Job queue tables can grow to tens of gigabytes when the actual “live” data was only a few megabytes. It makes everything slower: scans, vacuum, even pg_dump. You can mitigate this by running vacuum more aggressively (lower autovacuum_vacuum_scale_factor , higher autovacuum_vacuum_cost_limit ), or by partitioning the table and dropping old partitions. But at some point, you’re fighting the fundamental mismatch between MVCC’s design goals and the write pattern of a job queue. CPU and Lock Overhead Beyond the SLRU contention and bloat, there’s just the raw overhead of using Postgres’s full transactional machinery for what is essentially a FIFO dispatch operation: Every lock/unlock is a full WAL-logged transaction. Grabbing a job writes WAL. Marking it complete writes WAL. Deleting it writes WAL. On a system processing thousands of jobs per second, the WAL volume from the job queue alone can saturate your wal_writer and checkpoint processes. SKIP LOCKED still touches rows. The name suggests rows are skipped, but Postgres still has to find them, check their lock status, and move on. With high concurrency, many workers end up scanning past the same locked rows before finding one they can claim. This is wasted CPU. Snapshot management overhead also becomes an issue. Each transaction needs a consistent snapshot, and with thousands of concurrent transactions, the ProcArray (the structure that tracks active transactions) becomes a contention point itself. You might see LWLock:ProcArrayLock waits alongside the MultiXact ones. Vacuum contention. While vacuum is cleaning up dead tuples, it needs locks too. On a table under constant write pressure, vacuum can interfere with the workers and vice versa. I’ve seen systems where disabling autovacuum on the job queue table improved throughput in the short term. Better Alternatives So what should you use instead? It depends on your requirements, but there are several options that handle high-throughput job dispatch more gracefully than a Postgres table. Advisory Locks (Staying in Postgres) If you want to stay within Postgres and avoid adding infrastructure, advisory locks are worth considering for certain queue patterns. Instead of locking rows, you lock on an abstract numeric key: -- Worker tries to acquire a lock on the job ID SELECT pg_try_advisory_lock(id) FROM job_queue WHERE status = 'pending' ORDER BY created_at LIMIT 1; Advisory locks are lightweight – they don’t touch the heap, don’t create MultiXact entries, and don’t generate dead tuples. They live entirely in shared memory. The trade-off is that you lose the atomicity of FOR UPDATE SKIP LOCKED : you need to handle the case where a lock is acquired but the job processing fails, and you need to release the lock explicitly (or rely on session-end cleanup). This approach works well when the queue depth is manageable and you want to avoid the MVCC overhead. But it’s still Postgres, so you’re still subject to connection limits, ProcArray overhead, and general resource contention at very high session counts. pgq (Skytools) pgq is purpose-built for exactly this problem. It’s a queue implementation that sits inside Postgres but uses a batching model that avoids most of the row-level locking and MVCC pitfalls. Events are written to a queue table, but consumers read them in batches and the queue maintenance is done via a ticker process that manages rotation. The key advantages: No row-level contention. Consumers don’t lock individual rows. Built-in batch processing. Events are consumed in chunks, reducing transaction overhead. Efficient cleanup. Old events are rotated out rather than vacuumed row-by-row. The downside is that pgq is not as actively maintained as it once was, and it adds operational complexity (the ticker daemon, consumer registration, etc.). But for teams already deep in the Postgres ecosystem, it’s a battle-tested option. Redis For many teams, Redis is the natural choice for job queues. Using Redis lists (BRPOPLPUSH or the Streams API), you get: Sub-millisecond dispatch latency. No disk I/O, no MVCC, no vacuum. Atomic pop operations. Workers grab jobs without any locking protocol. Simple scaling. Redis handles thousands of concurrent consumers trivially. The trade-off is durability. Redis can persist to disk, but it’s not ACID. If Redis crashes between a pop and the job completing, you might lose or duplicate work (though Redis Streams with consumer groups mitigate this significantly). For most job queue use cases, at-least-once delivery is acceptable, and Redis does that well. Kafka For truly high-throughput, distributed workloads, Apache Kafka is the heavyweight option. Kafka partitions give you parallel consumption with ordering guarantees per partition, durable storage, and replay capability. It’s the right tool when: You need to process thousands of events per second Multiple consumers need to read the same events You want event replay or audit trails Your architecture is already event-driven The operational overhead is nontrivial – ZooKeeper (or KRaft), brokers, topic management, consumer group coordination. But for teams already running Kafka for other reasons, adding a job queue topic is practically free. Choosing the Right Tool Here’s a rough decision guide: Under 100 concurrent workers, simple jobs, Postgres with SKIP LOCKED is fine Moderate concurrency, want to stay in Postgres, Advisory locks or pgq High throughput, low-latency dispatch, Redis (Lists or Streams) Massive scale, distributed, event replay, Kafka Many teams that start with Postgres (reasonably) hit scaling problems and then try to fix Postgres rather than recognizing that the workload has outgrown the tool. They throw more autovacuum workers at it, increase max_connections , add connection poolers – all of which help at the margins, but don’t address the fundamental issue: Postgres’s MVCC and locking machinery wasn’t designed for this access pattern at high concurrency. Conclusion Postgres is great, but it can’t be the best tool for every job. Using it as a job queue is a perfectly valid choice when your scale is modest. But when you’re running thousands of concurrent workers, the combination of MultiXact SLRU contention, heap bloat, vacuum pressure, and raw locking overhead will eventually push you toward a purpose-built solution. The good news is that you don’t have to rip out everything. Advisory locks can buy you headroom without adding infrastructure. Redis can handle dispatch while Postgres keeps owning the data. And if you’re already using Kafka, a job topic is a natural fit. Take your pick – there are many queueing options out there!310Views4likes0CommentsStop Experimenting, Start Building: AI Apps & Agents Dev Days Has You Covered
The AI landscape has shifted. The question is no longer “Can we build AI applications?” it’s “Can we build AI applications that actually work in production?” Demos are easy. Reliable, scalable, resilient AI systems that handle real-world complexity? That’s where most teams struggle. If you’re an AI developer, software engineer, or solution architect who’s ready to move beyond prototypes and into production-grade AI, there’s a series built specifically for you. What Is AI Apps & Agents Dev Days? AI Apps & Agents Dev Days is a monthly technical series from Microsoft Reactor, delivered in partnership with Microsoft and NVIDIA. You can explore the full series at https://developer.microsoft.com/en-us/reactor/series/s-1590/ This isn’t a slide deck marathon. The series tagline says it best: “It’s not about slides, it’s about building.” Each session tackles real-world challenges, shares patterns that actually work, and digs into what’s next in AI-driven app and agent design. You bring your curiosity, your code, and your questions. You leave with something you can ship. The sessions are led by experienced engineers and advocates from both Microsoft and NVIDIA, people like Pamela Fox, Bruno Capuano, Anthony Shaw, Gwyneth Peña-Siguenza, and solutions architects from NVIDIA’s Cloud AI team. These aren’t theorists; they’re practitioners who build and ship the tools you use every day. What You’ll Learn The series covers the full spectrum of building AI applications and agent-based systems. Here are the key themes: Building AI Applications with Azure, GitHub, and Modern Tooling Sessions walk through how to wire up AI capabilities using Azure services, GitHub workflows, and the latest SDKs. The focus is always on code-first learning, you’ll see real implementations, not abstract architecture diagrams. Designing and Orchestrating AI Agents Agent development is one of the series’ strongest threads. Sessions cover how to build agents that orchestrate long-running workflows, persist state automatically, recover from failures, and pause for human-in-the-loop input, without losing progress. For example, the session “AI Agents That Don’t Break Under Pressure” demonstrates building durable, production-ready AI agents using the Microsoft Agent Framework, running on Azure Container Apps with NVIDIA serverless GPUs. Scaling LLM Inference and Deploying to Production Moving from a working prototype to a production deployment means grappling with inference performance, GPU infrastructure, and cost management. The series covers how to leverage NVIDIA GPU infrastructure alongside Azure services to scale inference effectively, including patterns for serverless GPU compute. Real-World Architecture Patterns Expect sessions on container-based deployments, distributed agent systems, and enterprise-grade architectures. You’ll learn how to use services like Azure Container Apps to host resilient AI workloads, how Foundry IQ fits into agent architectures as a trusted knowledge source, and how to make architectural decisions that balance performance, cost, and scalability. Why This Matters for Your Day Job There’s a critical gap between what most AI tutorials teach and what production systems actually require. This series bridges that gap: Production-ready patterns, not demos. Every session focuses on code and architecture you can take directly into your projects. You’ll learn patterns for state persistence, failure recovery, and durable execution — the things that break at 2 AM. Enterprise applicability. The scenarios covered — travel planning agents, multi-step workflows, GPU-accelerated inference — map directly to enterprise use cases. Whether you’re building internal tooling or customer-facing AI features, the patterns transfer. Honest trade-off discussions. The speakers don’t shy away from the hard questions: When do you need serverless GPUs versus dedicated compute? How do you handle agent failures gracefully? What does it actually cost to run these systems at scale? Watch On-Demand, Build at Your Own Pace Every session is available on-demand. You can watch, pause, and build along at your own pace, no need to rearrange your schedule. The full playlist is available at This is particularly valuable for technical content. Pause a session while you replicate the architecture in your own environment. Rewind when you need to catch a configuration detail. Build alongside the presenters rather than just watching passively. What You’ll Walk Away Wit After working through the series, you’ll have: Practical agent development skills — how to design, orchestrate, and deploy AI agents that handle real-world complexity, including state management, failure recovery, and human-in-the-loop patterns Production architecture patterns — battle-tested approaches for deploying AI workloads on Azure Container Apps, leveraging NVIDIA GPU infrastructure, and building resilient distributed systems Infrastructure decision-making confidence — a clearer understanding of when to use serverless GPUs, how to optimise inference costs, and how to choose the right compute strategy for your workload Working code and reference implementations — the sessions are built around live coding and sample applications (like the Travel Planner agent demo), giving you starting points you can adapt immediately A framework for continuous learning — with new sessions each month, you’ll stay current as the AI platform evolves and new capabilities emerge Start Building The AI applications that will matter most aren’t the ones with the flashiest demos — they’re the ones that work reliably, scale gracefully, and solve real problems. That’s exactly what this series helps you build. Whether you’re designing your first AI agent system or hardening an existing one for production, the AI Apps & Agents Dev Days sessions give you the patterns, tools, and practical knowledge to move forward with confidence. Explore the series at https://developer.microsoft.com/en-us/reactor/series/s-1590/ and start watching the on-demand sessions at the link above. The best time to level up your AI engineering skills was yesterday. The second-best time is right now and these sessions make it easy to start.Building a Restaurant Management System with Azure Database for MySQL
In this hands-on tutorial, we'll build a Restaurant Management System using Azure Database for MySQL. This project is perfect for beginners looking to understand cloud databases while creating something practical.1.3KViews5likes5CommentsCascading Read Replicas Now Generally Available!
We’re excited to announce the General Availability of cascading read replicas in Azure Database for PostgreSQL. This capability allows you to create read replicas for your Azure Database for PostgreSQL instance not only from a primary server, but also from existing read replicas, enabling multi‑level replication chains. Coordinating read‑heavy database workloads across multiple regions can be challenging, especially when you’re trying to deliver low‑latency read response experiences to users spread across different geographic locations. One effective way to address this is by placing read replicas closer to where your users are, allowing applications to serve read requests with significantly reduced latency and improved performance. What are cascading read replicas? With cascading read replicas, you can scale read‑intensive workloads more effectively, distribute read traffic efficiently, and support advanced deployment topologies such as globally distributed applications. Each read replica can act as a source for additional replicas, forming a tree‑like replication structure. For example, if your primary server is deployed in one region, you can create direct replicas in nearby regions and then cascade additional replicas to more distant locations. This approach helps spread read traffic evenly while minimizing latency for users around the world. We support up to 2 levels of replication with this feature. Level 1 will be all the read replicas and level 2 will be cascading read replicas. Why use cascading read replicas? Improved scalability Cascading read replicas support multi‑level replication, making it easier to handle high volumes of read traffic without overloading a single instance by scaling up to 30 read replicas. Geographic distribution By placing replicas closer to your global user base, you can significantly reduce read latency and deliver faster, more responsive application experiences. Efficient read traffic distribution Distributing read workloads across multiple replicas helps balance load, improving overall performance and reliability. Additionally, cascading read replicas offer operational flexibility. If you observe replication lag, you can perform a switchover operation between a cascading read replica with its source or intermediate replica, helping you maintain optimal performance and availability for your replicas. How does replication work with cascading read replicas? The primary server acts as a source for the read replica. Data is asynchronously replicated to these replicas. When we add cascading replicas, the previous replicas act as a data source for replication. In the diagram above, “primary-production-server” is the primary server with three read replicas. One of these replicas, “readreplica01”, serves as the source for another read replica, “readreplica11” which is a cascading read replica. With cascading read replicas, you can add up to five read replicas per source and replicate data across two levels, as shown in the diagram. This allows you to create up to 30 read replicas in total five read replicas directly from the primary server, and up to 25 additional replicas at the second level (each second-level replica can have up to five read replicas). If you notice replication lag between an intermediate read replica and a cascading read replica, you can use a switchover operation to swap “readreplica01” and “readreplica11”, helping reduce the impact of lag. To learn more about cascading read replicas, please refer to our documentation: Cascading read replicas Deploying cascading read replicas on Azure portal Navigate to the “Replication” tab and then click on “Create replica” highlighted in red as shown below: After creating a read replica as the below screenshot shows that you have 1 read replica that is attached to the primary instance. Click on the created replica and navigate to the replication tab, source server is “read-replica-01” and we will be creating a cascading read replica under this. Once cascading read replica is created you can see the role of “read-replica-01” has now changed to Source, Replica. You can perform site swap operation by clicking on the promote button for cascading read replica. Deploy cascading read replica with terraform: Before you start, make sure you have: An existing primary PostgreSQL Flexible Server At least one read replica already created from the primary AzureRM provider with latest version Proper permissions on the Azure subscription and resource group Configure the AzureRM Provider: Start by configuring the AzureRM provider in your Terraform project. terraform { required_providers { azurerm = { source = "hashicorp/azurerm" version = "~> 3.80" } } } provider "azurerm" { features {} } Reference the existing read replica server using the data block to reference the replica server. data "azurerm_postgresql_flexible_server" "source_replica" { name = "my-read-replica-1" resource_group_name = "my-resource-group" } Now create a new PostgreSQL Flexible Server and point it to the replica using create_source_server_id. resource "azurerm_postgresql_flexible_server" "cascading_replica" { name = "my-cascading-replica" resource_group_name = "my-resource-group" location = data.azurerm_postgresql_flexible_server.source_replica.location version = data.azurerm_postgresql_flexible_server.source_replica.version delegated_subnet_id = data.azurerm_postgresql_flexible_server.source_replica.delegated_subnet_id private_dns_zone_id = data.azurerm_postgresql_flexible_server.source_replica.private_dns_zone_id create_mode = "Replica" create_source_server_id = data.azurerm_postgresql_flexible_server.source_replica.id storage_mb = 32768 sku_name = "Standard_D4s_v3" depends_on = [ data.azurerm_postgresql_flexible_server.source_replica ] } Apply the Terraform Configuration terraform init terraform plan terraform apply Key Considerations Cascading read replicas allow for up to 5 read replicas and two levels of replication. Creating cascading read replicas is supported in PostgreSQL version 14 and above. Promote operation is not supported for intermediate read replicas with cascading read replicas. Conclusion Cascading read replicas in Azure Database for PostgreSQL offer a scalable way to distribute your read traffic across the same and different regions, reducing the read workload on primary database. For globally distributed applications, this can improve read latency as well as resilience and performance. This design supports horizontal scaling as your application demand grows, ensuring you can handle a high volume of read requests without compromising speed. Get started with this feature today and scale your read workloads.319Views1like0CommentsWhy Windows Should Adopt ReFS as a Bootable Filesystem
ReFS could become a bootable filesystem — it only needs a few missing layers. No need to copy NTFS, just implement what the Windows boot process requires. Key missing pieces: System‑level journaling (not only metadata) Full hardlink + extended attribute support EFS, ACLs, USN Journal for security + Windows Update Boot‑critical atomicity for safe system file updates Bootloader‑compatible APIs (BCD, BitLocker pre‑boot, WinRE, Secure Boot) Goals: Use NTFS as a reference map, add the missing capabilities to ReFS, and optimize them using ReFS features (copy‑on‑write, integrity streams, block cloning). Result: A modern, resilient filesystem that can finally boot Windows - without losing its benefits.171Views1like7CommentsHandling Unique Constraint Conflicts in Logical Replication
Authors: Ashutosh Sharma, Senior Software Engineer, and Gauri Kasar, Product Manager Logical replication can keep your PostgreSQL environments in sync, helping replicate selected tables with minimal impact on the primary workload. But what happens when your subscriber hits a duplicate key error and replication grinds to a halt? If you’ve seen a unique‑constraint violation while replicating between Azure Database for PostgreSQL servers, you’re not alone. This blog covers common causes, prevention tips, and practical recovery options. In PostgreSQL logical replication, the subscriber can fail with a unique-constraint error when it tries to apply a change that would create a duplicate key. duplicate key value violates unique constraint Understanding why this happens? When an INSERT or UPDATE would create a value that already exists in a column (or set of columns) protected by a UNIQUE constraint (including a PRIMARY KEY). In logical replication, this most commonly occurs because of local writes on the subscriber or if the table is being subscribed from multiple publishers. These conflicts are resolved on the subscriber side. Local writes on the subscriber: a row with the same primary key/unique key is inserted on the subscriber before the apply worker processes the corresponding change from the publisher. Multi-origin / multi-master without conflict-free keys: two origins generate (or replicate) the same unique key. Initial data synchronization issues: the subscriber already contains data when the subscription is created with initial copy enabled, resulting in duplicate inserts during the initial table sync. How to avoid this? Avoid local writes on subscribed tables (treat the subscriber as read-only for replicated relations). Avoid subscribing to the same table from multiple publishers unless you have explicit conflict handling and a conflict-free key design. Enabling server logs can help you identify and troubleshoot unique‑constraint conflicts more effectively. Refer to the official documentation to configure and access PostgreSQL logs. How to handle conflicts (recovery options) Option 1: Delete the conflicting row on the subscriber Use the subscriber logs to identify the key (or row) causing the conflict, then delete the row on the subscriber with a DELETE statement. Resume apply and repeat if more conflicts appear. Option 2: Use conflict logs and skip the conflicting transaction (PostgreSQL 17+) Starting with PostgreSQL 17, logical replication provides detailed conflict logging on the subscriber, making it easier to understand why replication stopped and which transaction caused the failure. When a replicated INSERT would violate a non‑deferrable unique constraint on the subscriber for example, when a row with the same key already exists the apply worker detects this as an insert_exists conflict and stops replication. In this case, PostgreSQL logs the conflict along with the transaction’s finish LSN, which uniquely identifies the failing transaction. ERROR: conflict detected on relation "public.t2": conflict=insert_exists ... in transaction 754, finished at 0/034F4090 ALTER SUBSCRIPTION <subscription_name> SKIP (lsn = '0/034F4090'); Option 3: Rebuild (re-sync) the table Rebuilding (re‑syncing) a table is the safest and most deterministic way to resolve logical replication conflicts caused by pre‑existing data differences or local writes on the subscriber. This approach is especially useful when a table repeatedly fails with unique‑constraint violations and it is unclear which rows are out of sync. Step 1 (subscriber): Disable the subscription. ALTER SUBSCRIPTION <subscription_name> DISABLE; Step 2 (subscriber): Remove the local copy of the table so it can be re-copied. TRUNCATE TABLE <conflicting_table>; Step 3 (publisher): Ensure the publication will (re)send the table (one approach is to recreate the publication entry for that table). ALTER PUBLICATION <pub_with_conflicting_table> DROP TABLE <conflicting_table>; CREATE PUBLICATION <pub_with_conflicting_table_rebuild> FOR TABLE <conflicting_table>; Step 4 (subscriber): Create a new subscription (or refresh the existing one) to re-copy the table. CREATE SUBSCRIPTION <sub_rebuild> CONNECTION '<connection_string>' PUBLICATION <pub_with_conflicting_table_rebuild>; Step 5 (subscriber): Re-enable the original subscription (if applicable). ALTER SUBSCRIPTION <subscription_name> ENABLE; Conclusion In most cases, these conflicts occur due to local changes on the subscriber or differences in data that existed before logical replication was fully synchronized. It is recommended to avoid direct modifications on subscribed tables and ensure that the replication setup is properly planned, especially when working with tables that have unique constraints.195Views1like0Comments