Query Optimization
Topic: PostgreSQL Query Rewriting and Subqueries
In PostgreSQL, the way a query is written has a direct impact on the execution plan the optimizer selects, and consequently on the overall performance of the system. Although the optimizer evaluates multiple plan alternatives, it can only consider strategies that the SQL structure logically allows. Suboptimal query patterns often force PostgreSQL into unnecessary work, such as processing more rows than needed, repeating lookups row by row, or applying expensive deduplication or aggregation after an overly large join. Rewriting queries in more optimizer-friendly forms can drastically reduce the amount of data scanned, improve join efficiency, eliminate redundant operations, and enable PostgreSQL to choose faster, more scalable execution paths.

This document walks through common scenarios in which rewriting a query into a subquery (or rewriting a subquery into a different form) can significantly improve performance. Simple transformations, such as replacing joins with semi-joins, pre-aggregating large tables, or converting correlated subqueries into set-based joins, allow PostgreSQL to use more efficient physical operators and achieve significant performance gains. Each scenario includes an explanation, an example rewrite, and a self-contained test script you can run. The tests below were performed on a 4 vCore SKU with a 128 GB storage disk.
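To reproduce the comparisons below, each query can be run with EXPLAIN (ANALYZE, BUFFERS) to see the chosen plan, actual row counts, timings, and buffer I/O. A minimal sketch, using one of the Scenario 1 queries against the test tables created by the setup script that follows:

```sql
-- Show the actual execution plan, per-node timings, and buffer usage.
-- Note: EXPLAIN ANALYZE actually executes the query, so wrap any
-- data-modifying statement in a transaction you can roll back.
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.customer_id
FROM customers c
WHERE c.region = 'E'
  AND EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
  );
```

Comparing the node types (e.g. Hash Join + Unique versus Hash Semi Join), the actual row counts, and the "Buffers:" lines for the before/after pairs is the quickest way to confirm each rewrite behaves as described.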
Test setup script:

DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;

CREATE TABLE customers (
    customer_id int PRIMARY KEY,
    region      text NOT NULL
);

CREATE TABLE orders (
    order_id    bigserial PRIMARY KEY,
    customer_id int NOT NULL REFERENCES customers(customer_id),
    created_at  timestamptz NOT NULL
);

INSERT INTO customers
SELECT g, CASE WHEN g % 5 = 0 THEN 'E' ELSE 'W' END
FROM generate_series(1, 200000) g;

INSERT INTO orders(customer_id, created_at)
SELECT 1 + ((g * 37) % 200000),
       now() - ((g % 365) || ' days')::interval
FROM generate_series(1, 5000000) g;

CREATE INDEX orders_customer_id_idx ON orders(customer_id);

ANALYZE customers;
ANALYZE orders;

--------------------------------------------------
Scenario 1: Semi-join filtering (JOIN → EXISTS)
--------------------------------------------------

When it helps:
Use this when you only need to check for the existence of related rows and do not need columns from the joined table. EXISTS allows PostgreSQL to use a semi-join strategy and avoids row multiplication.

Before:

SELECT DISTINCT c.customer_id
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
WHERE c.region = 'E';

Query plan:

Explanation:
This query forces PostgreSQL to:
• Perform a full join between customers and orders
• Produce multiple rows per customer (one for each matching order)
• Apply a Unique/Aggregate step to remove the duplicates

This results in:
• More rows flowing through the join
• Higher memory usage
• An expensive deduplication step
• Possibly large hash tables or repeated nested-loop iterations

After:

SELECT c.customer_id
FROM customers c
WHERE c.region = 'E'
  AND EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
  );

Query plan:

Explanation:
With the EXISTS rewrite, PostgreSQL can switch to a Semi Join, which has two crucial advantages:
• It stops scanning orders after the first match. A semi-join checks only for the existence of at least one matching row; it does not need all of them.
Early termination dramatically reduces I/O.
• No need for DISTINCT. Because no row multiplication occurs, the result is naturally unique.

Why performance improves:
• Fewer rows scanned
• No row explosion
• No sort/aggregate step for DISTINCT
• A semi-join is cheaper in both CPU and memory

Net effect: a large reduction in join work and downstream processing.

--------------------------------------------------
Scenario 2: Pre-aggregate using a subquery
--------------------------------------------------

When it helps:
If you only need aggregated results from a large fact table, aggregate first in a subquery and then join. This reduces join cardinality and memory usage.

Before:

SELECT c.region, count(*) AS order_ct
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.region;

Query plan:

Explanation:
The original query aggregates at the end, after joining the entire orders table with customers:
• The join processes millions of rows
• Grouping is done on the expanded join result
• Memory consumption grows significantly
• The hash join must handle a very large input

After:

SELECT c.region, sum(x.order_ct) AS order_ct
FROM customers c
JOIN (
    SELECT customer_id, count(*) AS order_ct
    FROM orders
    GROUP BY customer_id
) x ON x.customer_id = c.customer_id
GROUP BY c.region;

Query plan:

Explanation:
The rewritten query aggregates orders before joining. This reduces the orders input from 5 million rows to at most one row per customer (200,000 here), so PostgreSQL can join a much smaller, pre-aggregated set with customers. The plan changes accordingly:
• HashAggregate on orders (small output)
• The join happens on significantly reduced data
• The final grouping is trivial because the input has already been summarized

Why performance improves:
• Drastic reduction in join input size
• Much smaller hash tables
• Fewer rows grouped at the final stage
• Less risk of spilling to disk
• Better CPU and memory efficiency

Net effect: doing the heavy work early shrinks the workload for the rest of the query.
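The Scenario 2 pre-aggregation can equivalently be written with a CTE, which some teams find more readable. Since PostgreSQL 12, a plain (non-MATERIALIZED) CTE is inlined by the planner, so this form typically produces the same plan as the derived-table version; it is a stylistic variant, not a separate optimization:

```sql
-- Pre-aggregate orders in a CTE, then join the small result to customers.
-- PostgreSQL 12+ inlines this CTE unless it is declared AS MATERIALIZED.
WITH per_customer AS (
    SELECT customer_id, count(*) AS order_ct
    FROM orders
    GROUP BY customer_id
)
SELECT c.region, sum(x.order_ct) AS order_ct
FROM customers c
JOIN per_customer x ON x.customer_id = c.customer_id
GROUP BY c.region;
```

On PostgreSQL 11 and earlier, CTEs were always materialized, which could defeat predicate pushdown; on those versions, prefer the derived-table form shown above.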
--------------------------------------------------
Scenario 3: Correlated scalar subquery → JOIN
--------------------------------------------------

When it helps:
Correlated scalar subqueries can behave like N+1 queries. Rewriting them as joins allows PostgreSQL to use more efficient join strategies.

Before:

SELECT o.order_id,
       (SELECT c.region
        FROM customers c
        WHERE c.customer_id = o.customer_id) AS region
FROM orders o;

Query plan:

Explanation:
A correlated scalar subquery runs once per row of the outer table (orders). PostgreSQL has no choice but to execute a parameterized subplan, which means:
• Scan one order
• Perform an index lookup into customers
• Repeat for every order

The cost grows with the number of outer rows, so performance degrades roughly linearly with table size.

After:

SELECT o.order_id, c.region
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;

Query plan:

Explanation:
The rewritten query is a standard join, which removes the row-by-row dependency. Now PostgreSQL can:
• Scan both tables once
• Build a hash table on customers
• Use a Hash Join or Merge Join to match rows efficiently

This transforms the execution pattern from row-by-row lookups into a set-based join, which is far more efficient.

Why performance improves:
• Eliminates repeated index probes
• Replaces millions of per-row lookups with a single scan plus a hash join
• Better cache utilization
• Fully parallelizable (correlated subplans generally are not)

Net effect: orders and customers are processed together in a single, optimized join rather than in millions of micro-queries.

Conclusion:
The rewritten queries enable PostgreSQL to choose far more efficient physical operations, such as semi-joins, early aggregation, and set-based hash/merge joins, dramatically reducing row processing, memory usage, and repetitive work. Overall, these improvements streamline execution paths, allowing the optimizer to operate on smaller, cleaner datasets and produce faster, more scalable query plans.
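When applying rewrites like these, it is worth confirming that the before and after forms return identical results before comparing their plans. A quick sanity check using set difference, shown here for the Scenario 1 pair:

```sql
-- If the two forms are equivalent, this returns zero rows.
-- Swap the two operands to check the other direction as well.
(SELECT DISTINCT c.customer_id
 FROM customers c
 JOIN orders o ON o.customer_id = c.customer_id
 WHERE c.region = 'E')
EXCEPT
(SELECT c.customer_id
 FROM customers c
 WHERE c.region = 'E'
   AND EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id));
```

The same pattern applies to Scenarios 2 and 3; note that rewrites involving NULLs or outer joins are where such equivalence checks most often catch a behavioral difference.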