Blog Post

Microsoft Blog for PostgreSQL
12 MIN READ

Understanding Hash Join Memory Usage and OOM Risks in PostgreSQL

FranciscoPardillo
Mar 09, 2026

This article demonstrates how PostgreSQL queries can consume significant memory—even when parameters such as work_mem or hash_mem_multiplier are configured with conservative values. The goal is to help DBAs and developers understand a common source of unexpected Out‑Of‑Memory (OOM) conditions: Hash Join operations on extremely low‑cardinality data distributions. The behavior is not specific to Azure Database for PostgreSQL. It can be reproduced identically on any PostgreSQL installation, including on‑premises environments and containerized deployments.

Background: Why Memory Usage May Exceed work_mem

work_mem is commonly assumed to be a hard upper bound on per‑query memory usage.

However, for Hash Join operations, memory consumption depends not only on this parameter but also on:

✅ Data cardinality

✅ Hash table internal bucket distribution

✅ Join column characteristics

✅ Number of batches created

✅ Parallel workers involved

Under low‑cardinality conditions, a Hash Join may place an extremely large number of rows into very few buckets—sometimes a single bucket. This causes unexpectedly large memory allocations that exceed the nominal work_mem limit.
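To build intuition, here is a toy Python sketch (not PostgreSQL's actual hashing code) showing how cardinality alone determines how rows fall into buckets: with distinct keys the rows spread out, while a single repeated key piles every row into one bucket.

```python
import hashlib

def bucket_of(key: str, n_buckets: int) -> int:
    # Stable toy hash; PostgreSQL uses its own internal hash functions.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_buckets

def bucket_histogram(keys, n_buckets=1024):
    counts = [0] * n_buckets
    for k in keys:
        counts[bucket_of(k, n_buckets)] += 1
    return counts

# High cardinality: 100k distinct keys spread roughly evenly across buckets.
high = bucket_histogram(str(i) for i in range(100_000))
# Low cardinality: 100k rows that all share the key '30020'.
low = bucket_histogram("30020" for _ in range(100_000))

print(max(high))  # largest bucket holds roughly 100_000 / 1024 rows
print(max(low))   # 100000 -- every row lands in a single bucket
```

Batching can split a hash table across temp files, but it cannot split one overfull bucket chain: rows with the same key always hash to the same bucket, which is why skew defeats the work_mem limit.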

Background: What work_mem really means for Hash Joins

  • work_mem controls the amount of memory available per operation (e.g., a sort or a hash) per node (and per parallel worker) before spilling to disk. Hash operations can additionally use hash_mem_multiplier×work_mem for their hash tables. [postgresql.org], [postgresqlco.nf]
  • The Hash Join algorithm builds a hash table for the “build/inner” side and probes it with the “outer” side. The table is split into buckets; if it doesn’t fit in memory, PostgreSQL partitions work into batches (spilling to temporary files). Skewed distributions (e.g., very few distinct join keys) pack many rows into the same bucket(s), exploding memory usage even when work_mem is small. [postgrespro.com], [interdb.jp]
  • In EXPLAIN (ANALYZE) you’ll see Buckets:, Batches:, and Memory Usage: on the Hash node; Batches > 1 indicates spilling/partitioning. [postgresql.org], [thoughtbot.com]
  • hash_mem_multiplier was introduced in PG13 with a default of 1.0; the default was raised to 2.0 in PG15. Tune with care; it scales the memory that hash operations may consume relative to work_mem. [pgpedia.info]
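As a back-of-the-envelope check, the planned allowance for a single hash table under the defaults is simple arithmetic, and comparing it against the 371094kB observed later in the skewed run shows the scale of the overshoot:

```python
# Planned limit for one hash table under assumed default settings.
work_mem_kb = 4 * 1024          # default work_mem = 4MB
hash_mem_multiplier = 2.0       # default since PostgreSQL 15

hash_limit_kb = work_mem_kb * hash_mem_multiplier
print(f"planned hash table limit: {hash_limit_kb:.0f}kB")  # 8192kB

# Memory Usage observed in the skewed run in this article.
observed_kb = 371_094
print(f"overshoot factor: {observed_kb / hash_limit_kb:.0f}x")  # ~45x
```

The numbers are illustrative; the point is that the observed usage is a function of data distribution, not of the configured limits.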

 

A safe, reproducible demo (containerized community PostgreSQL)

The goal is to show that data distribution alone can drive orders-of-magnitude differences in hash table memory, using conservative settings.

To simulate the behavior we'll use the pg_hint_plan extension to guide the execution plans, and create a data distribution that would rarely arise from sound application logic, purely to force and demonstrate the behavior.

  • Start PostgreSQL 16 container

docker run --name=postgresql16.8 -p 5414:5432 -e POSTGRES_PASSWORD=<password> -d postgres:16.8
docker exec -it postgresql16.8 /bin/bash -c "apt-get update -y; apt-get install -y procps postgresql-16-pg-hint-plan vim htop"
docker exec -it postgresql16.8 /bin/bash

# Inside the container, add pg_hint_plan to shared_preload_libraries:
vi /var/lib/postgresql/data/postgresql.conf
#   shared_preload_libraries = 'pg_hint_plan'

# Restart the container so the setting takes effect, then create the extension:
docker stop postgresql16.8
docker start postgresql16.8

psql -h localhost -U postgres
create extension pg_hint_plan;

To connect to our docker container we use:

psql -h localhost -p 5414 -U postgres

 

  • Connect and apply conservative session-level settings

We’ll discourage parallelism and alternative plan shapes so the planner prefers a serial Hash Join, and enable pg_hint_plan diagnostics for the session.

set hash_mem_multiplier=1;
set max_parallel_workers=0;
set max_parallel_workers_per_gather=0;
set enable_parallel_hash=off;
set enable_material=off;
set enable_sort=off;
set pg_hint_plan.debug_print=verbose;
set client_min_messages=notice;
set pg_hint_plan.enable_hint_table=on;
  • Create tables and load data

We'll create two tables for the join: table_s with a single row, and table_h initially with 10 million rows.

drop table if exists table_s;
create table table_s (column_a text);
insert into table_s values ('30020');
vacuum full table_s;

drop table if exists table_h;
create table table_h(column_a text,column_b text);

INSERT INTO table_h(column_a,column_b)
SELECT i::text, i::text
FROM generate_series(1, 10000000) AS t(i);
vacuum full table_h;
  • Run Hash Join (high cardinality)

We'll run the join on column_a, which at this point has high cardinality in table_h (every row holds a distinct value).

explain (analyze,buffers,costs,verbose) SELECT /*+ HashJoin(s h) Leading((s h)) */ COUNT(*)
FROM table_s s
JOIN table_h h
  ON s.column_a = h.column_a;

You should see a Hash node with small Memory Usage (a few MB) and Batches: 256 or similar due to the modest default work_mem, but no ballooning. Exact numbers vary by hardware/version/stats. (EXPLAIN fields and their interpretation are documented in the PostgreSQL manual.) [postgresql.org]

                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=280930.01..280930.02 rows=1 width=8) (actual time=1902.965..1902.968 rows=1 loops=1)
   Output: count(*)
   Buffers: shared read=54055, temp read=135 written=34041
   ->  Hash Join  (cost=279054.00..280805.01 rows=50000 width=0) (actual time=1900.539..1902.949 rows=1 loops=1)
         Hash Cond: (s.column_a = h.column_a)
         Buffers: shared read=54055, temp read=135 written=34041
         ->  Seq Scan on public.table_s s  (cost=0.00..1.01 rows=1 width=32) (actual time=0.021..0.022 rows=1 loops=1)
               Output: s.column_a
               Buffers: shared read=1
         ->  Hash  (cost=154054.00..154054.00 rows=10000000 width=32) (actual time=1896.895..1896.896 rows=10000000 loops=1)
               Output: h.column_a
               Buckets: 65536  Batches: 256  Memory Usage: 2031kB
               Buffers: shared read=54054, temp written=33785
               ->  Seq Scan on public.table_h h  (cost=0.00..154054.00 rows=10000000 width=32) (actual time=2.538..638.830 rows=10000000 loops=1)
                     Output: h.column_a
                     Buffers: shared read=54054
 Query Identifier: 334721522907995613
 Planning:
   Buffers: shared hit=10
 Planning Time: 0.302 ms
 JIT:
   Functions: 11
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.441 ms, Inlining 0.000 ms, Optimization 0.236 ms, Emission 2.339 ms, Total 3.017 ms
 Execution Time: 1903.472 ms
(25 rows)

Findings (1): with fully distributed, high-cardinality data, the Hash node reports only 2031kB of Memory Usage (well within work_mem), with shared hit/read = 54055.
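If you want to pull these numbers out of EXPLAIN text output programmatically, a small illustrative parser for the Hash node summary line might look like this (the field names come from the plan output above; the helper itself is hypothetical):

```python
import re

# Matches e.g. "Buckets: 131072 (originally 131072)  Batches: 512 (originally 256)  Memory Usage: 371094kB"
HASH_LINE = re.compile(
    r"Buckets:\s*(\d+)(?:\s*\(originally \d+\))?\s*"
    r"Batches:\s*(\d+)(?:\s*\(originally \d+\))?\s*"
    r"Memory Usage:\s*(\d+)kB"
)

def hash_stats(plan_text: str):
    """Extract Buckets/Batches/Memory Usage from a Hash node line, or None."""
    m = HASH_LINE.search(plan_text)
    if not m:
        return None
    buckets, batches, memory_kb = map(int, m.groups())
    return {"buckets": buckets, "batches": batches,
            "memory_kb": memory_kb, "spilled": batches > 1}

print(hash_stats("Buckets: 65536  Batches: 256  Memory Usage: 2031kB"))
# {'buckets': 65536, 'batches': 256, 'memory_kb': 2031, 'spilled': True}
```

This is handy when comparing many plan captures: anything with spilled=True wrote temp files, and a large memory_kb despite small work_mem is the skew signature this article demonstrates.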

  • Force low cardinality / skew and re‑run

We'll update table_h so that column_a is '30020' in every row, leaving only 1 distinct value across the whole table.

update table_h set column_a='30020', column_b='30020';
vacuum full table_h;

Checking execution plan:

                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=279056.04..279056.05 rows=1 width=8) (actual time=3568.936..3568.938 rows=1 loops=1)
   Output: count(*)
   Buffers: shared read=54056, temp read=63480 written=63480
   ->  Hash Join  (cost=279055.00..279056.03 rows=1 width=0) (actual time=2650.696..3228.610 rows=10000000 loops=1)
         Hash Cond: (s.column_a = h.column_a)
         Buffers: shared read=54056, temp read=63480 written=63480
         ->  Seq Scan on public.table_s s  (cost=0.00..1.01 rows=1 width=32) (actual time=0.007..0.008 rows=1 loops=1)
               Output: s.column_a
               Buffers: shared read=1
         ->  Hash  (cost=154055.00..154055.00 rows=10000000 width=7) (actual time=1563.987..1563.989 rows=10000000 loops=1)
               Output: h.column_a
               Buckets: 131072 (originally 131072)  Batches: 512 (originally 256)  Memory Usage: 371094kB
               Buffers: shared read=54055, temp written=31738
               ->  Seq Scan on public.table_h h  (cost=0.00..154055.00 rows=10000000 width=7) (actual time=2.458..606.422 rows=10000000 loops=1)
                     Output: h.column_a
                     Buffers: shared read=54055
 Query Identifier: 334721522907995613
 Planning:
   Buffers: shared hit=6 read=1 dirtied=1
 Planning Time: 0.237 ms
 JIT:
   Functions: 11
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.330 ms, Inlining 0.000 ms, Optimization 0.203 ms, Emission 2.311 ms, Total 2.844 ms
 Execution Time: 3584.439 ms
(25 rows)

Now, the Hash node typically reports hundreds of MB of Memory Usage, with more/larger temp spills (higher Batches, more temp_blks_*). What changed? Only the distribution (cardinality). (Why buckets/batches behave this way is covered in the algorithm references.) [postgrespro.com], [interdb.jp]

Findings (2): with LOW-cardinality data, the Hash node reports 371094kB of Memory Usage (far beyond work_mem), with shared hit/read = 54056.

So the query handles the same amount of data in terms of shared buffers, yet shows a completely different memory pattern. Low cardinality makes the Hash Join pack most rows into a single bucket, and that in-memory bucket chain is not bounded by work_mem by default, so it can cause OOM errors at any time.

  • Scale up rows to observe linear growth

We'll insert the existing rows of table_h back into itself, doubling the rowcount each time, so we can test with more data (still low cardinality).

insert into table_h select * from table_h;
vacuum full table_h;

You’ll see Memory Usage and temp I/O scale with rowcount under skew. (Beware: this can become I/O and RAM heavy—do this incrementally.) [thoughtbot.com]

 

NumRows table_h | Shared read/hit | Dirtied | Written | Temp read/written | Memory Usage (work_mem)
10M | 54056 | 0 | | 63480+63480 | 371094kB
20M | 108110 | 0 | | 126956+126956 | 742188kB
80M | 432434 (1.64GB) | 0 | 0 | 253908+253908 | 2968750kB (2.8GB)
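The near-linear relationship in the figures above is easy to verify numerically; this short check uses the Memory Usage values from the runs in this article:

```python
# Memory Usage reported by the Hash node at each scale (kB), from the runs above.
runs = {10_000_000: 371_094, 20_000_000: 742_188, 80_000_000: 2_968_750}

base_rows = 10_000_000
base_kb = runs[base_rows]
for rows, kb in runs.items():
    scale = rows / base_rows
    ratio = kb / base_kb
    # Under skew, hash table memory grows almost exactly with row count.
    print(f"{scale:>4.0f}x rows -> {ratio:.3f}x memory")
```

Doubling the rows doubles the memory, and 8x the rows gives 8x the memory: the single overfull bucket must hold every matching row, so there is no upper bound short of physical RAM.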

 

Observability: what you will (and won’t) see

  • EXPLAIN is your friend

EXPLAIN (ANALYZE, BUFFERS) exposes Memory Usage, Buckets:, Batches: in the Hash node and temp block I/O. Batches > 1 is a near‑certain sign of spilling. [postgresql.org], [thoughtbot.com]

  • Query Store / pg_stat_statements limitations

    • Azure Database for PostgreSQL – Flexible Server Query Store aggregates runtime and (optionally) wait stats over time windows and stores them in azure_sys, with views under query_store.*. It’s great for finding which queries chew CPU/I/O or wait, but it doesn’t report per‑query transient memory usage (e.g., “how many MB did that hash table peak at?”); you can only estimate it by reviewing the temporary blocks. [learn.microsoft.com]
    • Under the hood, what you do get—whether via Query Store or vanilla PostgreSQL pg_stat_statements—are cumulative counters like shared_blks_read, shared_blks_hit, temp_blks_read, temp_blks_written, timings, etc. Those help confirm buffer/temp activity, yet no direct hash table memory metric exists. Combine them with EXPLAIN and server metrics to triangulate. [postgresql.org]

Tip (Azure Flexible Server)
Enable Query Store in Server parameters via pg_qs.query_capture_mode and (optionally) wait sampling via pgms_wait_sampling.query_capture_mode, then use query_store.qs_view to correlate temp block usage and execution times across intervals. [learn.microsoft.com]

Typical OOM symptom in logs

In extreme skew with concurrent executions, you may encounter:

ERROR: out of memory

DETAIL: Failed on request of size 32800 in memory context "HashBatchContext".

 

This is a classic signature of hash join memory pressure. [postgresql.org], [thisistheway.wiki]
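A quick, illustrative way to flag this signature when scanning server logs (the match string comes from the DETAIL line above; the helper name is hypothetical):

```python
def is_hashjoin_oom(line: str) -> bool:
    # The DETAIL line PostgreSQL emits for this failure mode names the
    # memory context; matching on it avoids false positives from other OOMs.
    return 'memory context "HashBatchContext"' in line

sample_log = [
    "LOG:  checkpoint starting: time",
    "ERROR:  out of memory",
    'DETAIL:  Failed on request of size 32800 in memory context "HashBatchContext".',
]
print([line for line in sample_log if is_hashjoin_oom(line)])
```

Pairing this with the Query Identifier logged alongside the error lets you trace the failure back to a specific statement in pg_stat_statements or Query Store.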

What to do about it (mitigations & best practices)

  • Don’t force Hash Join unless required
    If you used planner hints (e.g., pg_hint_plan) or GUCs (Grand Unified Configuration) to force Hash Join, remove them and let the planner re‑evaluate. (If you must hint, be aware pg_hint_plan is a third‑party extension and not available in all environments.) [pg-hint-pl...thedocs.io], [pg-hint-pl...thedocs.io]
  • Fix skew / cardinality at the source
    • Re‑model data to avoid low‑NDV (Number of Distinct Values in a column) joins (e.g., pre‑aggregate, filter earlier, or exclude degenerate keys).
    • Ensure statistics are current so the planner estimates are realistic. (Skew awareness is limited; poor estimates → risky sizing.) [postgresql.org]
  • Pick a safer join strategy when appropriate
    If distribution is highly skewed, Merge Join (with supporting indexes/sort order) or Nested Loop (for selective probes) might be more memory‑predictable. Let the planner choose, or enable alternatives by undoing GUCs that disabled them. [postgresql.org]
  • Bound memory consciously
    • Keep work_mem modest for mixed/OLTP workloads; remember it’s per operation, per node, per worker.
    • Adjust hash_mem_multiplier judiciously (introduced in PG13; default now commonly 2.0) if you understand the spill trade‑offs. [postgresqlco.nf], [pgpedia.info]
  • Observe spills and tune iteratively
    Use EXPLAIN (ANALYZE, BUFFERS) to see Batches (spills) and Memory Usage; use Query Store/pg_stat_statements to find which queries generate the most temp I/O. Raise work_mem for a session only when justified. [postgresql.org], [postgresql.org]
  • Parallelism awareness
    Each worker can perform its own memory‑using operations; parallel hash join has distinct behavior. If you aren’t sure, temporarily disable parallelism to simplify analysis, then re‑enable once you understand the footprint. [postgresql.org]

Validating on Azure Database for PostgreSQL – Flexible Server

The behavior is not Azure‑specific, but you can reproduce the same sequence on Flexible Server (e.g., General Purpose). A few notes:

  • Confirm/adjust work_mem, hash_mem_multiplier, enable_* planner toggles as session settings. (Azure exposes standard PostgreSQL parameters.) [learn.microsoft.com]
  • Use Query Store to confirm stable shared/temporary block patterns across executions, then use EXPLAIN (ANALYZE, BUFFERS) per query to spot hash table memory footprints. [learn.microsoft.com], [postgresql.org]
  • Changing some default parameters

We'll repeat the previous steps on Azure Database for PostgreSQL Flexible Server:

set hash_mem_multiplier=1;
set max_parallel_workers=0;
set max_parallel_workers_per_gather=0;
set enable_parallel_hash=off;
set enable_material=off;
set enable_sort=off;
set pg_hint_plan.debug_print=verbose;
set client_min_messages=notice;
set pg_hint_plan.enable_hint_table=on;
  • Creating and populating tables

drop table if exists table_s;
create table table_s (column_a text);
insert into table_s values ('30020');
vacuum full table_s;
 
drop table if exists table_h;
create table table_h(column_a text,column_b text);
 
INSERT INTO table_h(column_a,column_b)
SELECT i::text, i::text
FROM generate_series(1, 10000000) AS t(i);
vacuum full table_h;
vacuum full table_s;
  • Query & Execution plan

explain (analyze,buffers,costs,verbose) SELECT /*+ HashJoin(s h) Leading((s h)) */ COUNT(*)
FROM table_s s
JOIN table_h h
  ON s.column_a = h.column_a;

 

                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=279052.88..279052.89 rows=1 width=8) (actual time=3171.186..3171.191 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=33 read=54023, temp read=135 written=34042
   I/O Timings: shared read=184.869, temp read=0.278 write=333.970
   ->  Hash Join  (cost=279051.84..279052.88 rows=1 width=0) (actual time=3147.288..3171.182 rows=1 loops=1)
         Hash Cond: (s.column_a = h.column_a)
         Buffers: shared hit=33 read=54023, temp read=135 written=34042
         I/O Timings: shared read=184.869, temp read=0.278 write=333.970
         ->  Seq Scan on public.table_s s  (cost=0.00..1.01 rows=1 width=32) (actual time=0.315..0.316 rows=1 loops=1)
               Output: s.column_a
               Buffers: shared read=1
               I/O Timings: shared read=0.018
         ->  Hash  (cost=154053.04..154053.04 rows=9999904 width=7) (actual time=3109.278..3109.279 rows=10000000 loops=1)
               Output: h.column_a
               Buckets: 131072  Batches: 256  Memory Usage: 2551kB
               Buffers: shared hit=32 read=54022, temp written=33786
               I/O Timings: shared read=184.851, temp write=332.059
               ->  Seq Scan on public.table_h h  (cost=0.00..154053.04 rows=9999904 width=7) (actual time=0.019..1258.472 rows=10000000 loops=1)
                     Output: h.column_a
                     Buffers: shared hit=32 read=54022
                     I/O Timings: shared read=184.851
 Query Identifier: 5636209387670245929
 Planning:
   Buffers: shared hit=37
 Planning Time: 0.575 ms
 Execution Time: 3171.375 ms
(26 rows)

Findings (3): in Azure Database for PostgreSQL Flexible Server, with fully distributed, high-cardinality data the Hash node reports only 2551kB of Memory Usage (well within work_mem), with shared hit+read = 54056.

 

  • Skew it to LOW cardinality

As before, we update column_a so that every row in table_h shares a single value:

update table_h set column_a='30020', column_b='30020';
vacuum full table_h;

In this case we force the join method with pg_hint_plan: 

explain (analyze,buffers,costs,verbose) SELECT /*+ HashJoin(s h) Leading((s h)) */ COUNT(*)
FROM table_s s
JOIN table_h h
  ON s.column_a = h.column_a;

                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=279056.04..279056.05 rows=1 width=8) (actual time=4397.556..4397.560 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=2 read=54055, temp read=63480 written=63480
   I/O Timings: shared read=89.396, temp read=90.377 write=300.290
   ->  Hash Join  (cost=279055.00..279056.03 rows=1 width=0) (actual time=3271.145..3987.154 rows=10000000 loops=1)
         Hash Cond: (s.column_a = h.column_a)
         Buffers: shared hit=2 read=54055, temp read=63480 written=63480
         I/O Timings: shared read=89.396, temp read=90.377 write=300.290
         ->  Seq Scan on public.table_s s  (cost=0.00..1.01 rows=1 width=32) (actual time=0.006..0.008 rows=1 loops=1)
               Output: s.column_a
               Buffers: shared hit=1
         ->  Hash  (cost=154055.00..154055.00 rows=10000000 width=7) (actual time=1958.729..1958.731 rows=10000000 loops=1)
               Output: h.column_a
               Buckets: 262144 (originally 262144)  Batches: 256 (originally 128)  Memory Usage: 371094kB
               Buffers: shared read=54055, temp written=31738
               I/O Timings: shared read=89.396, temp write=149.076
               ->  Seq Scan on public.table_h h  (cost=0.00..154055.00 rows=10000000 width=7) (actual time=0.159..789.449 rows=10000000 loops=1)
                     Output: h.column_a
                     Buffers: shared read=54055
                     I/O Timings: shared read=89.396
 Query Identifier: 8893575855188549861
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.157 ms
 Execution Time: 4414.268 ms
(25 rows)

 

NumRows table_h | Shared read/hit | Dirtied | Written | Temp read/written | Memory Usage (work_mem)
10M | 54056 | 0 | | 63480+63480 | 371094kB
20M | 108110 | 0 | | 126956+126956 | 742188kB
80M | 432434 | 0 | 0 | 253908+253908 | 2968750kB

We observe essentially the same numbers as in our Docker installation.

See the extension docs for installation/usage and the hint table for cases where you want to force a specific join method. [pg-hint-pl...thedocs.io], [pg-hint-pl...thedocs.io]

FAQ

Q: I set work_mem = 4MB. Why did my Hash Join report ~371MB Memory Usage?
A: Hash joins can use up to hash_mem_multiplier × work_mem per hash table, and skew can cause large per‑bucket chains. Multiple nodes/workers multiply usage. work_mem is not a global hard cap. [postgresqlco.nf], [pgpedia.info]

Q: How do I know if a Hash Join spilled to disk?
A: In EXPLAIN (ANALYZE), Hash node shows Batches: N. N > 1 indicates partitioning and temp I/O; you’ll also see temp_blks_read/written in buffers and Temp I/O timings. [postgresql.org], [thoughtbot.com]

Q: Can Query Store tell me per‑query memory consumption?
A: Not directly. It gives time‑sliced runtime and wait stats (plus temp/shared block counters via underlying stats), but no “peak MB used by this hash table” metric. Use EXPLAIN and server metrics. [learn.microsoft.com], [postgresql.org]

Q: I hit “Failed on request … in HashBatchContext.” What’s that?
A: That’s an OOM raised by the executor while allocating memory. Reduce skew, avoid forced hash joins, or review per‑query memory and concurrency. [postgresql.org]

Ready‑to‑use mitigation checklist (DBA quick wins)

  • Remove join hints/GUC overrides that force Hash Join; re‑plan. [pg-hint-pl...thedocs.io]
  • Refresh stats; confirm realistic rowcount/NDV estimates. [postgresql.org]
  • Consider alternate join strategies (Merge/Index‑Nested‑Loop) when skew is high. [postgresql.org]
  • Keep work_mem conservative for OLTP; consider session‑scoped bumps only for specific analytic queries. [postgresql.org]
  • Tune hash_mem_multiplier carefully only after understanding spill patterns. [postgresqlco.nf]
  • Use EXPLAIN (ANALYZE, BUFFERS) to verify Batches and Memory Usage. [postgresql.org]
  • Use Query Store/pg_stat_statements to find heavy temp/shared I/O offenders over time.
Updated Mar 09, 2026
Version 1.0