
Microsoft Blog for PostgreSQL
10 MIN READ

Connection Scaling in Elastic Clusters

EbruAydin
Microsoft
Apr 22, 2026

As your applications grow, you need to manage database connections carefully to keep performance predictable and reliable. Azure Database for PostgreSQL Elastic Clusters, powered by Citus, support horizontal scaling by distributing data and queries across multiple nodes. This raises a key question: how do connections behave as you add more clients, grow the cluster, or upgrade node specifications?

 

In this post, you’ll see how connection handling behaves in Elastic Clusters through controlled benchmarks across a small set of configurations.  

  • Core count: 2 and 4 
  • Cluster node count: 4 and 8 

Using these setups, we measured how throughput, latency, and resource usage change as connection counts increase for different workloads. The goal is not just to show numbers, but to explain why those numbers behave the way they do. We deliberately chose small SKUs (2 cores/8 GB and 4 cores/32 GB) because it is easier to observe how a smaller compute responds as connection counts grow.

These results help you understand: 

  • What influences connection capacity

  • When scaling up helps more than scaling out 
  • How to tune your cluster to match real workloads 

Future posts will expand this analysis to more configurations. For now, these insights can help you make informed decisions about cluster sizing and connection management. 

What you'll learn: 

  • How single-shard and multi-shard queries behave under increasing connection loads 
  • The impact of scaling up (more cores per node) vs. scaling out (more nodes) 
  • When PgBouncer helps—and when it hurts 
  • Practical limits imposed by memory, CPU, and connection parameters 
  • Configuration guidelines for different workload patterns 

How Elastic Clusters Handle Client and Internal Connections 

Before we dive into the performance results, let’s first look at how Elastic Clusters handle connections and execute queries. 

Cluster Architecture: Where Connections Go 

In Elastic Clusters, data is distributed across multiple nodes using Citus. Each node can accept client connections and execute queries, even when the data lives on another node. This design lets you scale horizontally while staying compatible with PostgreSQL. 

When your application opens a connection, the flow looks like this: 

  1. The client connects through a load balancer on port 7432 
  2. The load balancer routes the connection to any available node 
  3. That node executes the query and talks to other nodes if needed 

From the client’s point of view, this looks like a single server. Behind the scenes, however, the node can open additional internal connections to efficiently fetch distributed data.

Query Types Determine Connection Use 

How many connections a query consumes depends on what kind of query you run.  

 

Single-Shard Queries 

Single-shard queries target data that lives on exactly one node. A typical example is a lookup by the distribution key. 

  • If you connect to the node that owns the data, the query runs locally 
  • If you connect to a different node, Citus uses an internal connection to fetch the data 

 

Multi-Shard (Fan-Out) Queries 

Multi-shard queries need data from multiple nodes. Aggregations and analytical queries often fall into this category. For these queries: 

  • The connected node has internal connections to many nodes 
  • Each node processes its local shards in parallel 
  • Results are sent back and combined 

As a result, one client connection can fan out into many internal connections. 
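This fan-out can be sketched with a simplified model (an illustration only: it assumes shards are spread evenly and that each remote node needs one internal connection, capped by the executor pool; real connection counts depend on shard placement and configuration):

```python
def internal_connections(query_type, total_nodes, pool_size=16):
    """Rough estimate of internal connections one client query may open.

    Simplified model: evenly spread shards, one internal connection per
    remote node, capped by the adaptive executor pool size.
    """
    if query_type == "single-shard":
        # 0 if the connected node owns the data, 1 if it is fetched remotely
        return 1
    # multi-shard: fan out to every other node, limited by the pool
    return min(total_nodes - 1, pool_size)

# One client running a fan-out query on an 8-node cluster may hold up to
# 7 internal connections in addition to its own client connection.
print(internal_connections("multi-shard", 8))   # 7
print(internal_connections("multi-shard", 4))   # 3
```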

 

In both cases, when Citus needs to fetch data from another node, it first attempts to reuse a cached internal connection. If none is available, it creates a new connection and caches it.   

Important note: Explicit transactions change the behavior of single-shard queries. When a query runs inside a BEGIN … END block, Citus sends extra commands (BEGIN TRANSACTION + assign ID, COMMIT) to the worker node alongside the actual query. These extra network roundtrips add some coordination work compared to auto‑commit mode.

Key Configuration Parameters 

Elastic Clusters expose configuration parameters that control how connections are created, cached, and limited. To understand connection scaling—or to tune it effectively—you need to know what these parameters do and when they matter. This section focuses on a subset; see the Citus blog posts and documentation for more details. 

Parameter | Scope | Description | Typical use case | Default
max_connections | Cluster-wide | Total connections allowed | Set based on resources | Varies by SKU
citus.max_client_connections | Per node | Connections allowed from clients | Limit client load per node | Varies by SKU
citus.max_cached_conns_per_worker | Per client connection | Cached internal connections | Control parallelism in fan-out queries | 1
citus.max_adaptive_executor_pool_size | Per client connection | Max connections for parallel multi-shard execution | Control parallelism in fan-out queries | Varies by SKU (1 or 16)

In PostgreSQL, Memory Plays a Key Role in Connection Scaling: 
  • Idle connections typically use 2–5 MB 
  • Active connections often use 10–20 MB, depending on work_mem and query behavior 
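Those per-connection figures allow a back-of-the-envelope connection budget. The sketch below is hypothetical: the 15 MB midpoint, the amount of RAM reserved for shared buffers and the OS, and the resulting ceilings are assumptions for illustration, not measured limits.

```python
def max_active_connections(ram_gb, reserved_gb=2.0, mb_per_conn=15):
    """Rough ceiling on concurrently active connections before memory
    pressure, assuming ~10-20 MB per active backend (15 MB midpoint)
    and some RAM reserved for shared buffers and the OS."""
    usable_mb = (ram_gb - reserved_gb) * 1024
    return int(usable_mb // mb_per_conn)

# Illustrative ceilings for the two node sizes used in this post:
print(max_active_connections(8))    # 8 GB node  -> 409
print(max_active_connections(32))   # 32 GB node -> 2048
```

Actual limits depend on work_mem and query behavior, which is why the benchmarks below observe different practical ceilings.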
Some Parameters Only Affect Certain Query Types: 

Parameters like citus.max_adaptive_executor_pool_size only affect multi-shard queries. 

Internal Connection Caching Is Critical: 

Internal connection caching has a large impact on performance. 

  • Setting citus.max_cached_conns_per_worker to 0 forces Citus to open new connections repeatedly 
  • This leads to additional connection setup work and reduces overall throughput

  • Values greater than 1 only help multi-shard workloads by allowing more parallelism 

In practice, disabling internal caching is usually a bad idea. 

SKU Choice Changes Which Limits You Hit First 
  • Smaller nodes tend to reach memory thresholds sooner

  • Larger nodes may hit connection limits before CPU or memory saturation 

How We Benchmarked 

To understand how connections scale in Elastic Clusters, we ran a set of controlled benchmarks using standard PostgreSQL tooling. 

Test Environment 

  • Tool: pgbench (PostgreSQL's standard benchmarking utility), running on a VM in the same region and zone as the cluster 
  • Dataset: Distributed pgbench_accounts table with a scale factor of 3000 (~38 GB) and distribution column aid (account ID) 

Setup commands: 

# Create table structure
pgbench -i -I "dt"

# Distribute the accounts table
psql -c "SELECT create_distributed_table('pgbench_accounts', 'aid');"

# Populate with data
pgbench -i -I "gvp" -s 3000

# Add index for multi-shard queries
psql -c "CREATE INDEX index_bid ON pgbench_accounts(bid);"

Workloads 

Single-Shard Query: The single-shard workload targets one shard using the distribution key. 

\set aid random(1, 100000 * scale)
BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
END;

Multi-Shard Query: The multi-shard workload aggregates data across multiple shards. 

\set bid random(1, scale)
BEGIN;
SELECT sum(abalance) FROM pgbench_accounts WHERE bid = :bid;
END;

How the Benchmarks Run  

Each benchmark: 

  • Ran for 600 seconds 
  • Used 16 pgbench worker threads 
  • Varied the number of client connections (-c) per run 

For each cluster configuration, we gradually increased the client count. We stopped when we observed connection or out-of-memory errors. 

Test Matrix 

We evaluated combinations of core counts, node counts, and query types: 

Workload | Configurations tested
Single-shard | 2-core/4-node, 2-core/8-node, 4-core/4-node, 4-core/8-node
Multi-shard | 2-core/4-node, 2-core/8-node, 4-core/4-node, 4-core/8-node
Single-shard with PgBouncer | 2-core/4-node, 2-core/8-node

Single-Shard Query Performance 

Single-shard queries are a good model for OLTP workloads: indexed lookups that target a single row (or a small set of rows). 

Massive Throughput Gains with Scaling 

Peak throughput climbed from ~11.4k TPS (2c4n) to ~48.3k TPS (4c8n), a 4.3× increase from doubling cores per node and doubling node count (4× total cores).

Higher Concurrency = Saturation Point 

Each configuration has an optimal operating point for throughput.

Near-Linear Scaling 

Doubling cores roughly doubled throughput (≈2.4–2.5× gains), and doubling nodes added ~70–75% more TPS. 

 

 

Key Takeaways from Single-shard Experiments

1. Near-Linear Scaling with Resources 

Doubling CPU cores approximately doubled throughput, and doubling node count added 70-75% more throughput:

Configuration | Peak TPS | Peak Clients | CPU at Peak
2-core, 4-node | 11,400 | 64 | ~95%
2-core, 8-node | 19,400 | 192 | ~93%
4-core, 4-node | 27,500 | 128 | ~97%
4-core, 8-node | 48,300 | 224 | ~90%

What this means: Adding compute resources effectively increases capacity. The slight sub-linearity when adding nodes (1.7× instead of 2×) arises because the extra round-trips from explicit transactions only affect remote shard queries, and the fraction of remote queries grows with cluster size. In a 4-node cluster, 75% of queries hit a remote shard; in an 8-node cluster, 87.5% do. Since each remote query inside a BEGIN … END block pays the extra round-trip cost, the aggregate overhead increases as more nodes are added.
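The remote-shard fractions quoted above follow directly from even shard placement: a query's target shard lives on the connected node only 1/n of the time.

```python
def remote_shard_fraction(nodes: int) -> float:
    # With shards spread evenly across n nodes, a single-shard query
    # lands on a remote node (n - 1) / n of the time.
    return (nodes - 1) / nodes

print(remote_shard_fraction(4))  # 0.75  -> 75% of queries pay the round-trip cost
print(remote_shard_fraction(8))  # 0.875 -> 87.5%
```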

2. Saturation Points Vary by Configuration

Each cluster configuration reaches peak throughput at a specific client count. Beyond this point, additional clients lead to: 

  • Throughput plateaus
  • Latency rises as the system operates near maximum capacity
  • Contention and context switching increase 

Smaller configurations reach peak throughput sooner (64 clients for 2c4n), while larger ones handle several hundred concurrent clients before reaching peak.

3. Memory Constrains Maximum Connections, Not Node Count

A key insight is that adding nodes does not necessarily increase total client capacity when the workload is constrained by memory limits.

  • 2-core clusters: Maximum ~500 concurrent clients (both 4-node and 8-node) 
  • 4-core clusters: Maximum ~1000 concurrent clients (both 4-node and 8-node) 

Why? Each node maintains roughly client_count total connections: 

  • client_count / node_count external client connections (via load balancer) 
  • Remaining connections are cached internal Citus connections 

Once memory capacity is fully utilized, adding nodes alone does not always raise client capacity. External connections drop per node, but internal connections replace them, so each node still carries roughly the same load—and still can run out of memory.
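The arithmetic above can be sketched as a simple model (simplified: it assumes external clients are spread evenly and that cached internal connections account for the remainder, as described in the bullets):

```python
def per_node_connections(total_clients, nodes):
    """Approximate connections each node holds under the model above:
    external clients are spread evenly across nodes, and cached internal
    connections fill in the rest, so the per-node total stays near
    total_clients regardless of node count."""
    external = total_clients / nodes          # clients via the load balancer
    internal = total_clients - external       # cached internal Citus connections
    return external + internal

# Doubling the node count does not reduce the per-node connection load:
print(per_node_connections(500, 4))  # 500.0
print(per_node_connections(500, 8))  # 500.0
```

This is why the observed client ceilings track node memory (SKU) rather than node count.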

4. CPU Utilization Reaches 90-97% at Peak

All configurations fully utilized available CPU at peak throughput, confirming CPU as the primary throughput bottleneck (with memory limiting connection capacity separately). 

5. Latency Characteristics 

Latency remains low (<5-10ms) until the system approaches saturation. Larger clusters maintain better latency under load: 

  • 4c8n cluster: Sub-5ms latency up to ~200 clients 
  • 2c4n cluster: Latency exceeds 10-20ms once saturated (~64 clients) 

Practical implication: Right-sizing your cluster provides headroom for traffic spikes without latency degradation. 

Single-Shard Query Performance with PgBouncer 

On Elastic Clusters, you can enable PgBouncer using server parameters. Once enabled, you connect to the PgBouncer instances through the load balancer on port 8432. This connection pooling layer allows the cluster to handle far more concurrent client connections.

  • PgBouncer instances: One per node, behind load balancer 
  • Pool size: 50 connections per node (default) 
  • Pooling mode: Transaction pooling

Eliminated Connection Limits 

Handled thousands of concurrent clients without out-of-memory errors

Slightly Lower Peak TPS 

~10% reduction in maximum throughput due to pooling overhead 

Linear Latency Growth 

Predictable queuing behavior once the pool saturates   

Key Takeaways from Single-shard Experiments with PgBouncer

1. Graceful handling of high connection counts 

Without PgBouncer, 2‑core clusters reached memory capacity around 500 clients. With PgBouncer enabled, tests successfully ran with 1000+ concurrent clients. Throughput plateaued as the pool saturated, but the system remained stable. 

2. Throughput-Latency Trade-Off

Once the connection pool fills (~50 active connections per node), additional clients queue: 

  • Throughput stabilizes at the pool's processing capacity 
  • Latency increases with queue depth

  • Predictable, graceful behavior under high load
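This throughput-latency trade-off can be illustrated with a queuing sketch based on Little's law. The numbers are hypothetical: the 5 ms service time is an assumption for illustration, not a measured value.

```python
def latency_under_pooling(clients, pool_per_node, nodes, service_ms):
    """Little's law sketch: once the pool saturates, throughput is fixed
    by pool capacity, and additional clients simply wait in the queue."""
    capacity = pool_per_node * nodes               # max concurrently active queries
    active = min(clients, capacity)
    throughput = active / (service_ms / 1000.0)    # queries per second
    return clients / throughput * 1000.0           # avg latency in ms (W = L / lambda)

# 4-node cluster, pool of 50 per node, assumed 5 ms service time:
print(latency_under_pooling(200, 50, 4, 5))   # 5.0 ms  (pool exactly full)
print(latency_under_pooling(400, 50, 4, 5))   # 10.0 ms (queueing doubles latency)
print(latency_under_pooling(800, 50, 4, 5))   # 20.0 ms (latency grows linearly)
```

The linear latency growth past saturation matches the "predictable queuing" behavior observed in the PgBouncer runs.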

3. When to use PgBouncer

Recommended for: 

  • Applications with bursty connection patterns (many short-lived connections) 
  • High connection counts that exceed node memory capacity 
  • Workloads where occasional queuing latency is acceptable 

Not recommended for: 

  • Applications requiring maximum throughput from steady workloads 
  • Long-running transactions (incompatible with transaction pooling) 
  • Scenarios where every millisecond of latency matters

Multi-Shard Query Performance 

Multi-shard (fan-out) queries aggregate or join data across multiple nodes, representing analytical or reporting workloads. 

Massive Throughput Gains with Scaling 

Scaling up and out dramatically improved fan-out query throughput, from ~52 TPS to ~1,008 TPS on the largest configuration (≈20× gain). 

 Low Concurrency Saturation 

Multi-shard queries peaked at low client counts—just 8 clients for 2-core clusters and ~96 for 4-core, 8-node setups. 

 Latency Improves with Scale 

Larger clusters maintained sub-100 ms latency under higher concurrency, while smaller ones degraded quickly. 

Memory & I/O Bottlenecks  

Sufficient memory is crucial for fan-out queries, as memory starvation causes throughput to plateau well before CPU is fully utilized.  

Key Takeaways from Multi-shard Experiments

1. Lower Absolute Throughput Compared to Single-shard Workloads

Even the best-performing configuration (4c8n at ~1,000 TPS) achieves ~2% of single-shard throughput (~48k TPS). This reflects the inherent cost of analytical fan-out queries: cross-node aggregation and significantly larger data retrieval per query.  

 2. Scaling Provides Dramatic Gains 

While absolute TPS remains modest, the 20× improvement from smallest to largest demonstrates that multi-shard workloads benefit enormously from scaling. 

Configuration | Peak TPS | Peak Clients | Latency at Peak
2-core, 4-node | 52 | 8 | ~150ms
2-core, 8-node | 168 | 8 | ~50ms
4-core, 4-node | 489 | 32 | ~65ms
4-core, 8-node | 1,008 | 96 | ~95ms

3. Saturates at Low Concurrency 

Multi-shard queries reach peak throughput at fewer concurrent clients than single-shard queries: 

  • 2-core clusters: Saturate at just 8 clients 
  • 4-core, 8-node: Saturates around 96 clients 

After these points, the system maintains stable TPS, with latency rising as load increases.

4. Memory and I/O Bottlenecks Dominate Small Configurations

The 2-core configurations (8 GB RAM per node) showed clear resource pressure:

  • Memory pressure: Working sets exceeded available RAM, causing paging 
  • High IOPS: Thousands of disk operations per second indicated swapping to disk 
  • Throughput ceiling: Memory availability capped TPS before CPU was fully utilized

In contrast, 4-core configurations (32 GB RAM per node) kept working sets in memory, achieving much higher throughput with minimal I/O. 

Key insight: For multi-shard workloads, sufficient memory is more important than CPU cores. Adequate memory provisioning is essential to unlock full performance.

5. Latency Escalates Rapidly Under Overload

All configurations delivered fast response times (<50ms) at low loads. Once saturated: 

  • 2c4n cluster: Latency increased noticeably under sustained overload
  • 4c8n cluster: Remained under 100ms until approaching the 96-client saturation point 

Practical implication: For multi-shard query workloads, over-provision resources to maintain consistent and predictable latency. 

Conclusion 

Connection scaling in Azure Database for PostgreSQL Elastic Clusters is a multifaceted challenge that depends on workload characteristics, cluster configuration, and resource constraints. Key takeaways from our benchmarking on read workloads: 

For Single-Shard OLTP-like Workloads: 

  • Scaling provides near-linear throughput gains (4.3× from 2c4n to 4c8n) 
  • Memory, not node count, determines maximum concurrent connections 
  • CPU becomes the throughput bottleneck at peak load 
  • PgBouncer trades ~10% throughput for almost unlimited connection scalability 

For Multi-Shard OLAP-like Workloads: 

  • Throughput operates at a different scale than single-shard workloads 
  • Relative gains from scaling are massive (20× improvement observed) 
  • Memory sufficiency is critical—adequate RAM is essential to maintain strong performance  
  • Saturation occurs at low concurrency; keep client counts conservative 

General Principles: 

  • Scale up (higher SKU) to support more connections and memory-intensive queries 
  • Scale out (more nodes) to increase aggregate throughput and data capacity 
  • Use PgBouncer to manage connection bursts and exceed node memory limits 
  • Monitor continuously and adjust based on actual workload patterns 

By understanding these dynamics and applying the decision frameworks provided, you can architect Elastic Clusters that deliver optimal performance, reliability, and cost-efficiency for your specific application requirements.

 


Updated Apr 16, 2026
Version 1.0