No code left behind: How AI streamlines Oracle-to-PostgreSQL migration
Coauthored by Jonathon Frost, Aditya Duvuri and Shriram Muthukrishnan

More and more organizations are choosing PostgreSQL over proprietary database platforms such as Oracle, and for good reasons. It's fully open source and community supported, with a steady pace of innovation. It's also preferred by developers for its extensibility and flexibility, often being used for vector data along with relational data to support modern applications and agents. Still, organizations considering a shift from Oracle to PostgreSQL may hesitate due to the complexity that often accompanies an enterprise-scale migration project. Challenges such as incompatible data types, language mismatches, and the risk of breaking critical applications are hard to ignore. Recently, the Azure Postgres team released a new, free tool for Oracle-to-PostgreSQL migrations designed to address these challenges, making the decision to migrate far less risky. The new AI-assisted Oracle-to-PostgreSQL migration tool, available in public preview via the PostgreSQL extension for Visual Studio Code, brings automation, validation, and AI-powered migration assistance into a single, user-friendly interface.

Meet your new migration assistant

The AI-assisted Oracle to PostgreSQL migration tool dramatically simplifies moving off Oracle databases. Accessible through VS Code, the tool uses intelligent automation, powered by GitHub Copilot, to convert Oracle database schemas and PL/SQL code into PostgreSQL-compatible formats. It can analyze an Oracle schema and automatically translate table definitions, data types, and even stored procedures and triggers into PostgreSQL equivalents, speeding up migrations that once took months of manual effort. By handling the heavy lifting of schema and code conversion, the tool allows teams to focus on higher-level testing and optimization rather than tedious code rewrites. Users are already reporting that migrations are faster, safer, and more transparent.
The tool is simple, free, and ready for you to use today. Let's take a look at how it works by covering the following:

- Creating the migration project
- Setting up the connections
- AI-assisted schema migration
- Reviewing the schema migration report
- AI-assisted application migration
- Reviewing the application migration report

Step by step with the AI-assisted Oracle-to-PostgreSQL migration tool

Step 1 – Create the project in VS Code

Start by installing or updating the PostgreSQL extension for VS Code from the marketplace. Open the PostgreSQL extension panel and click "Create Migration Project." You'll name your project, which will create a folder to store all migration artifacts. This folder will house extracted and converted files, organized for version control and collaboration.

Step 2 – Connect to your databases and AI model

Before beginning the migration, you'll need to connect to the Oracle databases and select an OpenAI model to leverage during the process. Enter the connection details for your source Oracle database, credentials, and the schema to migrate. Then, select a PostgreSQL scratch database. This temporary environment is used to validate converted DDL in real time. Next, you will be prompted to select an OpenAI model.

Step 3 – Begin schema migration

Once you've set up your connections, click the button to start the schema migration. The tool performs an extraction of all relevant Oracle database objects: tables, views, packages, procedures, and more. The extracted DDL is saved as files in your project folder. This file-based approach functions like a software project, enabling change tracking, collaboration, and source control.

Enter – AI assistance

This is where the AI takes over. The tool breaks the extracted schema into manageable chunks, and each chunk is processed by a multi-agent orchestration system: The Migration Specialist Agent converts Oracle DDL to PostgreSQL.
The Migration Critic Agent validates the conversion by executing it in the PostgreSQL scratch database. The Documentation Agent captures follow-up review tasks, metadata, and coding notes for later integration with the application code migration process. Each chunk is converted, validated, and deployed. If validation fails, the agents auto-correct and retry. This self-healing loop ensures high conversion accuracy. Essentially, the tool conducts compile-time validation against a live PostgreSQL instance to catch issues early and reduce downstream surprises.

Checkpoint – review the schema migration report

Some complex objects, like Oracle packages with intricate PL/SQL, may not convert cleanly on the first pass. These are flagged as "review tasks." You can invoke GitHub Copilot's agent mode directly from VS Code to assist. The tool constructs a composite prompt with the original Oracle DDL, the partially converted PostgreSQL version, and any validation errors. This context-rich prompt enables Copilot to generate more accurate fixes. With the schema fully converted, you can compare the original Oracle and new PostgreSQL versions side by side. Right-click any object in the project folder and select "Compare File Pair." You can also use the "Visualize Schema" feature to see a graphical representation of the converted schema. This is ideal for verifying tables, relationships, and constraints. Once the schema migration is complete, the tool generates a detailed report that includes:

- Total number of objects converted
- Conversion success rate
- PostgreSQL version and extensions used
- List of converted objects by type
- Any flagged review tasks

This report serves as both a validation summary and an audit artifact. It helps confirm success and identify any follow-up actions. If you have compliance or change management requirements to meet, this documentation is essential.
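The self-healing convert–validate–retry loop described above (a specialist converts, a critic validates against the scratch database, the pair retries on failure, and anything left over becomes a review task) can be sketched in Python. This is an illustrative model only: `migrate_chunk`, `convert`, and `validate` are hypothetical stand-ins for the tool's agents, not its actual API.

```python
# Illustrative sketch of the convert -> validate -> retry loop.
# `convert` and `validate` are stand-ins for the Migration Specialist
# and Migration Critic agents; the real tool delegates these to AI agents.

def migrate_chunk(chunk, convert, validate, max_retries=3):
    """Convert one DDL chunk, validating against a scratch database.

    `convert(chunk, feedback)` returns candidate PostgreSQL DDL;
    `validate(ddl)` returns an error string, or None on success.
    """
    feedback = None
    for attempt in range(1, max_retries + 1):
        ddl = convert(chunk, feedback)
        feedback = validate(ddl)
        if feedback is None:
            return {"status": "converted", "ddl": ddl, "attempts": attempt}
    # Conversion never validated: flag it as a "review task" for a human
    return {"status": "review", "ddl": ddl, "attempts": max_retries,
            "last_error": feedback}
```

The key property mirrored here is that validation errors are fed back into the next conversion attempt, which is what makes the loop self-correcting rather than a blind retry.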
Step 4 – Begin application migration

The next phase the tool supports is updating the application code that interacts with the schema. Migrations often stall when code is overlooked or when traditional tools treat SQL statements as simple strings rather than part of a cohesive system. The AI-assisted Oracle-to-PostgreSQL migration tool's application conversion feature takes a more holistic, context-aware approach. Before starting, you'll need to configure GitHub Copilot Agent Mode with a capable AI model. Then, navigate to the 'application_code' directory, typically found in .github/postgres-migration/<project_name>/application_code, and copy your source code into this directory. Keeping your application and converted schema together provides the AI with the structural context it needs to refactor your code accurately. To start the app migration, this time select the "Migrate Application" button. Then select the folder containing your source code and the converted schema.

Enter – AI assistance

The AI orchestrator will analyze your application's database interactions against the new Postgres schema and generate a series of transformation tasks. These tasks address SQL dialect changes, data access modifications, and library updates. This process goes beyond a simple search-and-replace operation. The AI queries your migrated PostgreSQL database to gain grounded context about your converted schema, and ensures that things like function signatures, data types, and ORM models are migrated correctly in the application code.

Checkpoint – review the app migration report

When the AI finishes converting your application, it produces a detailed summary. The report lists which files were migrated, notes any unresolved tasks, and outlines how the changes map to the database schema. This audit-ready document helps DBAs and developers collaborate effectively on follow-up actions and integration testing.
You can use VS Code's built-in diff viewer to compare each migrated file with its original. Right-click on a migrated file and select "Compare App Migration File Pairs" to open a side-by-side view. This comparison highlights differences in SQL queries, driver imports, and other code changes, allowing you to verify the updates.

Wrapping up the migration project

During schema migration, the tool created detailed coding notes summarizing data-type mappings, constraints, and package transformations. These notes are essential for understanding why specific changes were made and for guiding the application conversion. Use them as reference points when validating and refining the AI-generated application code.

Destination – PostgreSQL on Azure

The AI-assisted Oracle-to-PostgreSQL migration tool brings together automation, validation, and AI to make Oracle-to-PostgreSQL migrations faster, safer, and more transparent. With schema extraction, multi-agent orchestration, app conversion, real-time validation, and detailed reporting, it provides a clear, confident path to modernization so you can start taking advantage of the benefits of open-source Postgres.

What's in store

On the other side of a successful migration project to PostgreSQL on Azure, you get:

- First-class support in Azure
- Significantly lower total cost of ownership from eliminating license fees and reducing vendor lock-in
- Unmatched extensibility, with support for custom data types, procedural languages, and powerful extensions like PostGIS, TimescaleDB, pgvector, Azure AI, and DiskANN
- Frequent updates and cutting-edge features delivered via a vibrant open-source community

Whether you're migrating a single schema or leading a broader replatforming initiative, the AI-assisted Oracle-to-PostgreSQL migration tool helps you move forward with confidence without sacrificing control or visibility. Learn more about starting your own migration project.

PostgreSQL Buffer Cache Analysis
PostgreSQL performance is often dictated not just by query design or indexing strategy, but by how effectively the database leverages memory. At the heart of this memory usage lies shared_buffers—PostgreSQL's primary buffer cache. Understanding how well this cache is utilized can make the difference between a system that scales smoothly and one that struggles under load. In this post, we'll walk you through a practical, data-driven approach to analyzing PostgreSQL buffer cache behavior using native statistics and the pg_buffercache extension. The goal is to answer a few critical questions:

- Is the current shared_buffers configuration sufficient?
- Are high-value tables and indexes actually being served from memory?
- Is PostgreSQL spending too much time going to disk when it shouldn't?

By the end, you'll have a repeatable methodology to assess cache efficiency and make informed tuning decisions.

Why Buffer Cache Analysis Matters

PostgreSQL relies heavily on its buffer cache to minimize disk I/O. Every time a query needs a data or index page, PostgreSQL first checks whether that page already exists in shared_buffers. If it does, the page is served directly from memory—fast and efficient. If not, PostgreSQL must fetch it from disk (or the OS page cache), which is significantly slower. While metrics like query latency and IOPS can tell you that performance is degraded, buffer cache analysis helps explain why. It allows you to:

- Validate whether frequently accessed objects stay hot in cache
- Identify cache pollution caused by large, low-value tables
- Determine whether increasing shared_buffers would provide real benefits or just waste memory

Inspecting Shared Buffers with pg_buffercache

The pg_buffercache extension provides a real-time view into PostgreSQL's shared buffers. Unlike cumulative statistics, it shows what is in memory right now—which relations are cached, how many blocks they occupy, and how frequently those buffers are reused.
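To build intuition for how a usage-count cache keeps hot pages resident, here is a toy Python model. It is loosely inspired by PostgreSQL's clock-sweep eviction (usage counts are incremented on access, capped, and decremented while searching for a victim), but it is greatly simplified for intuition only; the class and its behavior are illustrative, not the actual server implementation.

```python
# Toy model of a usage-count buffer cache (loosely inspired by PostgreSQL's
# clock-sweep eviction; greatly simplified, for intuition only).

class BufferCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffers = {}  # page -> usage count (capped at 5, as in PostgreSQL)
        self.hits = self.misses = 0

    def read(self, page):
        if page in self.buffers:
            self.hits += 1
            self.buffers[page] = min(self.buffers[page] + 1, 5)
            return "hit"
        self.misses += 1
        if len(self.buffers) >= self.capacity:
            self._evict()
        self.buffers[page] = 1
        return "miss"

    def _evict(self):
        # Decrement usage counts until some buffer reaches 0, then evict it;
        # frequently used ("hot") pages survive longer than cold ones.
        while True:
            for page, count in list(self.buffers.items()):
                if count == 0:
                    del self.buffers[page]
                    return
                self.buffers[page] = count - 1
```

The point of the model is the behavior the article describes: a page read repeatedly accumulates a high usage count and survives eviction pressure, while a page touched once is evicted quickly.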
Enabling the Extension

pg_buffercache is not enabled by default and requires superuser privileges:

CREATE EXTENSION pg_buffercache;

Once enabled, you can directly query the contents of shared buffers across databases, tables, and indexes.

Analyzing Cache Distribution

Understanding where your shared buffers are being consumed is the first step toward meaningful tuning.

Database-Level Cache Distribution

This query shows how shared buffers are distributed across databases in the server:

SELECT CASE
         WHEN c.reldatabase IS NULL THEN ''
         WHEN c.reldatabase = 0 THEN ''
         ELSE d.datname
       END AS database,
       count(*) AS cached_blocks
FROM pg_buffercache AS c
LEFT JOIN pg_database AS d ON c.reldatabase = d.oid
WHERE datname NOT LIKE 'template%'
GROUP BY d.datname, c.reldatabase
ORDER BY d.datname, c.reldatabase;

This is particularly useful in multi-database environments where one workload may be evicting cache pages needed by another.

Table and Index-Level Cache Consumption

To understand which relations dominate the cache, the following query breaks buffer usage down by tables and indexes:

SELECT c.relname, c.relkind, count(*)
FROM pg_database AS a, pg_buffercache AS b, pg_class AS c
WHERE c.relfilenode = b.relfilenode
  AND b.reldatabase = a.oid
GROUP BY 1, 2
ORDER BY 3 DESC, 1;

This helps answer an important question: are your most business-critical tables and indexes actually resident in memory, or are they constantly being evicted? If large, rarely used tables consume a disproportionate share of buffers, it may indicate cache churn or the need for workload isolation.

Understanding Buffer Usage Count (Hot vs Cold Data)

Each buffer in shared memory carries a usage count, which reflects how frequently it has been accessed before eviction. Higher values indicate hotter data.
SELECT c.relname, c.relkind, usagecount, count(*) AS buffers
FROM pg_database AS a, pg_buffercache AS b, pg_class AS c
WHERE c.relfilenode = b.relfilenode
  AND b.reldatabase = a.oid
  AND a.datname = current_database()
GROUP BY 1, 2, 3
ORDER BY 3 DESC, 1;

A healthy system typically shows a meaningful number of buffers with higher usage counts (for example, 4–5), indicating frequently reused data that benefits from caching.

Buffer Cache Percentages: Putting Numbers in Context

Raw buffer counts are useful, but percentages make interpretation easier. The following query shows:

- How much of shared_buffers each relation occupies
- What percentage of the relation itself is cached

SELECT c.relname,
       pg_size_pretty(count(*) * 8192) AS buffered,
       round(100.0 * count(*) / (SELECT setting FROM pg_settings WHERE name='shared_buffers')::integer, 1) AS buffers_percent,
       round(100.0 * count(*) * 8192 / pg_relation_size(c.oid), 1) AS percent_of_relation
FROM pg_class c
JOIN pg_buffercache b ON b.relfilenode = c.relfilenode
JOIN pg_database d ON b.reldatabase = d.oid AND d.datname = current_database()
GROUP BY c.oid, c.relname
ORDER BY 3 DESC
LIMIT 10;

This view is especially powerful when validating whether performance-critical objects are adequately cached relative to their size.

Complementing Cache Views with I/O Statistics

While pg_buffercache shows the current state of memory, I/O statistics reveal long-term trends. PostgreSQL exposes these via pg_statio_user_tables and pg_statio_user_indexes.

Table Heap Hit Ratios

SELECT relname,
       heap_blks_hit::numeric / (heap_blks_hit + heap_blks_read) AS hit_pct,
       heap_blks_hit,
       heap_blks_read
FROM pg_catalog.pg_statio_user_tables
WHERE (heap_blks_hit + heap_blks_read) > 0
ORDER BY hit_pct;

Hit ratios close to 1 indicate that table data is largely served from memory rather than disk.
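The ratio that query computes is simply hits / (hits + reads). A small helper makes it easy to post-process rows fetched from pg_statio_user_tables and flag tables that fall below a target ratio; the function names and the 0.90 threshold here are illustrative choices, not PostgreSQL APIs.

```python
def heap_hit_ratio(heap_blks_hit, heap_blks_read):
    """Fraction of heap block requests served from shared_buffers.

    Mirrors the expression in the pg_statio_user_tables query above:
    heap_blks_hit / (heap_blks_hit + heap_blks_read).
    """
    total = heap_blks_hit + heap_blks_read
    if total == 0:
        return None  # table has never been read; ratio is undefined
    return heap_blks_hit / total

def flag_cold_tables(statio_rows, threshold=0.90):
    """Return (relname, ratio) pairs whose hit ratio falls below `threshold`.

    `statio_rows` is an iterable of (relname, heap_blks_hit, heap_blks_read)
    tuples, e.g. rows fetched from pg_statio_user_tables.
    """
    cold = []
    for relname, hit, read in statio_rows:
        ratio = heap_hit_ratio(hit, read)
        if ratio is not None and ratio < threshold:
            cold.append((relname, round(ratio, 3)))
    return cold
```

Guarding against a zero denominator matters in practice, since freshly created or never-read tables would otherwise divide by zero (the SQL above avoids the same case with its WHERE clause).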
Index Hit Ratios

SELECT relname,
       idx_blks_hit::numeric / (idx_blks_hit + idx_blks_read) AS hit_pct,
       idx_blks_hit,
       idx_blks_read
FROM pg_catalog.pg_statio_user_tables
WHERE (idx_blks_hit + idx_blks_read) > 0
ORDER BY hit_pct;

Poor index hit ratios often point to insufficient cache or inefficient query patterns that bypass indexes.

Including TOAST and Index Reads

For large objects, TOAST activity can significantly impact I/O. This query provides a more holistic view:

SELECT *,
       (heap_blks_read + toast_blks_read + tidx_blks_read) AS total_blocks_read,
       (heap_blks_hit + toast_blks_hit + tidx_blks_hit) AS total_blocks_hit
FROM pg_catalog.pg_statio_user_tables;

This helps identify indexes that are frequently read from disk and may benefit from better caching or query rewrites.

How to Interpret the Results

When reviewing buffer cache and I/O metrics, keep the following guidelines in mind:

- Validate cache residency of critical objects: If business-critical tables and indexes occupy a meaningful share of shared_buffers, your cache sizing is likely reasonable.
- Correlate buffer data with hit ratios: High hit ratios in pg_statio_user_tables and pg_statio_user_indexes confirm effective caching. Persistently low ratios may justify increasing shared_buffers.
- Analyze usage count distribution: A healthy number of buffers with higher usage counts indicates hot data benefiting from cache reuse.
- Avoid over-tuning: If most buffers have low usage counts but hit ratios remain high, increasing shared_buffers further may not yield measurable gains.

Conclusion

Buffer cache analysis bridges the gap between theory and reality in PostgreSQL performance tuning. By combining real-time cache inspection with long-term I/O statistics, you gain a clear picture of how memory is actually used—and whether changes to shared_buffers will deliver tangible benefits.
Rather than tuning memory blindly, this approach lets you optimize with confidence, grounded in data that reflects your real workload.

GraphRAG and PostgreSQL integration in docker with Cypher query and AI agents (Version 2*)
This is an update to the previous blog (version 1): GraphRAG and PostgreSQL integration in docker with Cypher query and AI agents | Microsoft Community Hub. Review the business needs of this solution from version 1.

What's new in version 2?

MCP tools for GraphRAG and PostgreSQL with Apache AGE

This solution now includes MCP tools for GraphRAG and PostgreSQL. There are five MCP tools exposed:

- [graphrag_search] Used to run a query (local or global) with runtime-tunable API parameters. One important aspect is that query behavior can be tuned at runtime, without changing the underlying index.
- [age_get_schema_cached] Used for schema inspection and diagnostics. It returns the graph schema (node labels and relationship types) from cache by default, and can optionally refresh the cache by re-querying the database. This tool is typically used for introspection or debugging, not for answering user questions about data.
- [age_entity_lookup] Used for quick entity discovery and disambiguation. It performs a simple substring match on entity names or titles and is especially useful for questions like "Who is X?" or as a preliminary step before issuing more complex graph queries.
- [age_cypher_query] Executes a user-provided Cypher query directly against the AGE graph. This is intended for advanced users who already know the graph structure and want full control over traversal logic and filters.
- [age_nl2cypher_query] Bridges natural language and Cypher. This tool converts a natural-language question into a Cypher query (using only Entity nodes and RELATED_TO edges), executes it, and returns the results. It is most effective for multi-hop or structurally complex questions where semantic interpretation is needed first, but execution must remain deterministic.

Besides that, this solution now uses the Microsoft Agent Framework.
It enables clean orchestration over MCP tools, allowing the agent to dynamically select between GraphRAG and graph query capabilities at runtime, with looser coupling and a clearer execution model than traditional Semantic Kernel function plugins. The new Docker image includes GraphRAG 3.0.5. This version stabilizes the 3.x configuration-driven, API-based architecture and improves indexing reliability, making graph construction more predictable and easier to integrate into real workflows.

New architecture

Updated Step 7 – run query in Jupyter notebook

This step runs a Jupyter notebook in Docker, the same as in the previous blog.

> docker compose up query-notebook

After clicking the link highlighted in the above screenshot, you can explore all files within the project in the Docker container, then find query-notebook.ipynb: https://github.com/Azure-Samples/postgreSQL-graphRAG-docker/blob/main/project_folder/query-notebook.ipynb

In this new version of the notebook, GraphRAG 3.0.5 uses a different library for local search and global search.

New Step 8 – run agent and MCP tools in Jupyter notebook

This step runs a Jupyter notebook in Docker.

> docker compose up mcp-agent

Click on the highlighted URL to start working on agent-notebook.ipynb: https://github.com/Azure-Samples/postgreSQL-graphRAG-docker/blob/main/project_folder/agent-notebook...

Multiple scenarios of agents with MCP tools are included in the notebook:

- GraphRAG search: local search and global search examples with a direct MCP call.
- GraphRAG search: local search and global search examples with an agent that includes MCP tools.
- Cypher query via a direct MCP call.
- Agent to query in natural language, with an MCP tool included to convert NL to Cypher.
- Agent with unified MCP (all five MCP tools) that routes to the corresponding tool based on the question.
['graphrag_search', 'age_get_schema_cached', 'age_cypher_query', 'age_entity_lookup', 'age_nl2cypher_query']

Router agent: selecting the right MCP tool

The notebook also includes a router agent that has access to all five MCP tools and decides which one to invoke based on the user's question. Rather than hard-coding execution paths, the agent reasons about intent and selects the most appropriate capability at runtime.

General routing guidance used in this solution

Use [graphrag_search] when the question requires: full dataset understanding; themes, patterns, or trends across documents; exploratory or open-ended analysis; global understanding or evaluation over a corpus of many tokens. In these cases, GraphRAG's semantic retrieval and aggregation are a better fit than explicit graph traversal.

Use the AGE-based tools [age_get_schema_cached, age_entity_lookup, age_cypher_query, age_nl2cypher_query] when the question involves: specific entities or explicit relationships; deterministic graph traversal or filtering; questions that depend on graph structure rather than document semantics; complex graph queries involving multiple entities or multi-hop paths.

Within the AGE toolset:

- [age_entity_lookup] is typically used for quick entity discovery or disambiguation.
- [age_cypher_query] is used when a precise Cypher query is already known.
- [age_nl2cypher_query] is used when the question is expressed in natural language but requires a non-trivial Cypher query to answer.
- [age_get_schema_cached] is reserved for schema inspection and diagnostics.

The router agent dynamically selects between semantic search and deterministic graph tools based on question intent, keeping retrieval, graph execution, and orchestration clearly separated and extensible.

Note: The repository also includes [age_get_schema] and [age_get_schema_details] MCP tools for debugging and development purposes.
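As a rough illustration of the routing guidance, the decision could be approximated with a keyword-based classifier. The real router agent reasons about intent with an LLM; this toy `route_question` function (a hypothetical helper, not part of the repository) only makes the decision boundaries between the five MCP tools concrete.

```python
# Minimal keyword-based sketch of the routing guidance above. The real
# router agent reasons about intent with an LLM; this toy version only
# illustrates the decision boundaries between the five MCP tools.

def route_question(question, looks_like_cypher=False):
    q = question.lower()
    if looks_like_cypher:
        return "age_cypher_query"           # caller already knows the query
    if any(k in q for k in ("schema", "node labels", "relationship types")):
        return "age_get_schema_cached"      # introspection / diagnostics
    if q.startswith("who is") or q.startswith("what is"):
        return "age_entity_lookup"          # quick entity discovery
    if any(k in q for k in ("themes", "trends", "overall", "across documents")):
        return "graphrag_search"            # corpus-wide semantic questions
    return "age_nl2cypher_query"            # structured NL question -> Cypher
```

Even in this crude form, the split matches the guidance: corpus-wide semantic questions go to GraphRAG, while entity- and structure-centric questions stay on the deterministic AGE path.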
These are not exposed to agents by default and are superseded by [age_get_schema_cached] for normal use.

Key takeaways

- GraphRAG and PostgreSQL AGE querying serve different purposes, and each has its advantages.
- MCP tools provide a uniform interface to both semantic search and deterministic graph operations.
- Microsoft Agent Framework enables tool-centric orchestration, where agents select the right capability at runtime instead of hard-coding logic in prompts.
- The Jupyter-based agent workflow makes it easy to experiment with different interaction patterns, from direct tool calls to fully routed agent execution.

What's next

In this solution, the MCP server and agent runtime are architecturally separated but deployed together in a single Docker container to demonstrate how MCP tools work and to keep local experimentation simple. There are other deployment options, such as running MCP servers remotely, where tools can be hosted and operated independently of the agent runtime. Contributions and enhancements are welcome.

Understanding Hash Join Memory Usage and OOM Risks in PostgreSQL
Background: Why Memory Usage May Exceed work_mem

work_mem is commonly assumed to be a hard upper bound on per-query memory usage. However, for Hash Join operations, memory consumption depends not only on this parameter but also on:

✅ Data cardinality
✅ Hash table internal bucket distribution
✅ Join column characteristics
✅ Number of batches created
✅ Parallel workers involved

Under low-cardinality conditions, a Hash Join may place an extremely large number of rows into very few buckets—sometimes a single bucket. This causes unexpectedly large memory allocations that exceed the nominal work_mem limit.

Background: What work_mem really means for Hash Joins

work_mem controls the amount of memory available per operation (e.g., a sort or a hash) per node (and per parallel worker) before spilling to disk. Hash operations can additionally use hash_mem_multiplier × work_mem for their hash tables. [postgresql.org], [postgresqlco.nf]

The Hash Join algorithm builds a hash table for the "build/inner" side and probes it with the "outer" side. The table is split into buckets; if it doesn't fit in memory, PostgreSQL partitions work into batches (spilling to temporary files). Skewed distributions (e.g., very few distinct join keys) pack many rows into the same bucket(s), exploding memory usage even when work_mem is small. [postgrespro.com], [interdb.jp]

In EXPLAIN (ANALYZE) you'll see Buckets:, Batches:, and Memory Usage: on the Hash node; Batches > 1 indicates spilling/partitioning. [postgresql.org], [thoughtbot.com]

The default for hash_mem_multiplier is version-dependent (introduced in PG13; 1.0 in early versions, later 2.0). Tune with care; it scales the memory that hash operations may consume relative to work_mem. [pgpedia.info]

A safe, reproducible demo (containerized community PostgreSQL)

The goal is to show that data distribution alone can drive order(s)-of-magnitude differences in hash table memory, using conservative settings.
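Under these definitions, the nominal in-memory budget for a hash table is work_mem × hash_mem_multiplier, and the planner roughly doubles the batch count until each batch is expected to fit. The back-of-the-envelope sketch below makes that arithmetic explicit; it is deliberately simplified (it ignores per-bucket overhead and, crucially, assumes evenly distributed keys — the very assumption that skewed data breaks).

```python
def hash_budget_bytes(work_mem_kb=4096, hash_mem_multiplier=2.0):
    """Nominal per-node, per-worker in-memory budget for a hash table:
    work_mem * hash_mem_multiplier (defaults mirror common PG15+ settings)."""
    return int(work_mem_kb * 1024 * hash_mem_multiplier)

def estimated_batches(build_rows, row_bytes, budget_bytes):
    """Rough batch count: double the number of batches until each batch's
    expected share of the build side fits in the budget. This ignores
    bucket overhead and assumes an even key distribution -- exactly the
    assumption that low-cardinality (skewed) join keys violate."""
    batches = 1
    while build_rows * row_bytes / batches > budget_bytes:
        batches *= 2
    return batches
```

With total skew (one distinct key), batching cannot help at all: every row hashes to the same batch, so the real memory footprint tracks the build-side size regardless of what this estimate predicts, which is the failure mode the rest of the article demonstrates.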
To simulate the behavior we'll use the pg_hint_plan extension to guide the execution plans and create a data distribution that may not reflect good application logic, just to force and show the behavior.

Start PostgreSQL 16 container

docker run --name=postgresql16.8 -p 5414:5432 -e POSTGRES_PASSWORD=<password> -d postgres:16.8
docker exec -it postgresql16.8 /bin/bash -c "apt-get update -y; apt-get install procps -y; apt-get install postgresql-16-pg-hint-plan -y; apt-get install vim -y; apt-get install htop -y"
docker exec -it postgresql16.8 /bin/bash
vi /var/lib/postgresql/data/postgresql.conf -- Adding pg_hint_plan to shared_preload_libraries
psql -h localhost -U postgres
create extension pg_hint_plan;
docker stop postgresql16.8
docker start postgresql16.8

To connect to our Docker container we use:

psql -h localhost -p 5414 -U postgres

Connect and apply conservative session-level settings

We'll discourage other join methods so the planner prefers Hash Join—without needing any extension.
set hash_mem_multiplier=1;
set max_parallel_workers=0;
set max_parallel_workers_per_gather=0;
set enable_parallel_hash=off;
set enable_material=off;
set enable_sort=off;
set pg_hint_plan.debug_print=verbose;
set client_min_messages=notice;
set pg_hint_plan.enable_hint_table=on;

Create tables and load data

We'll create two tables for the join: table_s with a single row, and table_h initially with 10 million rows.

drop table if exists table_s;
create table table_s (column_a text);
insert into table_s values ('30020');
vacuum full table_s;

drop table if exists table_h;
create table table_h(column_a text, column_b text);
INSERT INTO table_h(column_a, column_b) SELECT i::text, i::text FROM generate_series(1, 10000000) AS t(i);
vacuum full table_h;

Run Hash Join (high cardinality)

We'll run the join using column_a in both tables, which in table_h was created with high cardinality:

explain (analyze,buffers,costs,verbose) SELECT /*+ HashJoin(s h) Leading((s h)) */ COUNT(*) FROM table_s s JOIN table_h h ON s.column_a = h.column_a;

You should see a Hash node with small Memory Usage (a few MB) and Batches: 256 or similar due to our tiny work_mem, but no ballooning. Exact numbers vary by hardware/version/stats. (EXPLAIN fields and interpretation are documented here.)
[postgresql.org]

                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=280930.01..280930.02 rows=1 width=8) (actual time=1902.965..1902.968 rows=1 loops=1)
   Output: count(*)
   Buffers: shared read=54055, temp read=135 written=34041
   ->  Hash Join  (cost=279054.00..280805.01 rows=50000 width=0) (actual time=1900.539..1902.949 rows=1 loops=1)
         Hash Cond: (s.column_a = h.column_a)
         Buffers: shared read=54055, temp read=135 written=34041
         ->  Seq Scan on public.table_s s  (cost=0.00..1.01 rows=1 width=32) (actual time=0.021..0.022 rows=1 loops=1)
               Output: s.column_a
               Buffers: shared read=1
         ->  Hash  (cost=154054.00..154054.00 rows=10000000 width=32) (actual time=1896.895..1896.896 rows=10000000 loops=1)
               Output: h.column_a
               Buckets: 65536  Batches: 256  Memory Usage: 2031kB
               Buffers: shared read=54054, temp written=33785
               ->  Seq Scan on public.table_h h  (cost=0.00..154054.00 rows=10000000 width=32) (actual time=2.538..638.830 rows=10000000 loops=1)
                     Output: h.column_a
                     Buffers: shared read=54054
 Query Identifier: 334721522907995613
 Planning:
   Buffers: shared hit=10
 Planning Time: 0.302 ms
 JIT:
   Functions: 11
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.441 ms, Inlining 0.000 ms, Optimization 0.236 ms, Emission 2.339 ms, Total 3.017 ms
 Execution Time: 1903.472 ms
(25 rows)

Findings (1)

When the data is fully distributed with high cardinality, the Hash node takes only 2031kB of memory usage (work_mem), with shared hit/read=54055.

Force low cardinality / skew and re-run

We'll update table_h to set column_a to '30020' for all rows, leaving only one distinct value in the table:

update table_h set column_a='30020', column_b='30020';
vacuum full table_h;

Checking the execution plan:

                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=279056.04..279056.05 rows=1 width=8) (actual time=3568.936..3568.938 rows=1 loops=1)
   Output: count(*)
   Buffers: shared read=54056, temp read=63480 written=63480
   ->  Hash Join  (cost=279055.00..279056.03 rows=1 width=0) (actual time=2650.696..3228.610 rows=10000000 loops=1)
         Hash Cond: (s.column_a = h.column_a)
         Buffers: shared read=54056, temp read=63480 written=63480
         ->  Seq Scan on public.table_s s  (cost=0.00..1.01 rows=1 width=32) (actual time=0.007..0.008 rows=1 loops=1)
               Output: s.column_a
               Buffers: shared read=1
         ->  Hash  (cost=154055.00..154055.00 rows=10000000 width=7) (actual time=1563.987..1563.989 rows=10000000 loops=1)
               Output: h.column_a
               Buckets: 131072 (originally 131072)  Batches: 512 (originally 256)  Memory Usage: 371094kB
               Buffers: shared read=54055, temp written=31738
               ->  Seq Scan on public.table_h h  (cost=0.00..154055.00 rows=10000000 width=7) (actual time=2.458..606.422 rows=10000000 loops=1)
                     Output: h.column_a
                     Buffers: shared read=54055
 Query Identifier: 334721522907995613
 Planning:
   Buffers: shared hit=6 read=1 dirtied=1
 Planning Time: 0.237 ms
 JIT:
   Functions: 11
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.330 ms, Inlining 0.000 ms, Optimization 0.203 ms, Emission 2.311 ms, Total 2.844 ms
 Execution Time: 3584.439 ms
(25 rows)

Now, the Hash node typically reports hundreds of MB of Memory Usage, with more/larger temp spills (higher Batches, more temp_blks_*). What changed? Only the distribution (cardinality). (Why buckets/batches behave this way is covered in the algorithm references.)
[postgrespro.com], [interdb.jp]

Findings (2)

When the data is distributed with LOW cardinality, the query takes 371094kB of memory usage (work_mem), with shared hit/read=54056. So the query handles the same amount of data in terms of shared memory, but shows a totally different work_mem usage pattern: because of the low cardinality, the join method (Hash Join) puts most of those rows in a single bucket, and that is not limited by default, so it can cause OOM errors at any time.

Scale up rows to observe linear growth

We'll insert the same rows into table_h repeatedly so we can play with more data (low cardinality):

insert into table_h select * from table_h;
vacuum full table_h;

You'll see Memory Usage and temp I/O scale with rowcount under skew. (Beware: this can become I/O and RAM heavy—do this incrementally.) [thoughtbot.com]

NumRows table_h | Shared read/hit  | Dirtied | Written | Temp read/written | Memory Usage (work_mem)
10M             | 54056            | 0       |         | 63480+63480       | 371094kB
20M             | 108110           | 0       |         | 126956+126956     | 742188kB
80M             | 432434 (1,64GB)  | 0       | 0       | 253908+253908     | 2968750kB (2,8GB)

Observability: what you will (and won't) see

EXPLAIN is your friend

EXPLAIN (ANALYZE, BUFFERS) exposes Memory Usage, Buckets:, and Batches: in the Hash node, along with temp block I/O. Batches > 1 is a near-certain sign of spilling. [postgresql.org], [thoughtbot.com]

Query Store / pg_stat_statements limitations

Azure Database for PostgreSQL – Flexible Server Query Store aggregates runtime and (optionally) wait stats over time windows and stores them in azure_sys, with views under query_store.*. It's fantastic for finding which queries chew CPU/I/O or wait, but it doesn't report per-query transient memory usage (e.g., "how many MB did that hash table peak at?"); you can estimate it by reviewing the temporary blocks. [learn.microsoft.com]

Under the hood, what you do get—whether via Query Store or vanilla PostgreSQL pg_stat_statements—are cumulative counters like shared_blks_read, shared_blks_hit, temp_blks_read, temp_blks_written, timings, etc.
Those help confirm buffer/temp activity, yet no direct hash-table memory metric exists. Combine them with EXPLAIN and server metrics to triangulate. [postgresql.org]

Tip (Azure Flexible Server)

Enable Query Store in Server parameters via pg_qs.query_capture_mode and (optionally) wait sampling via pgms_wait_sampling.query_capture_mode, then use query_store.qs_view to correlate temp block usage and execution times across intervals. [learn.microsoft.com]

Typical OOM symptom in logs

In extreme skew with concurrent executions, you may encounter:

ERROR: out of memory
DETAIL: Failed on request of size 32800 in memory context "HashBatchContext".

This is a classic signature of hash join memory pressure. [postgresql.org], [thisistheway.wiki]

What to do about it (mitigations & best practices)

Don't force Hash Join unless required

If you used planner hints (e.g., pg_hint_plan) or GUCs (Grand Unified Configuration parameters) to force Hash Join, remove them and let the planner re-evaluate. (If you must hint, be aware pg_hint_plan is a third-party extension and not available in all environments.) [pg-hint-pl...thedocs.io], [pg-hint-pl...thedocs.io]

Fix skew / cardinality at the source

Re-model data to avoid low-NDV (Number of Distinct Values in a column) joins (e.g., pre-aggregate, filter earlier, or exclude degenerate keys). Ensure statistics are current so the planner estimates are realistic. (Skew awareness is limited; poor estimates lead to risky sizing.) [postgresql.org]

Pick a safer join strategy when appropriate

If distribution is highly skewed, Merge Join (with supporting indexes/sort order) or Nested Loop (for selective probes) might be more memory-predictable. Let the planner choose, or enable alternatives by undoing GUCs that disabled them. [postgresql.org]

Bound memory consciously

Keep work_mem modest for mixed/OLTP workloads; remember it's per operation, per node, per worker.
Adjust hash_mem_multiplier judiciously (introduced in PG13; the default is commonly 2.0 in current versions) if you understand the spill trade-offs. [postgresqlco.nf], [pgpedia.info]

Observe spills and tune iteratively

Use EXPLAIN (ANALYZE, BUFFERS) to see Batches (spills) and Memory Usage; use Query Store/pg_stat_statements to find which queries generate the most temp I/O. Raise work_mem for a session only when justified. [postgresql.org], [postgresql.org]

Parallelism awareness

Each worker can perform its own memory-using operations, and parallel hash join has distinct behavior. If you aren't sure, temporarily disable parallelism to simplify analysis, then re-enable it once you understand the footprint. [postgresql.org]

Validating on Azure Database for PostgreSQL – Flexible Server

The behavior is not Azure-specific, but you can reproduce the same sequence on Flexible Server (e.g., General Purpose). A few notes:

- Confirm/adjust work_mem, hash_mem_multiplier, and enable_* planner toggles as session settings. (Azure exposes standard PostgreSQL parameters.) [learn.microsoft.com]
- Use Query Store to confirm stable shared/temporary block patterns across executions, then use EXPLAIN (ANALYZE, BUFFERS) per query to spot hash table memory footprints.
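When a session-scoped work_mem bump is justified (as noted above), keep it transaction-local so it cannot leak into pooled connections; a sketch using the demo query from this article (the 256MB value is illustrative, not a recommendation):

```sql
-- SET LOCAL reverts automatically at COMMIT/ROLLBACK, so other work
-- in the pool keeps the server defaults.
BEGIN;
SET LOCAL work_mem = '256MB';
SET LOCAL hash_mem_multiplier = 2.0;

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM   table_s s
JOIN   table_h h ON s.column_a = h.column_a;
COMMIT;
```

Compare the Batches and Memory Usage reported before and after the bump to confirm whether the larger budget actually eliminated the spill.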
[learn.microsoft.com], [postgresql.org]

Changing some default parameters

We'll repeat the previous steps in Azure Database for PostgreSQL Flexible Server:

set hash_mem_multiplier=1;
set max_parallel_workers=0;
set max_parallel_workers_per_gather=0;
set enable_parallel_hash=off;
set enable_material=off;
set enable_sort=off;
set pg_hint_plan.debug_print=verbose;
set client_min_messages=notice;
set pg_hint_plan.enable_hint_table=on;

Creating and populating tables

drop table table_s;
create table table_s (column_a text);
insert into table_s values ('30020');
vacuum full table_s;
drop table table_h;
create table table_h(column_a text,column_b text);
INSERT INTO table_h(column_a,column_b) SELECT i::text, i::text FROM generate_series(1, 10000000) AS t(i);
vacuum full table_h;
vacuum full table_s;

Query & Execution plan

explain (analyze,buffers,costs,verbose) SELECT /*+ HashJoin(s h) Leading((s h)) */ COUNT(*) FROM table_s s JOIN table_h h ON s.column_a= h.column_a;

                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=279052.88..279052.89 rows=1 width=8) (actual time=3171.186..3171.191 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=33 read=54023, temp read=135 written=34042
   I/O Timings: shared read=184.869, temp read=0.278 write=333.970
   ->  Hash Join  (cost=279051.84..279052.88 rows=1 width=0) (actual time=3147.288..3171.182 rows=1 loops=1)
         Hash Cond: (s.column_a = h.column_a)
         Buffers: shared hit=33 read=54023, temp read=135 written=34042
         I/O Timings: shared read=184.869, temp read=0.278 write=333.970
         ->  Seq Scan on public.table_s s  (cost=0.00..1.01 rows=1 width=32) (actual time=0.315..0.316 rows=1 loops=1)
               Output: s.column_a
               Buffers: shared read=1
               I/O Timings: shared read=0.018
         ->  Hash  (cost=154053.04..154053.04 rows=9999904 width=7) (actual time=3109.278..3109.279 rows=10000000 loops=1)
               Output: h.column_a
               Buckets: 131072  Batches: 256  Memory Usage: 2551kB
               Buffers: shared hit=32 read=54022, temp written=33786
               I/O Timings: shared read=184.851, temp write=332.059
               ->  Seq Scan on public.table_h h  (cost=0.00..154053.04 rows=9999904 width=7) (actual time=0.019..1258.472 rows=10000000 loops=1)
                     Output: h.column_a
                     Buffers: shared hit=32 read=54022
                     I/O Timings: shared read=184.851
 Query Identifier: 5636209387670245929
 Planning:
   Buffers: shared hit=37
 Planning Time: 0.575 ms
 Execution Time: 3171.375 ms
(26 rows)

Findings (3)

In Azure Database for PostgreSQL Flexible Server, when the data is fully distributed with high cardinality, the hash table takes only 2551kB of memory usage (work_mem), with shared hit/read=54056.

Skew it to LOW cardinality

As we did previously, we change column_a so it has only one distinct value across all rows in table_h:

update table_h set column_a='30020', column_b='30020';
vacuum full table_h;

In this case we force the join method with pg_hint_plan:

explain (analyze,buffers,costs,verbose) SELECT /*+ HashJoin(s h) Leading((s h)) */ COUNT(*) FROM table_s s JOIN table_h h ON s.column_a= h.column_a;

                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=279056.04..279056.05 rows=1 width=8) (actual time=4397.556..4397.560 rows=1 loops=1)
   Output: count(*)
   Buffers: shared hit=2 read=54055, temp read=63480 written=63480
   I/O Timings: shared read=89.396, temp read=90.377 write=300.290
   ->  Hash Join  (cost=279055.00..279056.03 rows=1 width=0) (actual time=3271.145..3987.154 rows=10000000 loops=1)
         Hash Cond: (s.column_a = h.column_a)
         Buffers: shared hit=2 read=54055, temp read=63480 written=63480
         I/O Timings: shared read=89.396, temp read=90.377 write=300.290
         ->  Seq Scan on public.table_s s  (cost=0.00..1.01 rows=1 width=32) (actual time=0.006..0.008 rows=1 loops=1)
               Output: s.column_a
               Buffers: shared hit=1
         ->  Hash  (cost=154055.00..154055.00 rows=10000000 width=7) (actual time=1958.729..1958.731 rows=10000000 loops=1)
               Output: h.column_a
               Buckets: 262144 (originally 262144)  Batches: 256 (originally 128)  Memory Usage: 371094kB
               Buffers: shared read=54055, temp written=31738
               I/O Timings: shared read=89.396, temp write=149.076
               ->  Seq Scan on public.table_h h  (cost=0.00..154055.00 rows=10000000 width=7) (actual time=0.159..789.449 rows=10000000 loops=1)
                     Output: h.column_a
                     Buffers: shared read=54055
                     I/O Timings: shared read=89.396
 Query Identifier: 8893575855188549861
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.157 ms
 Execution Time: 4414.268 ms
(25 rows)

NumRows table_h | Shared read/hit | Dirtied | Written | Temp read/written | Memory Usage (work_mem)
10M             | 54056           | 0       |         | 63480+63480       | 371094kB
20M             | 108110          | 0       |         | 126956+126956     | 742188kB
80M             | 432434         | 0       | 0       | 253908+253908     | 2968750kB

We observe the same numbers as in our Docker installation. See the extension docs for installation/usage, and the hint table for cases where you want to force a specific join method. [pg-hint-pl...thedocs.io], [pg-hint-pl...thedocs.io]

FAQ

Q: I set work_mem = 4MB. Why did my Hash Join report ~371MB Memory Usage?
A: Hash joins can use up to hash_mem_multiplier × work_mem per hash table, and skew can cause large per-bucket chains. Multiple nodes/workers multiply usage. work_mem is not a global hard cap. [postgresqlco.nf], [pgpedia.info]

Q: How do I know if a Hash Join spilled to disk?
A: In EXPLAIN (ANALYZE), the Hash node shows Batches: N. N > 1 indicates partitioning and temp I/O; you'll also see temp_blks_read/written in buffers and Temp I/O timings. [postgresql.org], [thoughtbot.com]

Q: Can Query Store tell me per-query memory consumption?
A: Not directly. It gives time-sliced runtime and wait stats (plus temp/shared block counters via underlying stats), but no "peak MB used by this hash table" metric. Use EXPLAIN and server metrics. [learn.microsoft.com], [postgresql.org]

Q: I hit "Failed on request … in HashBatchContext." What's that?
A: That's an OOM error raised by the executor while allocating memory.
Reduce skew, avoid forced hash joins, or review per-query memory and concurrency. [postgresql.org]

Further reading

- Server parameters & memory (official docs): guidance on work_mem, shared_buffers, parallelism. [postgresql.org]
- Hash joins under the hood: deep dives into buckets, batches, and memory footprints. [postgrespro.com], [pgcon.org]
- hash_mem_multiplier: history and defaults by version. [pgpedia.info]
- EXPLAIN primer: how to read Hash node details, Batches, Memory Usage. [postgresql.org], [thoughtbot.com]
- Query Store (Azure Flexible Server): enable, query, and interpret. [learn.microsoft.com]

Ready-to-use mitigation checklist (DBA quick wins)

- Remove join hints/GUC overrides that force Hash Join; re-plan. [pg-hint-pl...thedocs.io]
- Refresh stats; confirm realistic rowcount/NDV estimates. [postgresql.org]
- Consider alternate join strategies (Merge/Index-Nested-Loop) when skew is high. [postgresql.org]
- Keep work_mem conservative for OLTP; consider session-scoped bumps only for specific analytic queries. [postgresql.org]
- Tune hash_mem_multiplier carefully, and only after understanding spill patterns. [postgresqlco.nf]
- Use EXPLAIN (ANALYZE, BUFFERS) to verify Batches and Memory Usage. [postgresql.org]
- Use Query Store/pg_stat_statements to find heavy temp/shared I/O offenders over time.

Nasdaq builds thoughtfully designed AI for board governance with PostgreSQL on Azure
Authored by: Charles Feddersen, Partner Director of Product Management for PostgreSQL at Microsoft, and Mohsin Shafqat, Senior Manager, Software Engineering at Nasdaq

When people think of Nasdaq, they usually think of markets, trading floors, and financial data moving at extraordinary speed. But behind the scenes, Nasdaq also plays an equally critical role in how boards of directors govern, deliberate, and make decisions. Nasdaq Boardvantage® is the company's governance platform, used by more than 4,400 organizations worldwide, including nearly half of the Fortune 100. It's where directors review board books, collaborate in an environment designed with robust security, and prepare for meetings that often involve some of the most sensitive information a company has.

In recent years, Nasdaq set out to modernize Nasdaq Boardvantage with AI, without compromising security and reliability. That journey was featured in a Microsoft Ignite session, "Nasdaq Boardvantage: AI-Driven Governance on PostgreSQL and Foundry." It offers a practical look at how Azure Database for PostgreSQL can support AI-driven applications where precision, isolation, and data control are non-negotiable.

Introducing AI where trust is everything

Board governance isn't a typical productivity workload. Board packets can run 400 to 600 pages, meeting minutes are legal records, and any AI-generated insight must be confined to a customer's own data. "Our customers trust us with some of their most strategic, sensitive data," said Mohsin Shafqat, Senior Manager of Software Development at Nasdaq. That trust meant tackling several core challenges upfront, including:

- How do you minimize AI hallucinations in a governance context?
- How do you guarantee tenant isolation at scale?
- How do you keep data regional across a global customer base?
A cloud foundation built for governance

Before adding intelligence, Nasdaq decided to re-architect Nasdaq Boardvantage on Microsoft Azure, using Azure Kubernetes Service (AKS) to run containerized, multi-tenant workloads with strong isolation boundaries. Microsoft Foundry provides the managed foundation for deploying, governing, and operating AI models across this architecture, adding consistency, security, and control as intelligence is introduced.

At the data layer, Azure Database for PostgreSQL and Azure Database for MySQL became the backbone for governance data. PostgreSQL, in particular, plays a central role in managing structured governance information alongside vector embeddings that support AI-driven features. Together, these services give Nasdaq the performance, security, and operational control required for a highly regulated, multi-tenant environment, while still moving quickly. Key architectural choices included:

- Tenant isolation by design, with separate databases and storage
- Regional deployments to align with data residency requirements
- High availability and managed operations, so teams could focus on product innovation instead of infrastructure maintenance

PostgreSQL and pgvector: Powering context-aware AI

With that foundation in place, Nasdaq was ready to carefully introduce AI. One of the first AI capabilities was intelligent document summarization. Board materials that once took hours to review could now be condensed into concise, contextually accurate summaries. Under the hood, this required more than just calling an LLM. Nasdaq uses pgvector, natively supported in Azure Database for PostgreSQL, to store and query embeddings generated from board documents. This allows the platform to perform hybrid searches that combine traditional SQL queries with vector similarity to retrieve the most relevant context before sending anything to a language model.
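A minimal sketch of this hybrid-retrieval pattern in SQL (illustrative only: the table, column names, embedding dimension, and tenant model here are assumptions for the sketch, not Nasdaq's actual schema):

```sql
-- Hypothetical per-tenant chunk store using pgvector.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
    id         bigserial PRIMARY KEY,
    tenant_id  uuid NOT NULL,          -- isolation key per customer
    chunk      text NOT NULL,          -- extracted document fragment
    embedding  vector(1536)            -- embedding produced upstream
);

-- Hybrid retrieval: a relational filter (tenant scoping) combined with
-- vector similarity; <=> is pgvector's cosine-distance operator.
SELECT chunk
FROM   doc_chunks
WHERE  tenant_id = $1                  -- bind: the calling tenant
ORDER  BY embedding <=> $2             -- bind: the query embedding
LIMIT  5;
```

The key design point is that the tenant filter and the similarity ranking run in the same SQL statement, so retrieval can never cross an isolation boundary.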
Instead of treating AI as a black box, the team built a pipeline where:

- Documents are processed with Azure Document Intelligence to preserve structure and meaning
- Content is chunked and embedded
- Embeddings are stored in PostgreSQL with pgvector
- Vector similarity searches retrieve precise context for each AI task

Because this runs inside PostgreSQL, the same database benefits from Azure's built-in high availability, security controls, and operational tooling, delivering tangible results: a 25% reduction in overall board preparation time, and internal testing showing 91–97% accuracy for AI-generated summaries and meeting minutes.

From summaries to an AI Board Assistant

With summarization working in production, Nasdaq expanded further. The team is now building an AI-powered Board Assistant that will help directors prepare for upcoming meetings by surfacing trends, risks, and insights from prior discussions. This introduces a new level of scale. Years of board data across thousands of customers translate into millions of embeddings. PostgreSQL continues to anchor this architecture, storing vectors for semantic retrieval while MySQL supports complementary non-vector workloads.

Across Nasdaq Boardvantage, users are advised to always review AI outputs, and no customer data is shared or used to train external models. "We designed AI for governance, not the other way around," Shafqat said. More importantly, customers trust the system because security, isolation, and data control were engineered in from day one.

Looking ahead

Nasdaq's work shows how Azure Database for PostgreSQL can support AI workloads that demand both intelligence and integrity. With PostgreSQL at the core, Nasdaq has built a governance platform that scales globally, respects regulatory boundaries, and introduces AI in a way that feels dependable rather than experimental. What started as a modernization of Nasdaq Boardvantage is now influencing how Nasdaq approaches AI across the enterprise.
To dive deeper into the architecture and hear directly from the engineers behind it, check out these resources:

- Watch the Ignite breakout session for a technical walkthrough of how Nasdaq Boardvantage is built, including PostgreSQL on Azure, pgvector, and Microsoft Foundry in production.
- Read the case study to see how Nasdaq introduced AI into board governance and what changed for directors, administrators, and decision-making.
- Watch the Ignite broadcast for a candid discussion on Azure Database for PostgreSQL, Azure HorizonDB, and what it takes to scale AI-driven governance.

Microsoft at PGConf India 2026
I'm genuinely excited about PGConf India 2026. Over the past few editions, the conference has continued to grow year over year, both in size and in impact, and it has firmly established itself as one of the key events on the global PostgreSQL calendar. That momentum was very evident again in the depth, breadth, and overall quality of the program for PGConf India 2026. Microsoft is proud to be a diamond sponsor for the conference again this year.

At Microsoft, we continue our contributions to the upstream PostgreSQL open-source project, as well as serving our customers with our Postgres managed service offerings: both Azure Database for PostgreSQL and our newest Postgres offering, Azure HorizonDB. On the open-source front, Microsoft had 540 commits in PG18, including major features like Asynchronous IO. We're also excited to grow our Postgres open-source contributors team, and very happy to welcome Noah Misch to our team. Noah is a Postgres committer who has deep expertise in PostgreSQL security and is focused on correctness and reliability in PostgreSQL's core.

Microsoft at PGConf India 2026: Highlights from Our Speakers

PGConf India has several tracks, all of which have some great talks I am looking forward to. First, the plug. 😊 Microsoft has some amazing talks this year, with 8 different talks spread across all the tracks:

- Postgres on Azure: Scaling with Azure HorizonDB, AI, and Developer Workflows, by Aditya Duvuri & Divya Bhargov
- Resizing shared buffer pool in a running PostgreSQL server: important, yet impossible, by Ashutosh Bapat
- Ten Postgres Hacker Journeys—and what they teach us, by Claire Giordano
- How Postgres can leverage disk bandwidth for better TPS, by Nikhil Chawla
- AWSM FSM! Free Space Maps Decoded, by Nikhil Sontakke
- Journey of developing a Performance Optimization Feature in PostgreSQL, by Rahila Syed
- Build Agentic AI with Semantic Kernel and Graph RAG on PostgreSQL, by Shriram Muthukrishnan & Palak Chaturvedi
- All things Postgres @ Microsoft (2026 edition), by Sumedh Pathak

Claire is an amazing speaker and has done a lot of work over the last several years documenting and understanding PostgreSQL committers and hackers. Her talk will definitely have some key insights and nuggets of information. Rahila's talk will go in depth on performance optimization features and how best to test and benchmark them, covering all the tools and tricks she has used as part of the feature development. This should be a must-see talk for anyone doing performance work.

Diving Deep: Case Studies & Technical Tracks

One of the tracks I'm really excited about is the Case Study track. I see these as similar to 'Experience' papers in academia. An experience paper documents what actually happened when applying a technique or system in the real world: what worked, what didn't, and why. One of the talks I'm looking forward to is 'Operating Postgres Logical Replication at Massive Scale' by Sai Srirampur from ClickHouse. Logical replication is an extremely useful tool, and I'm curious to learn more about pitfalls and lessons learnt when running it at large scale. Another interesting one I'm curious to hear is 'Understanding the importance of the commit log through a database corruption' by Amit Kumar Singh from EDB.

The Database Engine Developers track allows us to go deep into the PostgreSQL code base and get a better understanding of how PostgreSQL is built. Even if you are not a database developer, this track is useful for understanding how and why PostgreSQL does things, helping you be a better user of the database.
With the rise of larger machines and more memory available in the cloud, newer memory architectures/tiers, and serverless product offerings, there is a lot of deep-dive interest in PostgreSQL's memory architecture. There are some great talks focused on this area, which should be must-sees for anyone interested in this topic:

- Resizing shared buffer pool in a running PostgreSQL server: important, yet impossible, by Ashutosh Bapat from Microsoft
- From Disk to Data: Exploring PostgreSQL's Buffer Management, by Lalit Choudhary from PurnaBIT
- Beyond shared_buffers: On-Demand Memory in Modern PostgreSQL, by Vaibhav Popat from Google

Finally, the Database Administration and Application Developer tracks have some really great content as well. They cover a wide range of topics, from PII data, HA/DR, and query tuning to connection pooling and understanding conflict detection and resolution.

PostgreSQL in India: A Community Effort Worth Celebrating

Conferences like these are a rich source of information, dramatically increasing my personal understanding of the product and the ecosystem. They are also a great way to meet other practitioners in the space and connect with people in the industry. For people in Bangalore, another great option is the PostgreSQL Bangalore Meetup, and I'm super happy that Microsoft was able to join the ranks of other companies by hosting the eighth iteration of this meetup.

Finally, I would be remiss in not mentioning the hard work done by the PGConf India organizing team, including Pavan Deolasee, Ashish Mehra, Nikhil Sontakke, Hari Kiran, and Rushabh Lathia, who are making all of this happen. Also, a big shout-out to the PGConf India Program Committee (Amul Sul, Dilip Kumar, Marc Linster, Thomas Munro, Vigneshwaran C) for putting together an amazing set of talks.

I look forward to meeting all of you in Bangalore! Be sure to drop by the Microsoft booth to say hello (and to snag a free pair of our famous socks).
I'd love to learn more about how you're using Postgres.

AlphaLife Sciences powers regulatory-compliant AI workflows with PostgreSQL on Azure
by: Maxim Lukiyanov, PhD, Principal PM Manager and Sharon Chen, CEO and Founder at AlphaLife Sciences In life sciences, every document is deeply interconnected and highly regulated. Each clinical trial, regulatory submission, safety report, or protocol amendment is expected to stand up to rigorous audit. For AlphaLife Sciences, that challenge became an opportunity to rethink how AI could support expert human judgment. At Microsoft Ignite, AlphaLife Sciences CEO and Founder Sharon Chen shared how her team is building an AI-powered content authoring platform on top of Azure Database for PostgreSQL, designed specifically for the demands of regulated life sciences workflows. She also explained why the team is excited about Azure HorizonDB as a new PostgreSQL service that is built to meet the needs of modern enterprise workloads. This post explores how AlphaLife Sciences uses PostgreSQL as more than a data store. It’s a semantic foundation for compliant, auditable AI agents. Bringing AI into regulated workflows Life sciences organizations are under constant pressure. R&D pipelines are growing and patent windows are shrinking. A single clinical study report can take six months or more to complete, involving multiple teams and hundreds of source documents. Building efficiency into these processes is critical, but only if it doesn’t compromise accuracy, traceability, or compliance. That’s where many AI solutions fall short. Generating text is one thing, but generating verifiable, version-controlled, regulation-aware content is another. 
AlphaLife Sciences needed agents that could:

- Work across massive volumes of structured and unstructured data (Word, PDF, Excel, PowerPoint)
- Maintain full traceability from generated content back to source documents
- Support audits, amendments, and regulatory review
- Minimize hallucinations in a zero-tolerance environment
- Integrate naturally into the tools writers already use

Bringing data, search, and AI together in one system

At the core of AlphaLife Sciences' platform is Azure Database for PostgreSQL. The team chose it for flexibility, extensibility, and for how well it supports modern AI workloads. Instead of stitching together separate systems for SQL queries, vector search, text indexing, and metadata tracking, AlphaLife Sciences consolidated everything into PostgreSQL.

One of its flagship use cases is clinical trial protocol authoring, a process that typically involves:

- Designing trial objectives and endpoints
- Pulling references from previous studies
- Writing and revising hundreds of pages of structured content
- Managing multiple rounds of amendments and regulatory feedback

With AI agents backed by PostgreSQL, that workflow changes dramatically. When a writer generates a protocol section, the system can automatically retrieve relevant references from a centralized document pool, using semantic search rather than manual lookup. Writers select the sources they want, apply rules or prompts, and let AI draft the section, complete with citations tied back to the original documents. Reviewers can inspect the source, adjust the output, or insert it directly into the document.

For protocol amendments, the platform allows teams to upload inputs (Word or Excel), analyze which sections are affected, and generate structured suggestions. Changes are clearly highlighted, compared against previous versions, and summarized in amendment tables.

AI agents that respect the rules

A recurring theme in Chen's talk was restraint. "We don't just need AI that can write," she said.
"We need intelligent agents that understand data structures, follow regulatory laws, and manage version control." This is where PostgreSQL-backed AI agents shine. By grounding AI behavior in structured schemas, controlled access, and auditable records, automation works hand-in-hand with human experts. AI accelerates first drafts, consistency checks, discrepancy detection, and cross-document analysis, but final accountability stays firmly with professionals. In some cases, the time to complete processes has been reduced by more than 50%.

Azure Database for PostgreSQL has become more than a database for AlphaLife Sciences. It's a semantic knowledge base that supports:

- Structured and unstructured data
- Vector similarity search
- Metadata-driven traceability
- Compliance, security, and auditability
- AI agents operating safely inside enterprise constraints

By grounding AI agents directly in the database, reasoning, retrieval, and generation all operate against the same governed source of truth. "AI agents are not here to replace human beings," said Chen. "They extend structured, compliant, and auditable thinking."

What's next for AlphaLife Sciences with PostgreSQL on Azure

Looking ahead, Chen shared her excitement about Azure HorizonDB and the capabilities it brings to PostgreSQL on Azure. Features like in-database AI model management, semantic operators for classification and summarization, and faster vector search with DiskANN align closely with AlphaLife Sciences' needs as their platform continues to scale. "We're extremely happy to see the launch of Azure HorizonDB and the more powerful tools coming with it," Chen said. "By putting everything together in PostgreSQL, we don't have to rely on different systems for vector search, text indexing, or SQL queries. Everything happens in one streamlined system.
The code becomes cleaner, efficiency improves, and the AI agents perform much more elegantly."

Learn more

AlphaLife Sciences' journey was featured during the Microsoft Ignite session "The Blueprint for Intelligent AI Agents Backed by PostgreSQL." Watch the session to learn more and see a demo of how Azure Database for PostgreSQL transforms the protocol and protocol amendment process. When AI is anchored in a strong PostgreSQL foundation, innovation and compliance don't have to compete; they can reinforce each other.
We just dropped the January 2026 recap for Azure Database for PostgreSQL, and this one's all about developer velocity, resiliency, and production-ready upgrades.

January 2026 Recap: Azure Database for PostgreSQL

• PostgreSQL 18 support via Terraform (create + upgrade)
• Premium SSD v2 (Preview) with HA, replicas, Geo-DR & MVU
• Latest PostgreSQL minor version releases
• Ansible module GA with latest REST API features
• Zone-redundant HA now configurable via Azure CLI
• SDKs GA (Go, Java, JS, .NET, Python) on stable APIs

Read the full January 2026 recap here and see what's new (and what's coming): January 2026 Recap: Azure Database for PostgreSQL