# Scaling PostgreSQL at OpenAI: Lessons in Reliability, Efficiency, and Innovation
At POSETTE: An Event for Postgres 2025, Bohan Zhang of OpenAI delivered a compelling talk on how OpenAI has scaled Azure Database for PostgreSQL - Flexible Server to meet the demands of one of the world's most advanced AI platforms running at planetary scale. The Postgres team at Microsoft has partnered deeply with OpenAI for years to enhance the service to meet their performance, scale, and availability requirements, and it is great to see how OpenAI is now deploying and depending on Flexible Server as a core component of ChatGPT. Hearing firsthand about their challenges and breakthroughs is a reminder of what's possible when innovation meets real-world needs. This blog post captures the key insights from Bohan's POSETTE talk, paired with how Azure's cloud platform supports innovation at scale.

## PostgreSQL at the Heart of OpenAI

As Bohan shared during his talk, PostgreSQL is the backbone of OpenAI's most critical systems and powers services like ChatGPT, where any disruption is felt immediately. That is why OpenAI has invested deeply in making PostgreSQL resilient and scalable.

## Why Azure Database for PostgreSQL?

OpenAI has long operated PostgreSQL on Azure, initially using a single primary instance without sharding. This architecture worked well until write scalability limits emerged. Azure's managed PostgreSQL service provides the flexibility to scale read replicas, optimize performance, and maintain high availability for global low-latency reads, all without the burden of managing infrastructure. We designed Azure Database for PostgreSQL to support precisely these kinds of high-scale, mission-critical workloads, and OpenAI's use case is a powerful validation of that vision.

## Tackling Write Bottlenecks

PostgreSQL's MVCC (Multi-Version Concurrency Control) design presents challenges for write-heavy workloads, such as index bloat, autovacuum tuning complexity, and version churn. OpenAI addressed this by:

- Reducing unnecessary writes at the application level
- Using lazy writes and controlled backfills to smooth spikes
- Migrating extreme write-heavy workloads with natural sharding keys to other systems

These strategies allowed OpenAI to preserve PostgreSQL's strengths while mitigating its limitations.

## Optimizing Read-Heavy Workloads

With writes offloaded, OpenAI focused on scaling read-heavy workloads. Key optimizations included:

- Offloading read queries to replicas
- Avoiding long-running queries and expensive multi-way joins
- Using PgBouncer for connection pooling, reducing latency from 50 ms to under 5 ms
- Categorizing requests by priority and assigning dedicated read replicas to high-priority traffic

As Bohan noted, "After all the optimization we did, we are super happy with Postgres right now for our read-heavy workloads."

## Schema Governance and Resilience

OpenAI also implemented strict schema governance to avoid full table rewrites and production disruptions. Only lightweight schema changes are allowed, and long-running queries are monitored to prevent them from blocking migrations. To ensure resilience, OpenAI categorized requests by priority and implemented multi-level rate limiting at the application, connection, and query-digest levels. This helped prevent resource exhaustion and service degradation.
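This kind of governance is easy to approximate in your own deployments. The sketch below is not OpenAI's actual tooling; it is a minimal illustration, assuming the psycopg2 driver and hypothetical connection details and table names, of how a migration script can check pg_stat_activity for long-running queries and bound its own lock wait before applying a lightweight schema change.

```python
# Minimal sketch (not OpenAI's tooling): guard a lightweight schema change so it
# never queues behind, or blocks, long-running production queries.
import psycopg2

# Hypothetical DSN; point this at your primary.
conn = psycopg2.connect("host=primary.internal dbname=app user=migrator")
conn.autocommit = True  # run DDL outside an explicit transaction block

with conn.cursor() as cur:
    # 1. Refuse to run while long-running queries are active, since even
    #    metadata-only DDL can get stuck behind them in the lock queue.
    cur.execute(
        """
        SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
        FROM pg_stat_activity
        WHERE state = 'active' AND now() - query_start > interval '1 minute'
        """
    )
    blockers = cur.fetchall()
    if blockers:
        raise SystemExit(f"{len(blockers)} long-running queries active; retry later")

    # 2. Fail fast rather than waiting indefinitely for the table lock.
    cur.execute("SET lock_timeout = '5s'")

    # 3. A lightweight change: adding a nullable column needs no table rewrite.
    cur.execute("ALTER TABLE conversations ADD COLUMN archived_at timestamptz")
```

Combined with the request-priority tiers and multi-level rate limiting Bohan described, checks like this keep routine schema changes from turning into production incidents.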
## Takeaway

OpenAI's journey is a masterclass in operating PostgreSQL at hyper-scale. By offloading writes, scaling read replicas, and enforcing strict schema governance, OpenAI demonstrated that PostgreSQL on Azure meets the demands of cutting-edge AI systems. It also reinforces the value of Azure's managed database services in enabling teams to focus on innovation rather than infrastructure. We're proud of the work we've done to co-innovate with OpenAI and excited to see how other organizations can apply these lessons to their own PostgreSQL deployments.

Check out the on-demand talk "Scaling Postgres to the next level at OpenAI" and many more PostgreSQL community sessions from POSETTE.

# Optimizing Vector Similarity Search on Azure Data Explorer – Performance Update
This post is co-authored by Anshul_Sharma (Senior Program Manager, Microsoft).

This blog is an update of Optimizing Vector Similarity Searches at Scale. We continue to improve the performance of vector similarity search in Azure Data Explorer (Kusto). Below are the new functions and policies that maximize performance, along with the resulting search times.

The following table presents the search time for the top 3 most similar vectors to a supplied vector:

| # of vectors | Total time [sec] |
| --- | --- |
| 25,000 | 0.03 |
| 50,000 | 0.035 |
| 100,000 | 0.047 |
| 200,000 | 0.062 |
| 400,000 | 0.094 |
| 800,000 | 0.125 |
| 1,600,000 | 0.14 |
| 3,200,000 | 0.15 |
| 6,400,000 | 0.19 |
| 12,800,000 | 0.35 |
| 25,600,000 | 0.55 |
| 51,200,000 | 1.1 |
| 102,400,000 | 2.3 |
| 204,800,000 | 3.9 |
| 409,600,000 | 7.6 |

This benchmark was run on a medium-sized Kusto cluster (29 nodes), searching for the most similar vectors in a table of Azure OpenAI embedding vectors. Each vector was generated using the 'text-embedding-ada-002' embedding model and contains 1536 coefficients.

These are the steps to achieve the best similarity search performance:

1. Use series_cosine_similarity(), the new optimized native function, to calculate cosine similarity.
2. Set the encoding of the embeddings column to Vector16, the new 16-bit encoding of the vector coefficients (instead of the default 64-bit encoding).
3. Store the embedding vectors table on all nodes, with at least one shard per processor. This can be achieved by limiting the number of embedding vectors per shard via the ShardEngineMaxRowCount setting of the sharding policy and the RowCountUpperBoundForMerge setting of the merging policy. For example, if our table contains 1M vectors and our Kusto cluster has 20 nodes, each with 16 processors, the table's shards should contain at most 1,000,000 / (20 * 16) = 3,125 rows.

These are the KQL commands to create the empty table and set the required policies and encoding:

```
.create table embedding_vectors(vector_id:long, vector:dynamic)    // more columns can be added

.alter-merge table embedding_vectors policy sharding '{ "ShardEngineMaxRowCount" : 3125 }'

.alter-merge table embedding_vectors policy merge '{ "RowCountUpperBoundForMerge" : 3125 }'

.alter column embedding_vectors.vector policy encoding type = 'Vector16'
```

Now we can ingest the vectors into the table. Here is a typical search query:

```
let searched_vector = repeat(0, 1536);    // placeholder: replace with a real embedding vector
embedding_vectors
| extend similarity = series_cosine_similarity(vector, searched_vector)
| top 10 by similarity desc
```

The current search times make ADX a practical storage platform for embedding vectors in RAG (Retrieval Augmented Generation) scenarios and beyond. We continue to improve vector search performance, so stay tuned!
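As a practical footnote: if you prefer to drive the setup above from a script rather than the Kusto Web UI, the sketch below shows one way to do it. It assumes the azure-kusto-data Python package, Azure CLI authentication, and a placeholder cluster URI and database name; it simply recomputes the per-shard row cap from your cluster shape and runs the same management commands shown above.

```python
# Rough sketch: derive the per-shard row cap and apply the policies shown above.
# Cluster URI and database name are placeholders -- replace with your own.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster_uri = "https://<your-cluster>.<region>.kusto.windows.net"
database = "vectordb"

nodes, cores_per_node, total_vectors = 20, 16, 1_000_000  # the example from this post
max_rows_per_shard = total_vectors // (nodes * cores_per_node)  # 1,000,000 / 320 = 3125

client = KustoClient(KustoConnectionStringBuilder.with_az_cli_authentication(cluster_uri))

commands = [
    ".create table embedding_vectors(vector_id:long, vector:dynamic)",
    f".alter-merge table embedding_vectors policy sharding '{{ \"ShardEngineMaxRowCount\" : {max_rows_per_shard} }}'",
    f".alter-merge table embedding_vectors policy merge '{{ \"RowCountUpperBoundForMerge\" : {max_rows_per_shard} }}'",
    ".alter column embedding_vectors.vector policy encoding type = 'Vector16'",
]
for command in commands:
    client.execute_mgmt(database, command)  # management (dot) commands go through execute_mgmt
```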
# Optimizing Vector Similarity Searches at Scale

This post is co-authored by @adieldar (Principal Data Scientist, Microsoft).

In a previous blog, Azure Data Explorer for Vector Similarity Search, we focused on how Azure Data Explorer (Kusto) is perfectly suited for storing and searching vector embeddings. In this blog, we focus on performance tuning and optimizations for running vector similarity searches at scale. We continue working with the Wikipedia scenario, in which we generate the embeddings of wiki pages using OpenAI and store them in Kusto. We then use the series_cosine_similarity_fl Kusto function to perform similarity searches.

## Demo scenario

## Optimizing for scale

To optimize the cosine similarity search, we need to split the vectors table into many extents that are evenly distributed among all cluster nodes. This can be done by setting a partitioning policy for the embedding table using the .alter-merge policy partitioning command:

````
.alter-merge table WikipediaEmbeddingsTitleD policy partitioning
```
{
  "PartitionKeys": [
    {
      "ColumnName": "vector_id_str",
      "Kind": "Hash",
      "Properties": {
        "Function": "XxHash64",
        "MaxPartitionCount": 2048, // set to the max value to create smaller partitions, and thus a more balanced spread among all cluster nodes
        "Seed": 1,
        "PartitionAssignmentMode": "Uniform"
      }
    }
  ],
  "EffectiveDateTime": "2000-01-01" // set to an old date in order to apply partitioning to existing data
}
```
````

In the example above, we modified the partitioning policy of WikipediaEmbeddingsTitleD. This table was created from WikipediaEmbeddings by projecting the documents' titles and embeddings.

Notes:

- The partitioning process requires a string key with high cardinality, so we also projected the unique vector_id and converted it to a string.
- The best practice is to create an empty table, modify its partitioning policy, and then ingest the data. In that case there is no need to set the old EffectiveDateTime as above.
- It takes some time after data ingestion until the policy is applied.

To test the effect of partitioning, we created multiple tables in a similar manner, containing up to 1M embedding vectors, and tested cosine similarity performance on clusters with 1, 2, 4, 8, and 20 nodes (SKU Standard_E4d_v5). The following table compares search times (in seconds) before and after partitioning:

| # of vectors | 1 node* (no partitioning) | 2 nodes | 4 nodes | 8 nodes | 20 nodes |
| --- | --- | --- | --- | --- | --- |
| 25,000 | 3.4 | 0.95 | 0.67 | 0.57 | 0.51 |
| 50,000 | 6.2 | 1.5 | 0.92 | 0.65 | 0.55 |
| 100,000 | 12.4 | 2.6 | 1.55 | 1 | 0.57 |
| 200,000 | 24.2 | 5.2 | 2.8 | 1.65 | 0.63 |
| 400,000 | 48.5 | 10.3 | 5.4 | 2.95 | 0.87 |
| 800,000 | 96.5 | 20.5 | 10.5 | 6 | 1.2 |
| 1,000,000 | 102 | 26 | 13.3 | 7.2 | 1.4 |

(*) The cluster has 2 nodes, but the tables are stored on a single node; this is our baseline before applying the partitioning policy.

You can see that even on the smallest 2-node cluster the search speed improves by more than a factor of 4, and in general the speed is inversely proportional to the number of nodes. The number of embedding vectors needed for common LLM scenarios (e.g. Retrieval Augmented Generation) rarely exceeds 100K, so with 8 nodes a search can be done in about 1 second.
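The notes above (a string partition key, and setting the policy before ingesting) translate into a short script. The following is a hedged sketch rather than the exact commands used for the benchmark: it assumes the azure-kusto-data Python package, a placeholder cluster URI and database name, and that the source table exposes title and embedding columns under those names.

```python
# Sketch of the recommended flow: create an empty table, attach the hash-partitioning
# policy, then ingest -- projecting vector_id to a string to serve as the partition key.
import json

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

client = KustoClient(
    KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://<your-cluster>.<region>.kusto.windows.net"  # placeholder URI
    )
)
database = "wikipedia"  # placeholder database name

# Same policy as above, serialized to a single line so it fits in a quoted literal.
partitioning_policy = json.dumps({
    "PartitionKeys": [
        {
            "ColumnName": "vector_id_str",
            "Kind": "Hash",
            "Properties": {
                "Function": "XxHash64",
                "MaxPartitionCount": 2048,
                "Seed": 1,
                "PartitionAssignmentMode": "Uniform",
            },
        }
    ]
})

# 1. Empty target table: no EffectiveDateTime override is needed this way.
client.execute_mgmt(
    database,
    ".create table WikipediaEmbeddingsTitleD(vector_id_str:string, title:string, embedding:dynamic)",
)
# 2. Attach the partitioning policy before any data lands.
client.execute_mgmt(
    database,
    f".alter-merge table WikipediaEmbeddingsTitleD policy partitioning '{partitioning_policy}'",
)
# 3. Ingest from the source table, converting the numeric id to the string key.
client.execute_mgmt(
    database,
    ".set-or-append WikipediaEmbeddingsTitleD <| "
    "WikipediaEmbeddings | project vector_id_str = tostring(vector_id), title, embedding",
)
```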
## How can you get started?

If you would like to try this demo, head to the azure_kusto_vector GitHub repository and follow the instructions. The notebook in the repo will allow you to:

- Download precomputed embeddings created by the OpenAI API
- Store the embeddings in ADX
- Convert a raw text query to an embedding with the OpenAI API
- Use ADX to perform cosine similarity search over the stored embeddings (a rough sketch of these last two steps appears at the end of this post)

You can start by:

- Using a KQL Database in Microsoft Fabric by signing up for a free trial: https://aka.ms/try-fabric
- Spinning up your own free Kusto cluster: https://aka.ms/kustofree

We look forward to your feedback and all the exciting things you build with Kusto and vectors!
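As a closing illustration of the last two notebook steps (embedding the question, then searching in ADX), here is a rough sketch. It is not the notebook's exact code: it assumes the openai (v1+) and azure-kusto-data Python packages, an OPENAI_API_KEY in the environment, a placeholder cluster URI and database name, and the embedding_vectors table layout from the previous post.

```python
# Rough sketch of RAG-style retrieval: embed the question with OpenAI, then rank
# stored vectors in ADX by cosine similarity.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from openai import OpenAI

question = "What is the capital of France?"

# 1. Turn the question into a 1536-dimensional embedding (text-embedding-ada-002).
embedding = (
    OpenAI()  # reads OPENAI_API_KEY from the environment
    .embeddings.create(model="text-embedding-ada-002", input=question)
    .data[0]
    .embedding
)

# 2. Rank the stored vectors by cosine similarity, keeping the top 3.
kusto = KustoClient(
    KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://<your-cluster>.<region>.kusto.windows.net"  # placeholder URI
    )
)
query = f"""
let searched_vector = dynamic({embedding});
embedding_vectors
| extend similarity = series_cosine_similarity(vector, searched_vector)
| top 3 by similarity desc
"""
for row in kusto.execute("vectordb", query).primary_results[0]:  # "vectordb" is a placeholder
    print(row["vector_id"], row["similarity"])
```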