Architecting petabyte-scale analytics by scaling out Postgres on Azure with the Citus extension
Published Oct 31 2019 01:12 PM 33K Views

Update in October 2022: Citus has a new home on Azure! The Citus database is now available as a managed service in the cloud as Azure Cosmos DB for PostgreSQL. Azure documentation links have been updated throughout the post, to point to the new Azure docs.



How do you know if the next update to your software is ready for hundreds of millions of customers? It starts with data. And when it comes to Windows, we’re talking lots of data. The Windows team measures the quality of new software builds by scrutinizing 20,000 diagnostic metrics based on data flowing in from 1.2 billion Windows devices. At the same time, the team evaluates feedback from Microsoft engineers who are using pre-release versions of Windows updates.


At Microsoft, the Windows diagnostic metrics are displayed on a real-time analytics dashboard called “Release Quality View” (RQV), which helps the internal “ship-room” team assess the quality of the customer experience before each new Windows update is released. Given the importance of Windows for Microsoft’s customers, the RQV analytics dashboard is a critical tool for Windows engineers, program managers, and execs. 


Not surprisingly, the real-time analytics dashboard is heavily used. “We have hundreds of active users every day, and thousands every month,” says Min Wei, principal engineer at Microsoft. “Delivering a new operating system update is like producing a Broadway show—there are so many people working behind the scenes to prepare. The RQV analytics dashboard helps ensure the curtain goes up on time—and that we deliver what the audience wants.”



Figure 1: The internal RQV analytics dashboard at Microsoft helps the Windows team to assess the quality of upcoming Windows releases. The RQV dashboard tracks 20,000 diagnostic and quality metrics, and currently supports over 10 million queries per day, with hundreds of concurrent users. The RQV analytics dashboard relies on Postgres—along with the Citus extension to Postgres to scale out horizontally—and is deployed on Microsoft Azure. 


Two days to “cook” the data was unacceptable


The Windows Data and Intelligence team had been using a Lambda architecture for the Online Analytical Processing (OLAP) cubing workloads that powered the RQV analytics dashboard. “OLAP cubing” is the term for the multi-dimensional analysis that data scientists and engineers often use to derive insights from large swaths of data. (In the context of analytics with a database like Postgres, some developers refer to “cubing” as “rollups” or “pre-aggregations” or “aggregate tables.” There are several different names but the idea is the same: if you know that your user will query the data in a certain way, it makes sense to do the calculations in advance.)


Unfortunately, the architecture they were using for OLAP cubing workloads had encountered a significant obstacle. “Our OLAP cubing was running out of steam,” says Wei. “We were generating hundreds of millions of cubes every day through a MapReduce job. But as we added dimensions, we encountered the ‘curse of dimensionality’. We ran out of memory on some of the nodes, and the cubing jobs took longer and longer to finish.” 


Cubing delays (delays creating the rollups and report tables) meant that the Windows team wasn’t able to get the insights they needed when they needed them. “It was taking 24 to 48 hours to ‘cook’—or prepare—the data, which could leave the analytics reporting two to three days behind,” says Wei. “Those kinds of delays made it difficult to meet tight Windows release deadlines.”


Needed: a way to analyze data in dynamic response to users needs—what is called “dynamic OLAP cubing” 


With diagnostic metrics growing at an exponential rate, Wei realized he couldn’t simply deploy more servers and more hardware to solve the cubing problem. The growth of the Windows user base and all the metrics being generated would quickly outpace the resources of any new servers that he provisioned. 


Instead of adding more servers, Wei decided to investigate dynamic cubing. “We were creating hundreds of millions of OLAP cubes, but often users only need a small subset of the OLAP cubes for the dashboard,” says Wei. “I saw an opportunity to address the performance issue by dynamically generating only the cubes people need, as they need them.”


Wei realized he would need a new type of database to deliver the scale and performance required by dynamic OLAP cubing. And Wei realized that this new database should also help increase data durability. “In two years, people should be able to look at the data model and understand it,” says Wei. 


Finding a database for an analytics dashboard that could address the ‘curse of dimensionality’


Wei began to assess different databases, including open source databases such as Apache Pinot, Apache Kylin, and Apache Druid. But these databases had limited SQL support and could not support extensible data types. Nor could they support the concurrent query performance needed to drive the large-scale RQV analytics dashboard. And they didn’t permit updates to fix incorrect data. 


Wei recognized that a relational database management system (RDBMS) could overcome the drawbacks of several of these databases. Most importantly, an RDBMS could deliver the performance needed for dynamic cubing while also meeting his data durability requirements. “I became determined to build or buy an RDBMS,” says Wei. 


As Wei began to search for the right RDBMS, he realized that a horizontal, distributed relational database made the most sense. “We’re not going to see individual servers with 10 TB of memory any time soon,” says Wei. “A distributed memory architecture will be the best choice for at least the next decade.”


Wei also wanted an RDBMS to do OLAP cubing in a way that would help compress fast-growing data volumes through rollup-style aggregation of data. “Windows diagnostics generate tens of billions of rows of data every day. With an append-only data store, you are accumulating all of those rows,” says Wei. “I wanted to use rollups, to implement a sort of ‘semantic compression’ that could reduce the amount of data in the database.”


Wei also required indexing and transactional “upsert” (insert and update) capabilities. In addition, he wanted support for extensible data types, including JSON, key-value types,  HyperLogLog (HLL), and t-digest. Wei considered building a brand new distributed database, but he realized that implementing it would be a long-term project. “I would need to build a database engine on top—and for one person with a deadline of one year, that approach wasn’t feasible,” says Wei. 


“Distributed PostgreSQL is a game changer.”


Min Wei, Principal Engineer at Microsoft


Wei discovered the open source Citus extension to Postgres by listening to a recorded conference talk on his drive home


As Wei’s search for the right database continued, he discovered an open source extension to Postgres (called Citus) while listening to the video on scaling out PostgreSQL. Principal Engineer Marco Slot from Citus Data had given the presentation at the dotScale 2017 conference in Paris. “I learned that Citus was a distributed SQL platform that could meet my requirements,” says Wei. “I realized I really needed to check out Citus.” 


Wei began building a prototype to validate the Citus extension to Postgres. “I was learning Postgres as I built the prototype. At that point, I had little operational experience with Postgres, and there was no one to ask. I was playing with hundreds of different parameters and doing AB testing,” says Wei. “And as I explored Citus support for partitioning, index-only scans, and partial covering index, I saw that I could solve several problems using Citus.”


The fact that Postgres and the Citus extension are open source was important. “In our Data and Intelligence team, data is the most important asset,” says Wei. “And with a popular database platform like Postgres and Citus, I could trust that the code would outlive the professional lives of the current developers. This gave me confidence that the architecture would survive the test of time.”


Impressed with the early results, he transitioned the project from a proof of concept into an official project. “I started talking to the Microsoft Azure teams, and I told them that we needed a distributed RDBMS,” says Wei. “A few months later, I was happy to learn that Microsoft had acquired Citus Data.”



Figure 2: The Windows team’s ship-room decisions employ an architecture that relies on the open source Postgres database, and uses Citus to scale out Postgres horizontally. The “VeniceDB” (code name) system captures over 20K types of metrics from over 1.2B devices; 1.5 PB (petabytes) of Azure blob storage for the raw Windows event data and staging queue; and the Citus database clusters include 54 nodes with a total of 3,456 cores, 27 TB of memory, and 1.6 PB (yes, petabytes) of Premium SSD storage.


Building a Citus database cluster on Microsoft Azure


Wei chose Postgres for his new database—and is using Citus to scale out Postgres horizontally on Microsoft Azure, leveraging Ubuntu/Linux VMs. (The code name for Wei’s system is “VeniceDB.”) The VeniceDB system contains 2 Citus database clusters that both serve the RQV Analytics Dashboard and associated APIs.


The Citus clusters include 32 nodes and 22 nodes (total of 54 nodes), and together the clusters have a total of 3,456 cores and 27 TB of memory. Wei uses 1.6 PB (yes, petabytes) of Azure Premium SSD Managed Disks for the database. And Wei uses another 1.5 PB of Azure blob storage for the staging queue and raw Windows event data.


Wei and his team began using Postgres and Citus in production and on Azure before Citus was integrated as a built-in feature in the Azure Database for PostgreSQL managed service. As a result, Wei and his team had to manage the Postgres database and the Citus extension by themselves, from the very beginning.


But the good news is that today, if you’re looking for ways to scale out Postgres in a cloud database, the Azure Database for PostgreSQL managed service now includes Citus as a built-in deployment option, called Hyperscale (Citus)


In addition to the Citus open source extension to Postgres, Wei also uses several other Postgres extensions. For example, he uses ‘postgresql-topn’ to calculate top occurring items in the data; ‘pg_cron’ to schedule database tasks; ‘intarray’ for manipulating null-free integer arrays; and ‘postgresql-hll’, which offers HyperLogLog data structures as a native data type. 


And Wei’s team has upgraded Postgres fairly often, too. The “VeniceDB” system launched using Postgres 10; when this post was first published in October 2019, the production Citus database clusters were running Postgres 11.5 and Wei had begun testing with Postgres 12. As of an update to this post in April 2021, Wei's 22-node cluster is running Postgres 13.2 and Citus 9.5.4 and Wei's 32-node cluster is being tested with Postgres 13.2 and the Citus 10 release. Says Wei, “I can see and feel the Postgres and Citus improvements over the past two years.”


“We can support hundreds of concurrent users and more than 10 million queries every day….With Citus, response times for 95 percent of queries are less than 1 second across the board.”

Min Wei, Principal Engineer at Microsoft


Powering more than 10M queries/day on the analytics dashboard—with ultra-low latency response times


By using the Citus extension to Postgres to scale out Postgres, Wei’s team now has the I/O performance they need to ingest the large data sets generated by Windows diagnostic tests and user feedback. “My RQV analytics use case is very write-intensive,” says Wei. “We ingest 8 to 10 TB of data into our Postgres database every day, and because of regulatory requirements, we delete about an equal amount each day.”


The system is also very read-intensive. While other databases might hold 5 TB of “dead” data that is rarely read, Wei’s users query up to 200 TB of diagnostic data continuously, as they consume the data on the RQV analytics dashboard. 


“We can support hundreds of concurrent users and more than 10 million queries every day,” says Wei. “Dashboards might refresh every 20 to 25 minutes. As a result, the database must read between 150 and 200 TB of data—hundreds of times, daily. With Citus, response times for 75 percent of queries are less than 150 milliseconds. For some types of queries, the P75 response times are less than 90 milliseconds. And response times for 95% of queries are less than 1 second on both Citus clusters.” 


Scaling out Postgres on Azure with the Citus extension—to power dynamic OLAP cubing


By using Citus to scale out Postgres across multiple nodes, Wei’s team is able to harness more memory and compute, as well as to benefit from the massive parallelism that Citus provides. The result: Wei now has the performance and response times required to do the dynamic cubing and create the rollups. Oh, and with Citus, Wei’s team also has the ability to compute the cubes/rollups incrementally as new data comes in.


“Distributed PostgreSQL is a game changer,” says Wei. “With Citus on Azure, we can create OLAP cubes on demand.”


It no longer takes two to three days for new data to be “cooked” and added to the dashboard. “We can make sure our RQV analytics dashboard users have the latest data about Windows quality right at their fingertips,” says Wei. “As a result, they can strengthen product quality and meet their tight release deadlines.”


“Citus is a petabyte-scale Postgres database that can drive a large, executive analytics dashboard and support hundreds of concurrent queries. Using the combination of Citus and Postgres on Azure was the right approach for us at Microsoft—and I believe it will be the right approach for many other organizations.”


Min Wei, Principal Engineer at Microsoft


Getting control over your data with massively parallel transactions


With engineers and other team members relying on the RQV analytics dashboard to make important ship-room decisions for Windows releases, the data in the database must be accurate. But what happens when bad data is accidentally entered into the dashboard’s underlying database? Wei told this story in his talk at PostgresOpen 2018, on lessons learned building his real-time analytics database:


“Sometimes you ingest the wrong data, and it can be very costly to fix the error,” says Wei. “In one case, I had swallowed almost two weeks of bad data. There was a column that was just wrong. I was between a rock and a hard place. I knew that dropping the table would take me about 7 days. But then I realized that it’s Postgres, and I can do updates. And because of the distributed nature of Citus, I could do updates fast. So I picked a weekend, because query times are low on weekends. And with Postgres and Citus, I was able to update 26 billion rows in just 10 hours. I was really impressed. Other distributed databases don’t give you anywhere near this ability to update the data and improve its accuracy.”


Supporting a petabyte-scale database with Citus on Azure


Wei wholeheartedly recommends using Citus to scale out Postgres. “Some organizations build data lakes when they need a large-scale data warehouse, but instead I think they should take a serious look at Citus,” says Wei. “Citus is a petabyte-scale Postgres database that can drive a large, executive analytics dashboard and support hundreds of concurrent queries. Using the combination of Citus and Postgres on Azure was the right approach for us at Microsoft—and I believe it will be the right approach for many other organizations.” 


And now that Citus is available as a built-in deployment option in Azure Database for PostgreSQL—called Hyperscale (Citus)—you too can scale out Postgres horizontally on Azure, just like Min Wei and the Windows Data and Intelligence team have done, but without having to spend your valuable engineering cycles managing your database. 



Big thank you :smile: to Min Wei, Teresa Giacomini, Greg Thomas, Marco Slot, Jose Miguel Parrella, Craig Kerstiens, Samay Sharma, and Umur Cubukcu for their reviews, edits, and collaboration in creating this story.

Version history
Last update:
‎Jan 30 2023 05:28 PM
Updated by: