PostgreSQL is an excellent database for a wide range of workloads. Traditionally, the only problem with Postgres is that it is limited to a single machine. If you are using the Azure Database for PostgreSQL managed service, that limitation no longer applies to you because you can use the built-in Hyperscale (Citus) option—to transparently shard and scale out both transactional and analytical workloads. And Hyperscale (Citus) just keeps getting better and better.
The heart of Hyperscale (Citus) is the open source Citus extension which extends Postgres with distributed database superpowers. Every few months we release a new version of Citus. I’m excited to tell you that the latest release, Citus 10, is now available in preview on Hyperscale (Citus) and comes with spectacular new capabilities:
These new Citus 10 capabilities change what Hyperscale (Citus) can do for you in some fundamental (and useful) ways.
With Citus 10, Hyperscale (Citus) is no longer just about sharding Postgres: you can use the new Citus columnar storage feature to compress large data sets. And Citus is no longer just about multi-node clusters: with Basic Tier in Hyperscale (Citus), you can now shard on a single node to be “scale-out-ready”. Finally, Hyperscale (Citus) is no longer just about transforming Postgres into a distributed database: you can now mix regular (local) Postgres tables and distributed tables in the same Postgres database.
In short, Hyperscale (Citus) in Azure Database for PostgreSQL now empowers you to run Postgres at any scale.
Let’s dive in!
The data sizes of some new Hyperscale (Citus) customers are truly gigantic, which meant we needed a way to lower storage cost and get more out of the hardware. That is why we implemented columnar storage for Citus. Citus Columnar can give you compression ratios of 3x-10x or more, and even greater I/O reductions. The new Citus columnar feature is available in:
The best part: you can use columnar in Hyperscale (Citus) with or without the Citus scale-out features! More details about columnar table storage can be found in our Hyperscale (Citus) docs.
Our Citus engineering team has a long history with columnar storage in PostgreSQL, as we originally developed the cstore_fdw extension which offered columnar storage via the foreign data wrapper (fdw) API. PostgreSQL 12 introduced “table access methods”, which allows extensions to define custom storage formats in a much more native way.
Citus makes columnar storage available in PostgreSQL via the table access method APIs, which means that you can now create Citus columnar tables by simply adding
USING columnar when creating a table:
CREATE TABLE order_history (…) USING columnar;
If you provision a row-based (“heap”) table that you’d like to later convert to columnar, you can do that too, using the
-- compress a table using columnar storage (indexes are dropped) SELECT alter_table_set_access_method('orders_2019', 'columnar');
When you use Citus columnar storage, you will typically see a 60-90% reduction in data size. In addition, Citus columnar will only read the columns used in the SQL query. This can give dramatic speed ups for I/O bound queries, and a big reduction in storage cost.
cstore_fdw, Citus columnar has a better compression ratio thanks to zstd compression. Citus columnar also supports rollback, streaming replication, archival, and pg_upgrade.
There are still a few limitations with Citus columnar to be aware of: Indexes and update/delete are not yet supported, and it is best to avoid single-row inserts, since compression only works well in batches. We plan to address these limitations in future Citus releases, but you can also avoid them using partitioning.
If you partition time series tables by time, you can use row-based storage for recent partitions to enable single-row, update/delete/upsert and indexes—while using columnar storage to archive data that is no longer changing. To make this easy, we also added a function to compress all your old partitions in one go:
-- compress all partitions older than 7 days CALL alter_old_partitions_set_access_method('order_history', now() – interval '7 days', 'columnar');
This procedure commits after every partition to release locks as quickly as possible. You can use pg_cron to run this new alter function as a nightly compression job.
To learn more, check out Jeff Davis’ blog post: Citus 10 brings columnar compression to Postgres. Jeff also created a video demo, if you’re a more visual person this Citus columnar demo might be a good way to get acquainted.
We often think of Hyperscale (Citus) as “worry-free Postgres”, because Citus takes away the one concern you may have when choosing Postgres as your database: reaching the limits of a single node. However, when you migrate a complex application from Postgres to Hyperscale (Citus), you may need to make some changes to your application to handle restrictions around unique- and foreign key-constraints and joins, since not every PostgreSQL feature has an efficient distributed implementation.
In Azure, the easiest way to scale your application on Postgres without ever facing the cost of migration (and be truly worry-free) is to use Hyperscale (Citus) from day one, when you first build your application. Applications built on Citus are always 100% compatible with regular PostgreSQL, so there is no risk of lock-in. The only downside of starting on Hyperscale (Citus) so far was the cost and complexity of running a distributed database cluster, but this changes in Citus 10. With Citus 10 and the new Basic tier in Hyperscale (Citus), you can now shard your Postgres tables on a single Citus node to make your database “scale-out-ready”.
To get started with Hyperscale (Citus) on a single node, this post about sharding Postgres with Basic tier is a good place to start. Be sure to enable preview features in the Azure portal when provisioning Azure Database for PostgreSQL—and then select the new “Basic Tier” feature that’s available in preview on Hyperscale (Citus). As of today, you can provision Basic tier for $0.27 USD/hour in US East 1. This means that you can try out Hyperscale (Citus) at a much lower price point: about ~8 hours of kicking the tires and you’ll only pay $2-3 USD.
Once connected, you can create your first distributed table by running the following commands:
CREATE TABLE data (key text primary key, value jsonb not null); SELECT create_distributed_table('data', 'key');
create_distributed_table function will divide the table across 32 (hidden) shards that can be moved to new nodes when a single node is no longer sufficient.
You may experience some overhead from distributed query planning, but you will also see benefits from multi-shard queries being parallelized across cores. You can also make distributed, columnar tables to take advantage of both I/O and storage reduction and parallelism.
The biggest advantage of distributing Postgres tables with Basic tier in Hyperscale (Citus) is that your database will be ready to be scaled out using the Citus shard rebalancer.
With the new Basic Tier feature in Hyperscale (Citus) and the shard rebalancer, you can be ready to scale out by distributing your tables. However, distributing tables does involve certain trade-offs, such as extra network round trips when querying shards on worker nodes, and a few unsupported SQL features.
If you have a very large Postgres table and a data-intensive workload (e.g. the frequently-queried part of the table exceeds memory), then the performance gains from distributing the table over multiple nodes with Hyperscale (Citus) will vastly outweigh any downsides. However, if most of your other Postgres tables are small, then you might end up having to make additional changes without much additional benefit.
A simple solution would be to not distribute the smaller tables at all. In most Hyperscale (Citus) deployments, your application connects to a single coordinator node (which is usually sufficient), and the coordinator is a fully functional PostgreSQL node. That means you could organize your database as follows:
That way, you can scale out CPU, memory, and I/O where you need it. And minimize application changes and other trade-offs where you don’t. To make this model work seamlessly, Citus 10 adds support for 2 important features:
With these new Citus 10 features in Hyperscale (Citus), you can mix and match PostgreSQL tables and Citus tables to get the best of both worlds without having to separate them in your data model.
When you distribute a Postgres table with Hyperscale (Citus), choosing your distribution column is an important step, since the distribution column (sometimes called the sharding key) determines which constraints you can create, how (fast) you can join tables, and more.
Citus 10 adds the
alter_distributed_table function so you can change the distribution column, shard count, and co-location of a distributed table. This blog post walks through how when and why to use alter_distributed_table with Hyperscale (Citus).
-- change the distribution column to customer_id SELECT alter_distributed_table('orders', distribution_column := 'customer_id'); -- change the shard count to 120 SELECT alter_distributed_table('orders', shard_count := 120); -- Co-locate with another table SELECT alter_distributed_table('orders', distribution_column := 'product_id', colocate_with := 'products');
alter_distributed_table reshuffles the data between the worker nodes, which means it is fast and works well on very large tables. We expect this makes it much easier to experiment with distributing your tables without having to reload your data.
You can also use the
alter_distributed_table function in production (it’s fully transactional!), but you do need to (1) make sure that you have enough disk space to store the table several times, and (2) make sure that your application can tolerate blocking all writes to the table for a while.
And there’s more!
citus_tablesview shows Citus tables and their distribution column, total size, and access method. The
citus_shardsview shows the names, locations, and sizes of individual shards.
If you are as excited as we are and want to play with these new Citus 10 features, doing so is now easier than ever.
# run PostgreSQL with Citus on port 5500 docker run -d --name citus -p 5500:5432 -e POSTGRES_PASSWORD=mypassword citusdata/citus # connect psql -U postgres -d postgres -h localhost -p 5500
You can also check out our lovely new Getting started with Citus page for more resources on how to get started—my teammates have curated some good learning tools there, whether your preferred learning mode is reading, watching, or doing.
And since the Citus 10 open source release rolled out, we’ve also published a bunch of deep-dive blog posts (plus a demo!) about the spectacular new capabilities in Citus 10:
Finally, a big thank you to all of you who use Hyperscale (Citus) to scale out Postgres and who have taken the time to give feedback and be part of our journey. If you’ve filed issues on GitHub, submitted PRs, talked to our @citusdata or @AzureDBPostgres team on Twitter, signed up for our monthly Citus technical newsletter, or joined our Citus Public community Q&A... well, thank you. And please, keep the feedback coming. You can always reach our product team via the Ask Azure DB for PostgreSQL email address too.
We can’t wait to see what you do with the new Citus 10 features in Hyperscale (Citus)!
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.