Moving data into and out of Postgres databases is one of the most common operations when you work with large amounts of data. And it can also be one of the most time-consuming. Imagine you collect a massive amount of telemetry, creating terabytes of data, and you move all of that data into a Postgres database every day. Or maybe you want to change the underlying database and need to move years of accumulated data to another database.
In all these scenarios, and others, you would need to move a lot of data between databases, and you would want to move it fast.
If you’re going to move data into or out of PostgreSQL on Azure, you should know about Azure Data Factory—whether you’re:
- using the Azure Database for PostgreSQL managed service, or
- managing a PostgreSQL database yourself on Azure.
Azure Data Factory (ADF) is a managed service on Azure that enables you to copy data between databases easily.
And we’ve made ADF faster when used with Postgres on Azure. Both for reads, and for writes.
In this blog post, you’ll learn how you can take advantage of the latest improvements to Azure Data Factory for Postgres to manage your data movement into and out of Postgres so much faster. Specifically, you’ll learn about:
- Postgres use cases for Azure Data Factory
- Faster reads and writes with PostgreSQL and Azure Data Factory
- Speeding up reads in ADF for PostgreSQL: Overview of 2 new “parallel read” features
- New Dynamic Range feature in ADF to enable parallel reads
- Another Parallel Read feature in ADF: using Postgres native partitions
- Speeding up writes with new ADF support for Postgres COPY
Oh, and the ADF speedups you’ll learn about in this post apply to all of the Azure Database for PostgreSQL deployment options, including:
- Single Server
- Flexible Server
- Hyperscale (Citus)
Postgres use cases for Azure Data Factory
With Azure Data Factory (ADF), you connect your datasets to ADF and create and schedule activities and pipelines. One of the most common use cases of Azure Data Factory is to move data between datasets. A dataset could be a database table, a file, and more. Check out the list of datasets you can use ADF to move data between. As you can see in the list, you can move data:
- From Azure Blob Storage to PostgreSQL
- From PostgreSQL to a CSV file
- From an on-prem PostgreSQL database to Azure Database for PostgreSQL
- and many many more…
Moving data between datasets is useful for:
- Database migration: If you decide you want to use a different Postgres database service, for example Hyperscale (Citus)—a distributed Postgres database service on Azure—you can use ADF to migrate your data from another database to a Hyperscale (Citus) cluster in Azure Database for PostgreSQL.
- Scheduled analytics: If you have a PostgreSQL production database for your application and you need to move new data to another PostgreSQL database on a daily basis—perhaps for running analytics queries—then Azure Data Factory gives you the tools to schedule the data move activities between datasets on Azure.
- No Code Ingestion Pipelines: Maybe one component of your application dumps data (like logs or events) into a file on Azure Blob Storage (any format, perhaps csv or parquet). And you want to build an application to slice and dice your data, to get rich insights on your data. With ADF you can create a data movement pipeline to move your data into Postgres, without writing a single line of code, so your data insight application (such as PowerBI or perhaps a custom application) can consume the data.
As you probably realize by now, you can move data between different (or the same) kinds of datasets using Azure Data Factory (ADF). You can use Azure Data Factory’s user interface to connect databases, files, and tables to your data factory.
Some key terminology to know:
- Source: When you run a data move activity, ADF will go to your source dataset—the dataset you want to move data from—and read the data from your source tables.
- Sink: Sink is the dataset you are trying to move data to. When moving data, ADF will convert the data to your sink dataset’s format—and then ADF will insert the converted data into the sink dataset.
Figure 1: To move data from a CSV file to an Azure Database for PostgreSQL server with ADF, Azure Data Factory will (1) read from the source dataset which in this case is a CSV file, (2) convert the data and create a query to ingest the data, and (3) run the query on the Azure Database for PostgreSQL server to write the data into the sink dataset.
For example, if you are moving data from a CSV file to an Azure Database for PostgreSQL table, Azure Data Factory will:
- read the data from the CSV file
- convert the comma separated values to a query to ingest the data into PostgreSQL
- run the query on your Azure Database for PostgreSQL server
Note that, in the above example, CSV is the source dataset, and the Postgres table is the sink dataset.
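To make this concrete, here is a minimal sketch of the kind of statement ADF could end up running against the sink for a couple of CSV rows. The table and column names are hypothetical, and the exact statement depends on your copy activity settings (as you’ll see later in this post, ADF can write with either a multi-row INSERT or COPY):

-- Hypothetical sink table and rows, for illustration only
INSERT INTO page_views (view_time, user_id, url)
VALUES
  ('2021-06-01 10:00:00', 42, '/home'),
  ('2021-06-01 10:00:05', 57, '/pricing'),
  ('2021-06-01 10:00:09', 42, '/docs');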
If you are hyped to use Azure Data Factory for your data moving jobs, you are in for a treat, because now you can move data to and from Postgres so much faster, thanks to some improvements we made in 2021.
Faster reads and writes with PostgreSQL and Azure Data Factory
Azure Data Factory (ADF) moves data in three steps:
- read from source
- convert data format
- write to sink
Let’s explore the improvements to both reads from PostgreSQL and writes to PostgreSQL on Azure. It’s worth mentioning that reading and writing are handled by two separate and independent modules.
- Source = Postgres on Azure: If you’re using ADF to move data from PostgreSQL on Azure to another database, you will benefit from the “read from Postgres” improvements.
- Sink = Postgres on Azure: Similarly, if you’re moving data from another database into PostgreSQL, you will benefit from the “write to Postgres” improvements.
- Source and sink are both Postgres on Azure: If you’re moving data from a PostgreSQL table to another PostgreSQL table on Azure, you’ll benefit from speedups in both read and write, which will make you much happier.
“Postgres on Azure” in the list above means both:
- Azure Database for PostgreSQL managed services
- any Postgres database you might host on Azure.
Speeding up reads in ADF for PostgreSQL: Overview of 2 new “parallel read” features
If you are familiar with the Citus extension to Postgres, you know we love to make things go fast by parallelizing. Dividing a task into smaller parts and running the smaller parts in parallel usually makes the task finish faster.
When you choose the option to read from a Postgres table in parallel, Azure Data Factory will automatically divide your table into ADF ‘partitions’ and read them in parallel.
Terminology Tip #1: Do not confuse the ADF partitions term with the Postgres partitions term you’re probably already familiar with. Postgres partitions are created with a query like:
CREATE TABLE partition_table PARTITION OF parent_table …
Azure Data Factory now offers 2 new features, both of which enable you to do parallel reads for Postgres tables. The 2 new ADF features are:
- Dynamic Range, where the Postgres table is divided into ADF partitions, based on a column
- Physical Partitions of Table, where ADF reads the existing Postgres partitions (aka “physical partitions”) of the table in parallel
Terminology Tip #2: “parallel read” = “parallel copy”. These 2 features that many of us describe as providing “parallel read” are actually called “parallel copy” in the ADF documentation. However, in this blog post, to reduce confusion with the improvements in the writing part (adding a COPY option alongside INSERT), we will use the term “parallel read”.
New Dynamic Range feature in ADF to enable parallel reads
The first way to do parallel read with Azure Data Factory is called Dynamic Range. With the Dynamic Range method, you can have ADF divide your source table into ADF partitions (not Postgres partitions, but ADF partitions) based on a column you choose. This column is called the partition column.
Let’s review an example.
Let’s say you have a table called source_table that you want to move data from, you choose column a as the partition column, and the min and max values for column a in your table are 1 and 1000. To partition this table into, say, 5 parts, ADF will create 5 queries like these:
SELECT * FROM source_table WHERE a >= 1 AND a <= 200;
SELECT * FROM source_table WHERE a >= 201 AND a <= 400;
…
SELECT * FROM source_table WHERE a >= 801 AND a <= 1000;
As I mentioned, you can choose the ADF partition column. Some advice about choosing a good ADF partition column:
- A good ADF partition column puts roughly the same number of rows in each partition. In the above example, if most of the values of the partition column a were between 1 and 200, the partitioning would be useless. Why? Because most of the table would still be read in a single SELECT query, with the extra effort of creating the partitions.
- The ADF Dynamic Range partition column should have values that are spread out evenly, without many hotspots or sparse regions. That way, no ADF partition takes much longer to move than the others, which results in a faster data move overall.
The datatypes supported for partition columns are integer, smallint, bigint, serial, smallserial, bigserial, timestamp without time zone, time without time zone, timestamp with time zone, and date.
Finally, if your source table has a primary key, you can skip choosing a partition column and ADF will find and use the primary key (but only if the primary key is one of the data types listed above).
Option to choose min and max in ADF
In addition, you can choose the min and max values of the partition column that you want to move data between. For example, if you have a table with a timestamp column and every night you want to move the previous day’s data, you can choose the beginning and end of that day as the min and max values. ADF will then only move the rows you specified. Again, if you don’t choose min and max, the smallest and biggest values are automatically found in the table.
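For example, with a hypothetical timestamp column called created_at, min and max set to the bounds of the previous day, and 2 ADF partitions, the parallel reads could look roughly like this (a sketch only; the exact queries ADF generates may differ):

SELECT * FROM source_table WHERE created_at >= '2021-06-01 00:00:00' AND created_at < '2021-06-01 12:00:00';
SELECT * FROM source_table WHERE created_at >= '2021-06-01 12:00:00' AND created_at < '2021-06-02 00:00:00';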
Option to have more control over ADF’s partition queries
If you want to have more control over the partition queries ADF uses, you can also write your own queries. You can write any query you want; you just need to include some special keywords so ADF can plug in the values it needs for parallel reading. If you want more information on how to create a query with this special syntax, check out the documentation for Azure Database for PostgreSQL on ADF, where hooks like ?AdfRangePartitionColumnName, ?AdfRangePartitionLowbound, and ?AdfRangePartitionUpbound are explained.
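For illustration, a custom Dynamic Range query using those hooks could look something like the sketch below; ADF substitutes the partition column name and the per-partition bounds at run time, and the extra filter on column b is a hypothetical addition of your own (check the ADF documentation for the exact syntax it expects):

SELECT * FROM source_table
WHERE ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound
  AND ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound
  AND b > 100;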
Another Parallel Read feature in ADF: using Postgres native partitions
If your table is a Postgres partitioned table, you can use the new ADF option called Physical Partitions of Table for reading in parallel.
REMINDER of Terminology Tip #1: Remember ADF also uses the term “partitions” to refer to parts of a source table that are read in parallel. Here the term “physical partitions” refers to the actual Postgres partitions.
If you had initially created your source table with a query like:
CREATE TABLE source_table (a INT, b INT …) PARTITION BY …
then your table has physical (Postgres) partitions.
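As a fuller (hypothetical) sketch, a range-partitioned source_table and its Postgres partitions might be created like this:

CREATE TABLE source_table (a INT, b INT) PARTITION BY RANGE (a);
CREATE TABLE source_table_partition_1 PARTITION OF source_table FOR VALUES FROM (1) TO (501);
CREATE TABLE source_table_partition_2 PARTITION OF source_table FOR VALUES FROM (501) TO (1001);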
If you created your table with a query like the one above and want to parallelize reading from your table based on the Postgres partitions, you can use the new ADF Physical Partitions of Table option. With this option your source table will be divided into parts with queries like these:
SELECT * FROM source_table_partition_1;
SELECT * FROM source_table_partition_2;
…
Options to give you more control on which Postgres partitions are moved
If you want to move data with ADF from only a subset of the Postgres partitions, you can specify exactly which Postgres partitions you want to move data from. If you don’t, data from all the partitions will be moved to the sink table.
Option to have more control over ADF’s partition queries
Like with the Dynamic Range option in ADF, with the Physical Partitions of Table option you can choose to write your own query if you want more control. Again, you will need to use the special syntax and add some keywords to your query so ADF can plug in the values for parallel reading. If you want to learn more about how to write your own query, check out the ADF documentation for Azure Database for PostgreSQL, where hooks such as ?AdfTabularPartitionName are explained.
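As a rough sketch, a custom query for the Physical Partitions of Table option could look like the one below, where ADF substitutes each Postgres partition’s name for the hook and the WHERE clause is a hypothetical filter of your own (again, see the ADF documentation for the exact expected syntax):

SELECT * FROM ?AdfTabularPartitionName WHERE b > 100;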
Speeding up writes with new ADF support for Postgres COPY
When moving data, Azure Data Factory first reads data from the source table and, after converting the data to the sink’s data format, writes it into the sink dataset. Our second improvement to ADF for PostgreSQL scenarios is in the writing part.
Previously, when you ran a data move activity with ADF to move data to a Postgres table, ADF always used a multi-row INSERT. And while INSERT is the recommended technique for real-time data ingestion in Postgres, for bulk data ingestion, Postgres COPY is more performant than multi-row INSERT.
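To illustrate the difference, here is a minimal sketch of the two write styles against a hypothetical sink_table; ADF generates the equivalent statements for you, so you never have to write these yourself:

-- Multi-row INSERT: fine for real-time ingestion of small batches
INSERT INTO sink_table (a, b) VALUES (1, 'x'), (2, 'y'), (3, 'z');

-- COPY: streams rows in bulk with much less per-row overhead
COPY sink_table (a, b) FROM STDIN WITH (FORMAT csv);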
So, we added COPY as a new method for writing data to Postgres tables in ADF. To use COPY for data ingestion, well, you don’t have to do anything, because we went ahead and made COPY the default in ADF. If you just go and create a data move activity, you will use COPY by default. This simple change to use COPY as the default in ADF for Postgres has made a significant impact on performance.
26.2x faster data migration for ADF with Postgres on Azure
And of course, to validate the benefits of our recent improvements to ADF, we ran some benchmarks. Here are some quick performance results. For comparison we:
- Created two Azure Database for PostgreSQL servers, source and sink, and put data in the source database
- Measured performance without using the new ADF features: first moved the data with INSERT (not COPY) and a single read (not parallel read)
- Sped up the writes: then with COPY and a single read
- Sped up both writes and reads: and finally, with COPY and parallel read
We did the tests in these 3 steps so the improvements from COPY and from parallel read can be viewed separately in the three different rows of the table below.
ADF settings used for this performance test:
- We used the Dynamic Range option for parallel read.
- We set parallelism factor to 32 for data movement.
- The source and sink datasets were in the same region.
Azure Database for PostgreSQL configuration:
- Used the Flexible Server deployment option as source and sink
- 128 GB RAM and 32 vCores (General Purpose D32s_v3 servers)
Data size:
- We moved 100M rows, about 5 GB in size, for this test.
- Side-note: we also performed a second test with data up to 1 TB. The results with 1 TB of data looked similar, but the single read and INSERT approach—which was the previous, slower technique—was taking so long that we couldn’t wait for it to finish.
Azure Data Factory provides metrics for the data move activity: we used throughput and total time as they are most relevant in this scenario.
NOTE: we did not do extensive performance tuning; after all, this was meant to be a quick test. Tuning the ADF parallelism setting and tuning PostgreSQL settings on the source/sink can lead to further performance improvements.
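As a hedged example of the kind of server-side tuning that can help bulk loads on the sink, the settings below are commonly raised for large ingest jobs; the values are illustrative only, and on Azure Database for PostgreSQL you would change them as server parameters in the portal or CLI rather than with ALTER SYSTEM:

-- Illustrative values only; pick values that fit your workload and SKU
ALTER SYSTEM SET max_wal_size = '16GB';            -- fewer, larger checkpoints during the load
ALTER SYSTEM SET checkpoint_timeout = '30min';     -- spread out checkpoint work
ALTER SYSTEM SET maintenance_work_mem = '1GB';     -- faster index builds if you index after loading
SELECT pg_reload_conf();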
| | Throughput | Total Time | Performance Improvement to Total Time |
|---|---|---|---|
| Before: Single Read and INSERT | 848.57 KB/s | 1:09:00 hours | |
| Single Read and COPY | 3.096 MB/s | 18:27 minutes | 3.7X faster |
| Parallel Read and COPY | 21.593 MB/s | 2:38 minutes | 26.2X faster |
As can be seen in the table, switching from INSERT to COPY gives around a 3.7x improvement, and then also switching from a single read to parallel read brings the improvement in total time to 26.2x.
Although we used the Flexible Server option in Azure Database for PostgreSQL above for the tests, the good news is that the read and write improvements in ADF apply to all of the deployment options for Azure Database for PostgreSQL, including Single Server, Flexible Server, and Hyperscale (Citus)—and even PostgreSQL databases you host on Azure.
If you need to move data to or from Postgres on Azure, check out the faster ADF
In this blog post you learned how to use Azure Data Factory’s new features to read in parallel, based either on a column or on Postgres partitions, plus how to use COPY to write data with Azure Data Factory. You also saw how parallel read and the COPY command make moving data to and from Postgres over 26x faster.
If you want to try and see for yourself, here are some docs links to get you started:
- Azure Data Factory Quickstart: Are you new to ADF? Check out this quickstart to create an Azure Data Factory
- ADF Guide to moving data: If you have an Azure Data Factory, you can check this guide to learn how to copy data between data sets
- Docs about parallel copy (aka parallel read) for Postgres on ADF: To learn details of what’s described in this post check out documentation for parallel copy from Azure Database for PostgreSQL
- Docs for Azure Database for PostgreSQL: If you use Postgres and you’re interested in moving to Azure, you can find more information on our documentation for Azure Database for PostgreSQL
- Docs for Hyperscale (Citus): If you’re new to Hyperscale (Citus), our managed service for distributed Postgres on Azure, you can go to the What is Hyperscale (Citus) page in our docs