Moving data to and from Postgres databases is one of the most common operations if you are working with large amounts of data. And it can also be one of the most time-consuming. Imagine you need to collect a massive amount of telemetry and create terabytes of data, and you move all the data to a Postgres database every day. Or maybe you want to change the underlying database and you need to move years of data in your database to another database.
In all these scenarios, and others, you would need to move a lot of data between databases, and you would want to move it fast.
If you’re going to move data into or out of PostgreSQL on Azure then you should know about Azure Data Factory—whether you’re:
then you should know about Azure Data Factory. Azure Data Factory (ADF) is a managed service on Azure that enables you to copy data between databases easily.
And we’ve made ADF faster when used with Postgres on Azure. Both for reads, and for writes.
In this blog post, you’ll learn how you can take advantage of the latest improvements to Azure Data Factory for Postgres to manage your data movement into and out of Postgres so much faster. Specifically, you’ll learn about:
Oh, and the ADF speedups you’ll learn about in this post apply to all of the Azure Database for PostgreSQL deployment options, including:
With Azure Data Factory (ADF), you connect your datasets to the ADF and create and schedule activities and pipelines. One of the most common use cases of Azure Data Factory is to move data between datasets. A dataset could be a database table, a file, and more. Checkout the list of datasets you can use ADF to move data between. As you can see in the list, you can move data:
Moving data between datasets is useful for:
As you probably realize by now, you can move data between different (or same) datasets using Azure Data Factory (ADF). You can use Azure Data Factory’s user interface for connecting databases, files, and tables to your data factory.
Some key terminology to know:
To exemplify, if you are trying to move data from a CSV file to an Azure Database for PostgreSQL table, Azure Data Factory will:
Note that, in the above example, CSV is the source dataset, and the Postgres table is the sink dataset.
If you are hyped to use Azure Data Factory for your data moving jobs, you are in for a treat. Because now you can move data from and to Postgres so much faster—as a result of some improvements we made in 2021.
Azure Data Factory (ADF) moves data in three steps:
Let’s explore the improvements to both reads from PostgreSQL and writes to PostgreSQL on Azure. It’s worth mentioning that the reading and writing functionality are two separate and independent modules.
“Postgres on Azure” in the list above means both:
If you are familiar with the Citus extension to Postgres, you know we love to make things go fast by parallelizing. Dividing a task into smaller parts and running the smaller parts in parallel usually makes the task finish faster.
When you choose the option to read from a Postgres table in parallel, Azure Data Factory will automatically divide your table into ADF ‘partitions’ and read them in parallel.
Terminology Tip #1: Do not confuse the ADF partitions term with the Postgres partitions term you’re probably already familiar with. Postgres partitions are created with a query like:
CREATE TABLE partition_table PARTITION OF parent_table …
Azure Data Factory now offers 2 new features, both of which enable you to do parallel reads for Postgres tables. The 2 new ADF features are:
Terminology Tip #2: “parallel read” = “parallel copy” These 2 features that many of us describe as providing “parallel read” features are actually called ‘parallel copy’ in the ADF documentation. However, in this blog post, to reduce confusion with the improvements in the writing part (adding COPY
option alongside with INSERT
) we will use the term ‘parallel read’.
The first way to do parallel read with Azure Data Factory is called Dynamic Range. With the Dynamic Range method, you can have ADF divide your source table into ADF partitions (not Postgres partitions, but ADF partitions) based on a column you choose. This column is called the partition column.
Let’s review an example.
Let’s say you have the source_table
that you want to move data from, you choose column a
as the partition column, and let’s say the min and the max values for column a
in your table are 1 and 1000. To partition this table into, let’s say, 5 parts, ADF will create 5 queries like these:
SELECT * FROM source_table WHERE a >= 1 AND a <= 200;
SELECT * FROM source_table WHERE a>= 201 AND a <= 400;
…
SELECT * FROM source_table WHERE a>= 801 AND a <= 1000;
As I mentioned, you can choose the ADF partition column. Some advice about choosing a good ADF partition column:
a
were between 1 and 200, the partitioning would be useless. Why? Because most of the table would still be read in a single SELECT
query, with the extra effort for creating the partitions.
The datatypes supported for partition columns are integer, smallint, bigint, serial, smallserial, bigserial, timestamp without time zone, time without time zone, timestamp with time zone, and date.
Finally, if your source table has a primary key, you can skip choosing a partition key and ADF will find and use the primary key. (Only if the primary key is of one of the data types listed above)
In addition, you can choose the min
and max
values on the partition column you want to move between. For example, if you have a table with a timestamp column and every night you want to move the last day’s data, you can choose the beginning and end of the last day as min and max values. ADF will only move the rows you specified. Again, if you don’t choose min and max, the smallest and biggest values are automatically found in the table.
If you want to have more control over the partition queries ADF uses, you can also write your own queries. You can write any query you want. You just need to include some special keywords so ADF can ingest some values for parallel reading. If you want more information on how to create a query with the special syntax check out the documentation for Azure Database for PostgreSQL on ADF, where hooks like?AdfRangePartitionColumnName
, ?AdfRangePartitionLowbound
, and ?AdfRangePartitionUpbound
are explained.
If your table is a Postgres partitioned table, you can use the new ADF option called Physical Partitions of Table for reading in parallel.
REMINDER of Terminology Tip #1: Remember ADF also uses the term “partitions” to refer to parts of a source table that are read in parallel. Here the term “physical partitions” refers to the actual Postgres partitions.
If you had initially created your source table with a query like:
CREATE TABLE source_table (a INT, b INT …) PARTITION BY …
your table will have physical (Postgres) partitions.
If you created your table with a query like the one above and want to parallelize reading from your table based on the Postgres partitions, you can use the new ADF Physical Partitions of Table option. With this option your source table will be divided into parts with queries like these:
SELECT * FROM source_table_partition_1;
SELECT * FROM source_table_partition_2;
…
If you want to move data with ADF from only a subset of the Postgres partitions, you can specify exactly which Postgres partitions you want to move data from. If you don’t, data from all the partitions will be move to the sink table.
Like with the Dynamic Range option in ADF, with the Physical Partitions of Table option you can choose to create your own query if you want more control. Again, you will need to use the special syntax and add some keywords to your query so ADF can ingest the values for parallel reading. If you want to learn more about how to write your own query check out the ADF documentation for Azure Database for PostgreSQL where the hooks such as ?AdfTabularPartitionName
are explained.
When moving data, Azure Data Factory first reads data from source table and after converting the data to the sink data format, writes into the sink dataset. Our second improvement to ADF for PostgreSQL scenarios is in the writing part.
Previously, when you ran a data move activity with ADF to move data to a Postgres table, ADF always used a multi-row INSERT
. And while INSERT
is the recommended technique for real-time data ingestion in Postgres, for bulk data ingestion, Postgres 'COPY' is more performant than multi-row 'INSERT'.
So, we added COPY as a new method for writing data to Postgres tables on ADF. To use COPY
for data ingestion, well, you don’t have to do anything because we went ahead and made COPY
the default in ADF. If you just go and create a data move activity, you will use COPY
by default. This simple change to use COPY
as the default in ADF for Postgres has made a significant impact on the performance.
And of course, to validate the benefits of our recent improvements to ADF, we ran some benchmarks. Here are some quick performance results. For comparison we:
INSERT
(not COPY
) and single read (not parallel read)COPY
and single readCOPY
and parallel read
We’ve done the tests in 3 steps so the improvements in both COPY
and parallel read can be viewed separately in the three different rows in the table.
ADF settings used for this performance test:
Azure Database for PostgreSQL configuration:
Data size:
INSERT
approach—which was the previous, slower technique—was taking too long we couldn’t wait for it to finish.
Azure Data Factory provides metrics for the data move activity: we used throughput and total time as they are most relevant in this scenario.
NOTE: we did not do extensive performance tuning, after all this was meant to be a quick test. Tuning the ADF parallelism setting, tuning PostgreSQL settings on source/sink can lead to further improved performance.
Throughput | Total Time | Performance Improvement to Total Time | |
Before: Single Read and INSERT | 848.57 KB/s | 1:09:00 hours | |
Single Read and COPY | 3.096 MB/s | 18:27 minutes | 3.7X faster |
Parallel Read and COPY | 21.593 MB/s | 2:38 minutes | 26.2X faster |
As can be seen in the table, when INSERT
is changed to COPY
there is around a 3.7x difference, and then when single read is changed to parallel read, there is 26.2x improvement in total time.
Although we used the Flexible Server option in Azure Database for PostgreSQL above for the tests, the good news is that the read and write improvements in ADF apply to all of the deployment options for Azure Database for PostgreSQL, including Single Server, Flexible Server, and Hyperscale (Citus)—and even PostgreSQL databases you host on Azure.
In this blog post you learned about how to use Azure Data Factory’s new features to read in parallel based on a column and on Postgres partitions, plus how to use COPY
to write data on Azure Data Factory. You also know how parallel read and COPY
command makes moving data from and to Postgres over 26x faster.
If you want to try and see for yourself, here are some docs links to get you started:
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.