Moving Postgres data 26X faster with Azure Data Factory
Published May 17, 2022

Moving data to and from Postgres databases is one of the most common operations if you are working with large amounts of data. And it can also be one of the most time-consuming. Imagine you need to collect a massive amount of telemetry and create terabytes of data, and you move all the data to a Postgres database every day. Or maybe you want to change the underlying database and you need to move years of data in your database to another database.

 

In all these scenarios, and others, you would need to move a lot of data between databases, and you would want to move it fast.

 

If you’re going to move data into or out of PostgreSQL on Azure, then you should know about Azure Data Factory, whether you’re:

  • using the Azure Database for PostgreSQL managed service, or
  • managing a PostgreSQL database yourself on Azure.

Azure Data Factory (ADF) is a managed service on Azure that enables you to copy data between databases easily.

 

And we’ve made ADF faster when used with Postgres on Azure. Both for reads, and for writes.

 

In this blog post, you’ll learn how you can take advantage of the latest improvements to Azure Data Factory for Postgres to make your data movement into and out of Postgres so much faster. Specifically, you’ll learn about:

  • Postgres use cases for Azure Data Factory
  • 2 new “parallel read” features for Postgres: Dynamic Range and Physical Partitions of Table
  • new support for the Postgres COPY command for faster writes
  • benchmark results showing a 26.2X faster data move

Oh, and the ADF speedups you’ll learn about in this post apply to all of the Azure Database for PostgreSQL deployment options, including:

  • Single Server
  • Flexible Server
  • Hyperscale (Citus)


 

Postgres use cases for Azure Data Factory

 

With Azure Data Factory (ADF), you connect your datasets to your data factory and create and schedule activities and pipelines. One of the most common use cases of Azure Data Factory is to move data between datasets. A dataset could be a database table, a file, and more. Check out the list of datasets you can use ADF to move data between. As you can see in the list, you can move data:

 

  • From Azure Blob Storage to PostgreSQL
  • From PostgreSQL to a CSV file
  • From an on-prem PostgreSQL database to Azure Database for PostgreSQL
  • and many many more…

 

Moving data between datasets is useful for:

  • Database migration: If you decide you want to use a different Postgres database service, for example Hyperscale (Citus)—which is a distributed Postgres database service on Azure—you can use ADF to migrate your data from another database to a Hyperscale (Citus) cluster in Azure Database for PostgreSQL.
  • Scheduled analytics: If you have a PostgreSQL production database for your application and you need to move new data to another PostgreSQL database on a daily basis—perhaps for running analytics queries—then Azure Data Factory gives you the tools to schedule the data move activities between datasets on Azure.
  • No-code ingestion pipelines: Maybe one component of your application dumps data (like logs or events) into a file on Azure Blob Storage (in any format, perhaps CSV or Parquet), and you want to build an application to slice and dice that data to get rich insights. With ADF you can create a data movement pipeline to move your data into Postgres, without writing a single line of code, so your data insight application (such as Power BI or perhaps a custom application) can consume the data.

 

As you probably realize by now, you can move data between different (or the same kinds of) datasets using Azure Data Factory (ADF). You can use Azure Data Factory’s user interface to connect databases, files, and tables to your data factory.

 

Some key terminology to know:

  • Source: When you run a data move activity, ADF will go to your source dataset—the dataset you want to move data from—and read the data from your source tables.
  • Sink: The dataset you are trying to move data to. When moving data, ADF will convert the data to your sink dataset’s format—and then ADF will insert the converted data into the sink dataset.

Figure 1: To move data from a CSV file to an Azure Database for PostgreSQL server with ADF, Azure Data Factory will (1) read from the source dataset, which in this case is a CSV file, (2) convert the data and create a query to ingest the data, and (3) run the query on the Azure Database for PostgreSQL server to write the data into the sink dataset.

 

For example, if you are moving data from a CSV file to an Azure Database for PostgreSQL table, Azure Data Factory will:

  • read the data from the CSV file
  • convert the comma-separated values into a query that ingests the data into PostgreSQL
  • run the query on your Azure Database for PostgreSQL server

 

Note that, in the above example, the CSV file is the source dataset, and the Postgres table is the sink dataset.
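To make steps 2 and 3 a bit more concrete, here is a rough sketch of the kind of conversion that happens. The table, columns, and values below are hypothetical, and the actual SQL that ADF generates may differ:

-- a line from the source CSV file:
--   42,2022-05-16 10:00:00,click
-- conceptually becomes a row written into the sink table, for example via:
INSERT INTO events (user_id, event_time, event_type)
VALUES (42, '2022-05-16 10:00:00', 'click');
-- (as described later in this post, ADF can now also use COPY instead of INSERT for this step)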

 

If you are hyped to use Azure Data Factory for your data moving jobs, you are in for a treat: you can now move data to and from Postgres so much faster, as a result of some improvements we made in 2021.

 

Faster reads and writes with PostgreSQL and Azure Data Factory

 

Azure Data Factory (ADF) moves data in three steps:

  1. read from source
  2. convert data format
  3. write to sink

 

Let’s explore the improvements to both reads from PostgreSQL and writes to PostgreSQL on Azure. It’s worth mentioning that the read and write functionality is implemented as two separate and independent modules.

  • Source = Postgres on Azure: If you’re using ADF to move data from PostgreSQL on Azure to another database, you will benefit from the “read from Postgres” improvements.
  • Sink = Postgres on Azure: Similarly, if you’re moving data from another database into PostgreSQL, you will benefit from the “write to Postgres” improvements.
  • Source and sink are both Postgres on Azure: If you’re moving data from a PostgreSQL table to another PostgreSQL table on Azure, you’ll benefit from speedups in both read and write, which will make you much happier.

 

“Postgres on Azure” in the list above means both:

  • Azure Database for PostgreSQL managed services
  • any Postgres database you might host on Azure.

 

Speeding up reads in ADF for PostgreSQL: Overview of 2 new “parallel read” features

 

If you are familiar with the Citus extension to Postgres, you know we love to make things go fast by parallelizing. Dividing a task into smaller parts and running the smaller parts in parallel usually makes the task finish faster.

 

When you choose the option to read from a Postgres table in parallel, Azure Data Factory will automatically divide your table into ADF ‘partitions’ and read them in parallel.

 

Terminology Tip #1: Do not confuse the ADF term “partitions” with the Postgres partitions you’re probably already familiar with. Postgres partitions are created with a query like:

 

CREATE TABLE partition_table PARTITION OF parent_table …

 

Azure Data Factory now offers 2 new features, both of which enable you to do parallel reads for Postgres tables. The 2 new ADF features are:

  1. Dynamic Range, where the Postgres table is divided into ADF partitions, based on a column
  2. Physical Partitions of Table, where ADF reads the existing Postgres partitions (aka “physical partitions”) of the table in parallel

 

Terminology Tip #2: “parallel read” = “parallel copy”. These 2 features, which many of us describe as providing “parallel read”, are actually called ‘parallel copy’ in the ADF documentation. However, in this blog post, to reduce confusion with the improvements to the writing part (adding a COPY option alongside INSERT), we will use the term ‘parallel read’.

 

New Dynamic Range feature in ADF to enable parallel reads

 

The first way to do parallel read with Azure Data Factory is called Dynamic Range. With the Dynamic Range method, you can have ADF divide your source table into ADF partitions (not Postgres partitions, but ADF partitions) based on a column you choose. This column is called the partition column.

 

Let’s review an example.

 

Let’s say you have a source_table that you want to move data from, you choose column a as the partition column, and the min and max values for column a in your table are 1 and 1000. To partition this table into, say, 5 parts, ADF will create 5 queries like these:

 

SELECT * FROM source_table WHERE a >= 1 AND a <= 200;
SELECT * FROM source_table WHERE a >= 201 AND a <= 400;
…
SELECT * FROM source_table WHERE a >= 801 AND a <= 1000;

 

As I mentioned, you can choose the ADF partition column. Some advice about choosing a good ADF partition column:

 

  • A good ADF partition column gives roughly the same number of rows in each ADF partition. In the above example, if most of the values of the partition column a were between 1 and 200, the partitioning would be useless. Why? Because most of the table would still be read in a single SELECT query, with the extra effort of creating the partitions on top.
  • The ADF Dynamic Range partition column should have values that scatter nicely, without many hotspots or sparse regions. That way, no ADF partition takes much longer to move than the others, which results in a faster data move in total. (See the sketch after this list for a quick way to check this.)
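For example, one quick way to sanity-check a candidate partition column is to bucket its values and count the rows per bucket; roughly equal counts suggest a good Dynamic Range column. This is just an illustrative query you can run yourself, using the hypothetical source_table and column a from the example above:

-- split the value range 1..1000 of column a into 5 equal-width buckets
-- and count how many rows fall into each bucket
SELECT width_bucket(a, 1, 1000, 5) AS bucket,
       count(*) AS row_count
FROM source_table
GROUP BY bucket
ORDER BY bucket;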

 

The datatypes supported for partition columns are integer, smallint, bigint, serial, smallserial, bigserial, timestamp without time zone, time without time zone, timestamp with time zone, and date.

 

Finally, if your source table has a primary key, you can skip choosing a partition column and ADF will find and use the primary key (only if the primary key is of one of the data types listed above).

 

Option to choose min and max in ADF

In addition, you can choose the min and max values of the partition column that you want to move data between. For example, if you have a table with a timestamp column and every night you want to move the last day’s data, you can choose the beginning and end of the last day as the min and max values. ADF will only move the rows you specified. Again, if you don’t choose min and max, the smallest and largest values are automatically found in the table.
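As an illustration, with a hypothetical timestamp partition column named created_at, a min of midnight yesterday, a max of midnight today, and 2 ADF partitions, the generated range queries would look roughly like this (the exact boundary handling is up to ADF):

SELECT * FROM source_table
WHERE created_at >= '2022-05-16 00:00:00' AND created_at < '2022-05-16 12:00:00';

SELECT * FROM source_table
WHERE created_at >= '2022-05-16 12:00:00' AND created_at < '2022-05-17 00:00:00';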

 

Option to have more control over ADF’s partition queries

If you want to have more control over the partition queries ADF uses, you can also write your own queries. You can write any query you want; you just need to include some special keywords so ADF can plug in the values it needs for parallel reading. If you want more information on how to create a query with the special syntax, check out the documentation for Azure Database for PostgreSQL on ADF, where hooks like ?AdfRangePartitionColumnName, ?AdfRangePartitionLowbound, and ?AdfRangePartitionUpbound are explained.
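The pattern looks roughly like the sketch below (check the linked documentation for the exact syntax). ADF substitutes the partition column name and the per-partition bounds into the ?Adf… placeholders; the extra filter on column b is a hypothetical addition of your own:

SELECT * FROM source_table
WHERE ?AdfRangePartitionColumnName >= ?AdfRangePartitionLowbound
  AND ?AdfRangePartitionColumnName <= ?AdfRangePartitionUpbound
  AND b > 0;  -- your own additional filter (hypothetical)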

 

Another Parallel Read feature in ADF: using Postgres native partitions

 

If your table is a Postgres partitioned table, you can use the new ADF option called Physical Partitions of Table for reading in parallel.

 

Reminder of Terminology Tip #1: Remember that ADF also uses the term “partitions” to refer to the parts of a source table that are read in parallel. Here, the term “physical partitions” refers to the actual Postgres partitions.

 

If you had initially created your source table with a query like:

 

CREATE TABLE source_table (a INT, b INT …) PARTITION BY …

 

your table will have physical (Postgres) partitions.
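For instance, here is a minimal sketch of a range-partitioned table whose partition names happen to match the queries shown below. The table definition and partition boundaries are hypothetical:

-- parent table partitioned by range on column a
CREATE TABLE source_table (a INT, b INT) PARTITION BY RANGE (a);

-- two physical (Postgres) partitions
CREATE TABLE source_table_partition_1 PARTITION OF source_table
    FOR VALUES FROM (1) TO (501);
CREATE TABLE source_table_partition_2 PARTITION OF source_table
    FOR VALUES FROM (501) TO (1001);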

 

If you created your table with a query like the one above and want to parallelize reading from your table based on the Postgres partitions, you can use the new ADF Physical Partitions of Table option. With this option your source table will be divided into parts with queries like these:

 

SELECT * FROM source_table_partition_1;
SELECT * FROM source_table_partition_2;
…

 

Options to give you more control over which Postgres partitions are moved

If you want to move data with ADF from only a subset of the Postgres partitions, you can specify exactly which Postgres partitions you want to move data from. If you don’t, data from all the partitions will be moved to the sink table.

 

Option to have more control over ADF’s partition queries

Like with the Dynamic Range option in ADF, with the Physical Partitions of Table option you can choose to write your own query if you want more control. Again, you will need to use the special syntax and add some keywords to your query so ADF can plug in the values for parallel reading. If you want to learn more about how to write your own query, check out the ADF documentation for Azure Database for PostgreSQL, where hooks such as ?AdfTabularPartitionName are explained.
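The pattern looks roughly like the sketch below (see the linked documentation for the exact syntax). ADF substitutes each Postgres partition name into the ?AdfTabularPartitionName placeholder; the filter on column b is a hypothetical addition of your own:

SELECT * FROM ?AdfTabularPartitionName
WHERE b > 0;  -- your own additional filter (hypothetical)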

 

Speeding up writes with new ADF support for Postgres COPY

 

When moving data, Azure Data Factory first reads data from the source table and, after converting the data to the sink’s data format, writes it into the sink dataset. Our second improvement to ADF for PostgreSQL scenarios is in the writing part.

 

Previously, when you ran a data move activity with ADF to move data to a Postgres table, ADF always used a multi-row INSERT. And while INSERT is the recommended technique for real-time data ingestion in Postgres, for bulk data ingestion, Postgres 'COPY' is more performant than multi-row 'INSERT'.
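To make the difference concrete, here is a minimal sketch of the two ingestion styles in plain SQL. The table and columns are hypothetical, and this is not the literal SQL that ADF generates:

-- bulk ingestion with COPY: rows are streamed to the server in one statement,
-- with much less parsing and per-row overhead
COPY events (user_id, event_time, event_type) FROM STDIN WITH (FORMAT csv);

-- multi-row INSERT: every batch is a full SQL statement the server must parse and plan
INSERT INTO events (user_id, event_time, event_type)
VALUES (42, '2022-05-16 10:00:00', 'click'),
       (43, '2022-05-16 10:00:01', 'view');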

 

So, we added COPY as a new method for writing data to Postgres tables on ADF. To use COPY for data ingestion, well, you don’t have to do anything, because we went ahead and made COPY the default in ADF. If you just go and create a data move activity, you will use COPY by default. This simple change to use COPY as the default in ADF for Postgres has made a significant impact on performance.

26.2x faster data migration for ADF with Postgres on Azure

 

And of course, to validate the benefits of our recent improvements to ADF, we ran some benchmarks. Here are some quick performance results. For comparison we:

 

  • Created two Azure Database for PostgreSQL servers, source and sink, and put data in the source database
  • Measured performance without using the new ADF features: First moved the data with INSERT (not COPY) and single read (not parallel read)
  • Sped up the writes: Then with COPY and single read
  • Sped up both writes and reads: And finally, with COPY and parallel read

 

We did the tests in 3 steps so the improvements from COPY and from parallel read can be viewed separately in the three rows of the table below.

 

ADF settings used for this performance test:

  • We used the Dynamic Range option for parallel read.
  • We set parallelism factor to 32 for data movement.
  • The source and sink datasets were in the same region

Azure Database for PostgreSQL configuration:

Data size:

  • We moved 100M rows, about 5 GB in size, for this test.
  • Side note: we also performed a second test with data up to 1 TB. The results with 1 TB of data looked similar, but the single read and INSERT approach—which was the previous, slower technique—was taking so long that we didn’t wait for it to finish. :lol:

 

Azure Data Factory provides metrics for the data move activity: we used throughput and total time as they are most relevant in this scenario.

 

NOTE: we did not do extensive performance tuning; after all, this was meant to be a quick test. Tuning the ADF parallelism setting and tuning the PostgreSQL settings on the source/sink can lead to further improved performance.

 

  Scenario                          Throughput     Total Time       Performance Improvement to Total Time
  Before: Single Read and INSERT    848.57 KB/s    1:09:00 hours    (baseline)
  Single Read and COPY              3.096 MB/s     18:27 minutes    3.7X faster
  Parallel Read and COPY            21.593 MB/s    2:38 minutes     26.2X faster

 

As can be seen in the table, changing INSERT to COPY gives around a 3.7X improvement in total time (69 minutes down to about 18.5 minutes), and switching from single read to parallel read on top of COPY brings the total improvement to 26.2X (69 minutes down to about 2.6 minutes).

 

Although we used the Flexible Server option in Azure Database for PostgreSQL above for the tests, the good news is that the read and write improvements in ADF apply to all of the deployment options for Azure Database for PostgreSQL, including Single Server, Flexible Server, and Hyperscale (Citus)—and even PostgreSQL databases you host on Azure.


If you need to move data to or from Postgres on Azure, check out the faster ADF

 

In this blog post you learned how to use Azure Data Factory’s new features to read in parallel, either based on a column or based on Postgres partitions, plus how to use COPY to write data with Azure Data Factory. You also saw how parallel read and the COPY command make moving data to and from Postgres over 26X faster.

 

If you want to try it and see for yourself, the Azure Data Factory documentation for Azure Database for PostgreSQL is a good place to get started.

 
