Bring your data to Azure Database for PostgreSQL Hyperscale (Citus) using Azure Data Factory
Published Sep 17 2019 08:00 AM

A core value proposition of running your PostgreSQL databases in a fully managed service such as Azure Database for PostgreSQL is richer native integration with other Azure services in the cloud ecosystem. Today, we are happy to announce that Azure Database for PostgreSQL is now a supported sink destination in Azure Data Factory. Azure Data Factory is a fully managed data integration service that allows you to create data-driven workflows in a code-free visual environment in Azure for orchestrating and automating data movement and data transformation. Let’s dig in to see how you can leverage Azure Data Factory for secure one-time data movement, or for running continuous data pipelines that load data into Azure Database for PostgreSQL from disparate data sources running on-premises, in Azure, or in other clouds, for analytics and reporting.

 

Azure Data Factory orchestrates secure data transfer across Azure, on-premises, multi-cloud, and SaaS platforms. PostgreSQL is one of the most loved databases in the developer community, owing to its rich feature set and extensibility. With its strong native JSON and geospatial data processing capabilities, along with Python and R support, PostgreSQL is an attractive data platform for data science. The sink support for Azure Database for PostgreSQL now allows you to bring your data (relational, NoSQL, data lake files) to your favorite open source database for storage, processing, and reporting.

With our single server deployment option, you get a fully managed, secure, community version of PostgreSQL that you can scale up or down on demand. For developers and enterprises looking to scale their workload as data volume grows, we recommend the Hyperscale (Citus) deployment option. Hyperscale (Citus) scales out your data across multiple physical nodes, sharding the underlying data into much smaller pieces while it remains hosted in the familiar open source PostgreSQL database. For customers, native integration of Azure Data Factory with Azure Database for PostgreSQL unlocks many hybrid scenarios and multi-cloud architectures.
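As a quick illustration of that sharding model (not part of the Azure Data Factory walkthrough itself), the target table on a Hyperscale (Citus) server group is typically distributed across the worker nodes before you load data into it. The sketch below uses psycopg2 and the Citus create_distributed_table function; the server name, database, table, and distribution column are hypothetical placeholders.

    # Minimal sketch: create and distribute a target table on Hyperscale (Citus)
    # before loading data into it. All connection values and names are placeholders.
    import psycopg2

    conn = psycopg2.connect(
        host="<server-group>-c.postgres.database.azure.com",  # coordinator node (hypothetical name)
        dbname="citus",
        user="citus",
        password="<password>",
        sslmode="require",
    )
    conn.autocommit = True

    with conn.cursor() as cur:
        # Create the table on the coordinator.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                tenant_id  bigint NOT NULL,
                event_id   bigint NOT NULL,
                payload    jsonb,
                created_at timestamptz DEFAULT now(),
                PRIMARY KEY (tenant_id, event_id)
            );
        """)
        # Shard the table across the worker nodes on tenant_id.
        cur.execute("SELECT create_distributed_table('events', 'tenant_id');")

    conn.close()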

 

 

[Diagram: data sources feeding Azure Database for PostgreSQL through Azure Data Factory]

Building your data pipeline in ADF to load data into PostgreSQL

In Azure Data Factory, you can use the Copy activity to copy data among data stores located on-premises and in the cloud. After you copy the data, you can use other activities to further transform and analyze it, as described in the documentation. If you want to copy data from a file system or a cloud-based object store, the Copy activity can also read from files in these formats:

  • Text
  • JSON
  • Avro
  • ORC
  • Parquet

For up-to-date information on supported data stores and file formats, refer to the service documentation.

 

To copy data from a source to a sink, the service that runs the Copy activity:

  1. Reads data from a source data store.
  2. Performs serialization/deserialization, compression/decompression, column mapping, and so on. It performs these operations based on the configuration of the input dataset, output dataset, and Copy activity.
  3. Writes data to the sink/destination data store.

 

At a high level, to use the Copy activity in Azure Data Factory, you need to:

  1. Create linked services for the source data store and the sink data store - Linked services are much like connection strings; they define the connection information that Data Factory needs to connect to external resources.
  2. Create datasets for the source and sink - Datasets define the schema, table name, or file name, within the linked services created in the previous step, from which the Copy activity reads data or to which it writes data.
  3. Create a pipeline with the Copy activity - In this step, you map the source and sink datasets created in the previous step and perform column mapping to ensure there are no data conversion errors.

For the latest information on the Copy activity, refer to the documentation.

 

You can author and run a Copy activity data pipeline in Azure Data Factory by using any of the supported toolchains, such as the Azure portal UI, the Copy Data tool, the .NET and Python SDKs, Azure PowerShell, the REST API, or an Azure Resource Manager template.

This blog post walks through the Azure portal UI experience. After you create the Azure Data Factory service via the portal by following the steps in the QuickStart documentation, the Let’s get started page appears. On that page, select the pencil icon in the left navigation bar to switch to the Author tab.

 

[Screenshot: the Author tab in the Azure Data Factory UI]

Create linked services

On the Author tab, create the linked services by selecting Connections and then selecting the New button on the toolbar, as described in the documentation and shown below. For the Azure Database for PostgreSQL linked service, you need to provide the server name, port, database name, username, and password for a user with permissions to connect to the Azure Database for PostgreSQL server, perform schema discovery, and read and write data to the table. In this step, you define the connections for both the source and target linked services.

 

[Screenshot: creating the Azure Database for PostgreSQL linked service]
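If you prefer to script this step instead of using the portal, the rough sketch below shows how the same source and sink linked services could be created with the azure-mgmt-datafactory Python SDK. The resource names, the Blob storage source, and the connection string format are placeholders for illustration, and model names and required parameters can vary between SDK versions.

    # Sketch: create the source (Blob storage) and sink (Azure Database for PostgreSQL)
    # linked services programmatically. All names and secrets below are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        AzureBlobStorageLinkedService,
        AzurePostgreSqlLinkedService,
        LinkedServiceResource,
        SecureString,
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, factory = "my-resource-group", "my-data-factory"

    # Source: an Azure Blob storage account.
    blob_ls = LinkedServiceResource(
        properties=AzureBlobStorageLinkedService(
            connection_string=SecureString(
                value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
            )
        )
    )
    adf.linked_services.create_or_update(rg, factory, "SourceBlobLinkedService", blob_ls)

    # Sink: Azure Database for PostgreSQL.
    pg_ls = LinkedServiceResource(
        properties=AzurePostgreSqlLinkedService(
            connection_string=SecureString(
                value="Server=<server>.postgres.database.azure.com;Database=<db>;"
                      "Port=5432;UID=<user>@<server>;Password=<password>"
            )
        )
    )
    adf.linked_services.create_or_update(rg, factory, "PostgreSqlSinkLinkedService", pg_ls)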

Create datasets

In this procedure, you create two datasets, one for the source and one for the target, by selecting the ‘+’ (plus) sign and then selecting Dataset, as described in the documentation. On the Connection tab, you map the dataset to the Azure Database for PostgreSQL linked service defined in the previous step by selecting it from the dropdown. Next, you pick the destination table in the Azure Database for PostgreSQL server from the dropdown, as shown below.

 

[Screenshot: configuring the sink dataset and destination table]
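Continuing the same programmatic sketch (reusing the client and linked services created in the previous step), the two datasets could be defined roughly as follows. The dataset names, the blob folder and file, and the target table are placeholders, and exact model classes may differ by SDK version.

    # Sketch: define the source and sink datasets for the copy (names are placeholders).
    from azure.mgmt.datafactory.models import (
        AzureBlobDataset,
        AzurePostgreSqlTableDataset,
        DatasetResource,
        LinkedServiceReference,
    )

    # Source dataset: a CSV file in a blob container.
    source_ds = DatasetResource(
        properties=AzureBlobDataset(
            linked_service_name=LinkedServiceReference(reference_name="SourceBlobLinkedService"),
            folder_path="input-container/events",
            file_name="events.csv",
        )
    )
    adf.datasets.create_or_update(rg, factory, "SourceEventsDataset", source_ds)

    # Sink dataset: the destination table in Azure Database for PostgreSQL.
    sink_ds = DatasetResource(
        properties=AzurePostgreSqlTableDataset(
            linked_service_name=LinkedServiceReference(reference_name="PostgreSqlSinkLinkedService"),
            table_name="public.events",
        )
    )
    adf.datasets.create_or_update(rg, factory, "PostgreSqlEventsDataset", sink_ds)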

Create a pipeline

Next, create a pipeline for data movement and transformation by selecting the ‘+’ (plus) sign and then selecting Pipeline, as described in the documentation. In the pipeline, map the source and sink datasets created in the previous step and perform column mapping to ensure that there are no data conversion errors when moving from source to target, as shown below.

 

 

[Screenshot: column mapping in the Copy activity]
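The same pipeline and Copy activity could be expressed programmatically along the following lines, continuing the earlier sketch. The pipeline and activity names are placeholders, and the explicit column mapping in the translator only illustrates the mapping you would configure in the UI; the source column names are hypothetical.

    # Sketch: a pipeline with a Copy activity from the blob dataset into PostgreSQL.
    from azure.mgmt.datafactory.models import (
        AzurePostgreSqlSink,
        BlobSource,
        CopyActivity,
        DatasetReference,
        PipelineResource,
    )

    copy_events = CopyActivity(
        name="CopyEventsToPostgreSQL",
        inputs=[DatasetReference(reference_name="SourceEventsDataset")],
        outputs=[DatasetReference(reference_name="PostgreSqlEventsDataset")],
        source=BlobSource(),
        sink=AzurePostgreSqlSink(write_batch_size=10000),
        # Explicit column mapping helps avoid data conversion errors when the
        # source and target column names differ (column names are hypothetical).
        translator={
            "type": "TabularTranslator",
            "mappings": [
                {"source": {"name": "TenantId"}, "sink": {"name": "tenant_id"}},
                {"source": {"name": "EventId"}, "sink": {"name": "event_id"}},
                {"source": {"name": "Payload"}, "sink": {"name": "payload"}},
            ],
        },
    )

    pipeline = PipelineResource(activities=[copy_events])
    adf.pipelines.create_or_update(rg, factory, "BlobToPostgreSqlPipeline", pipeline)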

 

Finally, after defining and creating the pipeline, on the toolbar above the canvas, select Validate to validate pipeline settings. Confirm that the pipeline has been successfully validated. To close the validation output, select the >> (right arrow) button.

Debug and run the Pipeline

After validation, on the pipeline toolbar above the canvas, select Debug to trigger a test run. You can view the status of the pipeline run on the Output tab of the pipeline settings, at the bottom.

After the data pipeline run is successful, you can deploy the pipeline and trigger the run manually or on a schedule. To monitor the pipeline runs, switch to the Monitor tab on the left and select Refresh to refresh the list, as shown below.

 

[Screenshot: monitoring pipeline runs on the Monitor tab]
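If you are scripting the pipeline instead, triggering and monitoring a run follows the same pattern as the ADF Python quickstart. The minimal sketch below assumes the client and pipeline created in the earlier sketches and simply polls the run status.

    # Sketch: trigger the pipeline and poll the run status until it completes.
    import time

    run = adf.pipelines.create_run(rg, factory, "BlobToPostgreSqlPipeline", parameters={})

    while True:
        pipeline_run = adf.pipeline_runs.get(rg, factory, run.run_id)
        print("Pipeline run status:", pipeline_run.status)
        if pipeline_run.status not in ("Queued", "InProgress"):
            break
        time.sleep(15)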

 

Note: ADF uses the Npgsql driver to connect to the Azure Database for PostgreSQL instance. The connection timeout is set to 30 seconds per the default driver configuration. If you experience query timeouts because a query is expected to take longer than 30 seconds, you can increase the timeout from the advanced connection properties in the linked service settings for Azure Database for PostgreSQL, as shown below.

 

[Screenshot: advanced connection properties in the linked service settings]
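For reference, the same effect can be sketched in the connection string of the linked service, reusing the client from the earlier sketches. Timeout and Command Timeout are standard Npgsql connection string keywords, but whether the advanced connection properties in ADF expose them exactly this way may vary, and the 120/300 second values below are only illustrative.

    # Sketch: update the linked service with longer timeouts (values are illustrative).
    from azure.mgmt.datafactory.models import (
        AzurePostgreSqlLinkedService,
        LinkedServiceResource,
        SecureString,
    )

    pg_ls_long_timeout = LinkedServiceResource(
        properties=AzurePostgreSqlLinkedService(
            connection_string=SecureString(
                value="Server=<server>.postgres.database.azure.com;Database=<db>;"
                      "Port=5432;UID=<user>@<server>;Password=<password>;"
                      "Timeout=120;Command Timeout=300"
            )
        )
    )
    adf.linked_services.create_or_update(rg, factory, "PostgreSqlSinkLinkedService", pg_ls_long_timeout)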

 

Next steps

Get started by creating a data pipeline with your preferred toolchain to bring data from disparate data sources into Azure Database for PostgreSQL as the destination.

 

If you are looking for a jumpstart, you can follow our Azure QuickStart template, which creates a data pipeline that copies data from a folder in Azure Blob storage to a table in Azure Database for PostgreSQL using ADF v2.


Resources

For any questions or feedback about Azure Database for PostgreSQL, please reach out to the AskAzureDBforPostgreSQL@service.microsoft.com alias.

 

Parikshit Savjani
Principal Program Manager, Microsoft

 
