Azure Data Factory and Azure Databricks Best Practices
Published Jan 28 2022

This post was authored by Leo Furlong, a Solutions Architect at Databricks.

 

Azure Data Factory (ADF), Synapse pipelines, and Azure Databricks make a rock-solid combo for building your Lakehouse on Azure Data Lake Storage Gen2 (ADLS Gen2). ADF can natively ingest data into the Azure cloud from more than 100 different data sources. ADF also provides graphical data orchestration and monitoring capabilities that are easy to build, configure, deploy, and operate in production.

 

Azure Databricks is the data and AI service from Databricks, available through Microsoft Azure, that stores all of your data on a simple, open lakehouse and unifies all of your analytics and AI workloads: data engineering, real-time streaming applications, data science and machine learning, and ad-hoc and BI queries. Users build flexible, scalable enterprise ETL/ELT pipelines with Azure Databricks notebooks and Delta Live Tables to shape and curate data, build and train machine learning models, perform model inference, and even stream data into the lakehouse in real time.

 

ADF has native integration with Azure Databricks via the Azure Databricks linked service and can execute notebook, JAR, and Python code activities, enabling organizations to build scalable data orchestration pipelines that ingest data from various sources and curate that data in the lakehouse.

 

The following list describes 5 key practices when using ADF and Azure Databricks.

  1. Metadata Driven Ingestion Patterns
    ADF can be used to build a metadata-driven data ingestion capability: dynamic pipelines, copy activities, and notebook activities read their configuration from metadata and copy data from source systems into tables and directories in the lakehouse without any net-new development. Companies can build their ADF ingestion framework once and rapidly onboard new data sources to the lakehouse simply by adding metadata to the solution framework.
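    On the Azure Databricks side, this pattern is often paired with a single parameterized notebook that each ADF ForEach iteration calls with different base parameters. A minimal sketch, assuming hypothetical parameter names (source_path, target_table, file_format) supplied from the metadata table; dbutils and spark are the globals available inside a Databricks notebook:

```python
# Hypothetical parameterized Databricks notebook invoked by an ADF ForEach loop.
# Widget names are illustrative; ADF supplies them as notebook activity base parameters.
source_path  = dbutils.widgets.get("source_path")    # landing-zone folder from the metadata table
target_table = dbutils.widgets.get("target_table")   # e.g. a bronze schema + table name
file_format  = dbutils.widgets.get("file_format")    # e.g. "parquet", "json", or "csv"

df = (spark.read
      .format(file_format)
      .option("header", "true")      # only meaningful for CSV sources
      .load(source_path))

# Append into the target Delta table; the same notebook serves every source described
# in the metadata, so onboarding a new feed is a metadata insert, not new code.
(df.write
   .format("delta")
   .mode("append")
   .saveAsTable(target_table))
```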

  2. ADF Ingestion to ADLS Landing Zones and Auto Loader or Directly to Delta Lake
    There are two common best-practice patterns for using ADF and Azure Databricks to ingest data into ADLS and then execute Azure Databricks notebooks that shape and curate data in the lakehouse.

    1. Ingestion using Auto Loader
      ADF copy activities ingest data from various sources and land it in ADLS Gen2 landing zones as CSV, JSON, Avro, Parquet, or image files. ADF then executes notebook activities that run Azure Databricks pipelines using Auto Loader. Auto Loader reads the landing zones in ADLS Gen2 with either a batch or a streaming pipeline and incrementally processes new files as they arrive, with benefits such as schema inference and evolution and exactly-once processing (a minimal sketch follows this list).
    2. Ingestion directly to Delta Lake
      ADF copy activities can ingest data from various sources and land it in ADLS Gen2 directly in the Delta Lake format using the ADF Delta Lake connector. ADF then executes notebook activities to run pipelines in Azure Databricks on the resulting Delta tables.
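
      A minimal Auto Loader sketch for the first pattern; the storage account, container names, paths, and target table are placeholders, and spark is the session available inside a Databricks notebook:

```python
# Minimal Auto Loader sketch; storage account, containers, paths, and the target table
# are placeholders.
landing_path    = "abfss://landing@<storage-account>.dfs.core.windows.net/sales/"
checkpoint_path = "abfss://lake@<storage-account>.dfs.core.windows.net/_checkpoints/sales_bronze"

# Auto Loader ("cloudFiles") incrementally discovers new files in the landing zone.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")              # format landed by the ADF copy activity
      .option("cloudFiles.schemaLocation", checkpoint_path)
      .load(landing_path))

# Trigger once for a batch-style run, or drop the trigger for continuous streaming.
(df.writeStream
   .option("checkpointLocation", checkpoint_path)
   .trigger(once=True)
   .toTable("bronze.sales"))
```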

      If you are unfamiliar with the benefits of Delta Lake, make sure to check out this blog post.

      [Image: Azure Databricks Lakehouse]

  3. Executing an Azure Databricks Job
    A lesser-known capability: it is easy to execute an Azure Databricks job or a Delta Live Tables pipeline from ADF using native ADF web activities and the Azure Databricks Jobs API. Executing a job lets you take advantage of the latest job features in Azure Databricks, such as cluster reuse, parameter passing, repair and rerun, and running Delta Live Tables pipelines (see the sketch below).
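    The ADF web activity simply issues an HTTP POST against the Jobs API. The equivalent call, sketched here in Python, with the workspace URL, token, and job ID as placeholders (the bearer token would come from the ADF managed identity or a PAT):

```python
import requests

# Placeholders: substitute your workspace URL, a valid token, and a real job ID.
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "<AAD token or personal access token>"

# Trigger an existing Azure Databricks job via the Jobs API 2.1 run-now endpoint.
resp = requests.post(
    f"{workspace_url}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,                                  # placeholder job ID
        "notebook_params": {"run_date": "2022-01-28"},  # optional parameter passing
    },
)
resp.raise_for_status()
print(resp.json())  # returns a run_id that can be polled with /api/2.1/jobs/runs/get
```

    Delta Live Tables pipelines can be started in a similar way through their REST API; in ADF, the POST is configured directly in the web activity rather than in code.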
    [Image: Executing an Azure Databricks Job]

  4. Pools + Job Clusters
    ADF can leverage Azure Databricks pools to create job clusters for notebook activity executions from ADF pipelines. Pools let Data Engineers use job clusters instead of all-purpose clusters in Azure Databricks without incurring the latency of job cluster spin-up. Azure Databricks pools reduce cluster start and auto-scaling times by maintaining a set of idle, ready-to-use instances; when a cluster is attached to a pool, its nodes are created from the pool's idle instances. Job clusters from pools provide the following benefits: full workload isolation, reduced pricing, per-second billing at the Jobs Compute DBU rate, auto-termination at job completion, fault tolerance, and faster job cluster creation. ADF uses a pool through the Azure Databricks linked service configuration, and the pool can be parameterized using linked service parameters and expressions (see the sketch below).
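    A sketch of an Azure Databricks linked service definition that creates job clusters from a pool, shown as a Python dict mirroring the linked service JSON; the property names follow the ADF Azure Databricks linked service schema, but treat the exact shape, IDs, and versions as illustrative placeholders:

```python
# Illustrative ADF Azure Databricks linked service that builds job clusters from a pool.
# All IDs, URLs, and versions are placeholders.
linked_service = {
    "name": "AzureDatabricksFromPool",
    "properties": {
        "type": "AzureDatabricks",
        "parameters": {
            "poolId": {"type": "String"}                    # pool chosen at runtime
        },
        "typeProperties": {
            "domain": "https://adb-1234567890123456.7.azuredatabricks.net",
            "authentication": "MSI",                        # ADF managed identity (see practice 5)
            "workspaceResourceId": "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Databricks/workspaces/<workspace>",
            "instancePoolId": "@linkedService().poolId",    # expression referencing the linked service parameter
            "newClusterVersion": "9.1.x-scala2.12",         # job cluster created from the pool's idle instances
            "newClusterNumOfWorker": "2",
        },
    },
}
```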
    [Image: Pools and job clusters]

  5. ADF Managed Identity Authentication
    ADF can authenticate to Azure Databricks using managed identity authentication or personal access tokens (PATs), and using the managed identity is the best practice. Authenticating with ADF's system-assigned managed identity is more secure and removes the burden on Data Engineers and Cloud Administrators of managing personal access tokens.
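    ADF acquires the token itself when the linked service is set to managed identity authentication; under the hood it is an Azure AD token scoped to the Azure Databricks resource. A sketch of the same flow in Python using azure-identity (the workspace URL is a placeholder, and DefaultAzureCredential picks up a managed identity when running in Azure):

```python
import requests
from azure.identity import DefaultAzureCredential

# 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d is the Azure AD application ID of the
# Azure Databricks resource; a token for this scope replaces a personal access token.
DATABRICKS_SCOPE = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default"
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder

credential = DefaultAzureCredential()   # uses a managed identity when one is available
aad_token = credential.get_token(DATABRICKS_SCOPE).token

# Any Databricks REST call can now present the AAD token instead of a PAT.
resp = requests.get(
    f"{workspace_url}/api/2.1/jobs/list",
    headers={"Authorization": f"Bearer {aad_token}"},
)
resp.raise_for_status()
print(resp.json())
```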