ETL in the Cloud Is Made Easy with Azure Data Factory and Azure Databricks
Published Feb 23 2020

Data engineering in the cloud has emerged in recent years as the most crucial aspect of every successful data modernization project. Without accurate and timely data, business decisions based on analytical reports and models can lead to poor outcomes. The life of a data engineer is not always glamorous, and you don't always receive the credit you deserve, but the importance of the role is undeniable. The partnership between Microsoft Azure Data Factory (ADF) and Azure Databricks gives cloud data engineers a toolkit that makes the job easier and more productive.

 

[Image: etl1.png — an example ADF workflow]

The combination of these cloud data services gives you the power to design workflows like the one above. ADF has built-in facilities for workflow control, data transformation, pipeline scheduling, data integration, and much more, letting you produce quality data at cloud scale and cloud velocity, all from a single pane of glass.

 

If you are a data developer who writes and debugs Spark code in Azure Databricks (Notebooks, Scala, JARs, Python, SparkSQL, etc.), you can point to those data routines directly from a Databricks activity in an ADF pipeline. You can then combine that logic with any of the other activities available in ADF, including looping, stored procedures, Azure Functions, REST APIs, and many others that let you take advantage of other Azure services:

 

[Image: etl2.png — Databricks activities in the ADF pipeline designer]
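Under the covers, a Databricks activity like the one pictured is just pipeline metadata, so it can also be created programmatically. Below is a minimal sketch, assuming the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, notebook path, and linked service names are all illustrative placeholders, not values from this article:

```python
# A minimal sketch, assuming the azure-mgmt-datafactory Python SDK.
# Subscription, resource group, factory, notebook path, and linked
# service names are illustrative placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatabricksNotebookActivity,
    LinkedServiceReference,
    PipelineResource,
)

adf_client = DataFactoryManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# The activity references a notebook in the Databricks workspace that
# the linked service points at; base_parameters surface as widget
# values inside the notebook.
notebook_activity = DatabricksNotebookActivity(
    name="TransformSales",
    notebook_path="/Shared/etl/transform_sales",
    base_parameters={"run_date": "2020-02-23"},
    # Recent SDK versions may also require type="LinkedServiceReference".
    linked_service_name=LinkedServiceReference(
        reference_name="AzureDatabricksLinkedService"
    ),
)

# Publish a one-activity pipeline; loops, stored procedures, Azure
# Functions, etc. can be chained in the same activities list.
adf_client.pipelines.create_or_update(
    resource_group_name="my-resource-group",
    factory_name="my-data-factory",
    pipeline_name="SalesEtlPipeline",
    pipeline=PipelineResource(activities=[notebook_activity]),
)
```

The same definition can of course be authored visually in the ADF portal, which is how the pipelines in this article are built.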

ADF provides hooks into your Azure Databricks workspaces to orchestrate your transformation code. So, as you build up your library of data transformation routines, either as code in Databricks Notebooks or as visual logic in ADF Data Flows, you can combine them into scheduled ETL pipelines.
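To make that handshake concrete, here is a sketch of what the notebook side might look like: reading a base parameter passed from the ADF activity and returning a value the pipeline can consume. The parameter name run_date and the staging_sales table are illustrative; dbutils and spark are available by default inside a Databricks notebook.

```python
# Declare the widget so interactive runs have a default, then read the
# value ADF passed via the activity's baseParameters.
dbutils.widgets.text("run_date", "")
run_date = dbutils.widgets.get("run_date")

# ... transformation logic for that slice of data ...
row_count = spark.table("staging_sales").where(f"sale_date = '{run_date}'").count()

# Return a value to ADF; downstream activities can read it with the
# expression @activity('<activity name>').output.runOutput.
dbutils.notebook.exit(str(row_count))
```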

 

If you prefer a more visually oriented approach to data transformation, ADF has built-in data flow capabilities with an easy-to-use UI that lets you construct complex ETL processes, like this generic approach to a slowly changing dimension:

 

[Image: etl3.png — slowly changing dimension pattern in ADF Data Flows]

[Image: etl4.png — slowly changing dimension pattern in ADF Data Flows, continued]
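For a sense of what the visual flow above expresses, here is a rough PySpark equivalent of the Type 2 slowly-changing-dimension pattern, the kind of code you might otherwise write in a Databricks notebook. The table and column names (dim_customer, staging_customer, address) are illustrative only, and the staging table is assumed to share the dimension's business columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

dim = spark.table("dim_customer")          # existing dimension rows
updates = spark.table("staging_customer")  # incoming source rows

# Incoming rows whose tracked attribute differs from the current version.
current = dim.filter(F.col("is_current"))
changed = (
    updates.alias("u")
    .join(current.alias("d"), "customer_id")
    .filter(F.col("u.address") != F.col("d.address"))
    .select("u.*")
)

# Expire the superseded current versions...
expired = (
    current.join(changed.select("customer_id"), "customer_id", "left_semi")
    .withColumn("is_current", F.lit(False))
    .withColumn("end_date", F.current_date())
)

# ...and open new current versions from the incoming rows.
opened = (
    changed.withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
)

# Reassemble: history + untouched current rows + expired + new versions.
historical = dim.filter(~F.col("is_current"))
untouched = current.join(changed.select("customer_id"), "customer_id", "left_anti")
new_dim = historical.unionByName(untouched).unionByName(expired).unionByName(opened)

# Write to a staging table rather than overwriting the source mid-read.
new_dim.write.mode("overwrite").saveAsTable("dim_customer_rebuilt")
```

The visual data flow handles the same branching, expiring, and reassembling with drag-and-drop transformations instead of code.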

 

Use the ADF visual design canvas to construct ETL pipelines in minutes with live interactive debugging, source control, CI/CD, and monitoring.

 

Whichever paradigm you prefer, Azure Data Factory provides best-in-class tooling for data engineers who are tasked with solving complex data problems at scale using Azure Databricks for data processing.

 

2 Comments
avixorld

It's a nice article, however my question is that nowadays we can do most of the data transformation via ADF, and in ADF the underlying technology is Spark, just like Databricks. So I'm still wondering: why do we need Databricks in this architecture at all?

(deleted user)

@avixorld I guess you're pointing towards the new ADF Data Flows. But when it comes to more customized ETL, or let's say heavy data transformation with lots of custom logic, and you have to choose a framework like Apache Spark, then you're left with two options: Databricks or HDInsight, of which Databricks is much more flexible and ready to use. Please correct me if I am wrong.
