Today, we are excited to announce the capability to run Apache Airflow DAGs (Directed Acyclic Graphs) within Azure Data Factory, adding a key open-source integration that provides extensibility for orchestrating Python-based workflows at scale on Azure.
Note: This feature is in Public Preview.
Azure Data Factory Managed Airflow provides a managed orchestration service for Apache Airflow that simplifies the creation and management of Airflow environments. In addition, it natively integrates Apache Airflow with Azure Active Directory for single sign-on (SSO) and a more secure solution than requiring basic authentication for logins.
What is Apache Airflow?
Apache Airflow is an open-source tool used to programmatically author, schedule, and monitor sequences of processes and tasks referred to as "workflows."
In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized to reflect their relationships and dependencies. A DAG is defined in a Python script, representing the DAG's structure (tasks and their dependencies) as code.
Azure Data Factory enables data engineers to bring their existing Apache Airflow workflows / DAGs into ADF, where they run on a fully managed Airflow environment (also referred to as the Airflow integration runtime). It brings the best of both worlds (Azure and the Apache Foundation) by offering Azure Data Factory's reliability, scale, security, and ease of management, combined with Apache Airflow's extensibility and community-led updates, as a managed offering on Azure.
When to use Managed Airflow?
Azure Data Factory offers pipelines to orchestrate data processes visually (UI-based authoring), while Managed Airflow offers Apache Airflow-based Python DAGs (code-centric authoring) for defining the data orchestration process.
- If you have an Apache Airflow background or are currently using Apache Airflow, you may prefer Managed Airflow over pipelines.
- Conversely, if you would prefer not to write and manage Python-based DAGs for data process orchestration, you may prefer pipelines.
With Managed Airflow, Azure Data Factory now offers multiple orchestration capabilities spanning visual, code-centric, and open-source (OSS) requirements.
What are the benefits of using Azure Data Factory Managed Airflow?
With Azure Data Factory Managed Airflow, you can use your Airflow and Python skills to create data workflows without managing the underlying infrastructure for scalability, availability, and security.
- Automatic Airflow setup – Quickly set up Apache Airflow by choosing an Apache Airflow version when you create a Managed Airflow environment. Azure Data Factory Managed Airflow sets up Apache Airflow for you using the same open-source code you can download from the Internet and provides the same familiar user interface once set up and launched.
- Automatic scaling – Automatically scale Apache Airflow workers by setting the minimum and maximum number of workers in your environment. Azure Data Factory Managed Airflow monitors the workers in your environment. It uses its autoscaling component to add workers to meet demand until it reaches the maximum number of workers you defined.
- Built-in authentication – Enables Azure Active Directory (AAD) role-based authentication and authorization for your Apache Airflow web server by defining AAD RBAC access control policies, offering single sign-on (SSO) for Azure users.
- Built-in security – Metadata is also automatically encrypted by Azure-managed keys, so your environment is secure by default. Additionally, it supports double encryption with a Customer-Managed Key (CMK).
- Streamlined upgrades and patches – Azure Data Factory Managed Airflow periodically provides new versions of Apache Airflow, and the ADF Managed Airflow team automatically updates and patches minor versions.
- Workflow monitoring – View Apache Airflow logs and metrics in Azure Monitor to identify Apache Airflow task delays or workflow errors without needing additional third-party tools. ADF Managed Airflow automatically sends environment metrics and, if enabled, Apache Airflow logs to Azure Monitor.
- Azure integration – ADF Managed Airflow supports open-source integrations with Azure Data Factory pipelines, Azure Batch, Azure Cosmos DB, Azure Key Vault, ACI, ADLS Gen2, Azure Kusto, as well as hundreds of built-in and community-created operators and sensors.
How can I get started?
Steps to get started:
- Create a new Azure Data Factory, or use an existing one, in a region that supports Managed Airflow (East US, South Central US, West US, UK South, North Europe, West Europe, Southeast Asia, Germany West Central)
- Create a new Airflow environment
- Prepare and import DAGs:
- Upload your DAGs to Azure Blob Storage: create a container or folder path named 'dags' and add your existing DAG files to that 'dags' container/path.
- Import the DAGs into the Airflow environment
- Launch and monitor Airflow DAG runs
We are always open to feedback, so please let us know your thoughts in the comments below or add to our Ideas forum.