Healthcare and Life Sciences Blog

Part 1 - Unlock the Power of Azure Data Factory: A Guide to Boosting Your Data Ingestion Process

Joe_Fitzgerald
Jul 06, 2023

Introduction

In the fast-paced world of cloud architecture, securely collecting, ingesting, and preparing data for healthcare industry solutions has become an essential requirement. That's where Azure Data Factory (ADF) comes in. As an Extract, Transform, and Load (ETL) cloud service, ADF lets you scale out serverless data integration and data transformation with ease.

 

Imagine being able to effortlessly create data-driven workflows that orchestrate data movement and transform massive amounts of data in a single stroke. With ADF's code-free UI, intuitive authoring, and comprehensive monitoring and management capabilities, you can turn this vision into reality.

 

The health care industry presents numerous opportunities where ADF can play a pivotal role. From donor-patient cross-matching to health data consortiums, risk prediction for surgeries, and population health management, ADF can be a game-changer in delivering efficient and effective solutions.

 

However, transitioning from an architecture diagram to a fully functional data factory in a real-world scenario is no small feat. Many organizations start with an ADF in a development environment and eventually need to promote it to staging and production environments. So inevitably, there's that moment when you impress your team and manager because you have the Azure Data Factory working just the way you want in the development environment. Then your manager says, “This is great! Let's get it into staging and production as soon as we can!” And then you realize, “Uh oh!”, this isn't going to be easy. This process can be daunting, but fear not – we're here to guide you every step of the way.

 

In this multi-part blog series, Joe_Fitzgerald and j_folberth will provide you with invaluable sample guidance for developing and deploying an Azure Data Factory to multiple environments. By the end of this series, you'll be equipped with the skills and knowledge to proudly add "DevOps Engineer" to your professional title.

 

This blog series provides a sample guide that will cover the following topics:

Part 1

  1. Architecture and Scenario
  2. Creating resources in Azure
  3. Create Azure Storage Containers
  4. Create Azure Key Vaults
  5. Create Azure Data Factory: With Key Vault Access

Part 2 

  1. Configure Azure Data Factory Source Control
  2. Construct Azure Data Factory Data Pipeline
  3. Publishing Concept for Azure Data Factory
  4. Configure Deployed Azure Resources

Part 3

  1. The YAML Pipeline Structure
  2. The Publish Process
  3. ARM Template Parameterization
  4. ADF ARM Template Deployment

Part 4

  1. How to use Azure DevOps Pipeline Templates

Part 5

  1. How to deploy Azure Data Factory Linked Templates

 

Architecture and Scenario

In our quest for a streamlined solution, we will construct an Azure Data Factory (ADF) workflow, also known as a data pipeline, that seamlessly moves files from one folder to another within an Azure Storage container. Imagine a scenario where files are stored in an input folder, waiting for processing, and then need to be placed in an output folder. Our goal is to achieve this with simplicity, security, and efficiency.

 

Our data factory will authenticate to the corresponding storage account using its Access Key. However, we won't store this key directly in our code or repository. Instead, we'll leverage the robust security capabilities of Azure Key Vault to securely store the Access Key. By doing so, we reduce potential security vulnerabilities and the risk of unauthorized access.

 

To further enhance security, we'll assign the managed identity of our Azure Data Factory specific permissions to retrieve and list secrets from the Azure Key Vault. This ensures that only authorized entities can access the sensitive information stored within the vault.

Now, let's discuss the importance of maintaining controlled environments. Our application will have three distinct environments: development (dev), staging, and production. It is critical to emphasize that updates to these environments should only occur through our Continuous Integration/Continuous Deployment (CI/CD) process. Just as you wouldn't allow anyone to log into a web server and make arbitrary updates, the same principle applies to Azure Data Factory. This stringent approach guarantees that all changes are properly tested, reviewed, and deployed in a controlled manner, reducing the potential for errors or security vulnerabilities.

 

Throughout this blog series, we'll dive deeper into each environment, providing valuable insights and practical guidance on establishing the necessary resources and configuring the CI/CD pipeline. Our aim is to empower you to take full control of your data pipeline, ensuring efficiency and security at every step. Get ready to embark on a secure and efficient data journey that will transform the way you handle your Azure Data Factory environment.

 

Creating Resources in Azure

In the initial phase of our journey, we'll focus on manually creating essential resources in Azure to establish a solid foundation for our data factory. To begin, let's create a resource group named "rg-datafactory-demo" to house our data factories, key vaults, and storage accounts. Microsoft's documentation provides comprehensive quick start tutorials that align perfectly with our application architecture. We highly recommend following these tutorials for detailed instructions on creating each resource. You can find the relevant documentation links below.
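If you prefer to script this first step rather than use the portal, here is a minimal Python sketch using the Azure SDK. It assumes the azure-identity and azure-mgmt-resource packages are installed, that you are already signed in (for example via az login), and that the subscription ID and region are placeholders you would replace with your own values:

```python
# Minimal sketch: create the demo resource group from code.
# Assumes azure-identity and azure-mgmt-resource are installed and you are
# signed in (e.g., via `az login`). Subscription ID and region are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<your-subscription-id>"  # placeholder
credential = DefaultAzureCredential()

resource_client = ResourceManagementClient(credential, subscription_id)

# Create (or update) the resource group that will hold the data factories,
# key vaults, and storage accounts for all three environments.
rg = resource_client.resource_groups.create_or_update(
    "rg-datafactory-demo",
    {"location": "eastus"},
)
print(f"Resource group ready: {rg.name} in {rg.location}")
```

The portal accomplishes exactly the same thing; scripting it simply makes the step repeatable across subscriptions.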

 

The image below illustrates the Azure resources required for each environment. Each environment will have its own data factory, key vault, and storage account. For simplicity, we have included all the resources within the same resource group in this example. As we move forward in this blog series, we'll delve deeper into the configuration and integration of these resources.

 

 

Create Azure Storage Containers

To lay the groundwork for our data pipeline, we need to create Azure Storage containers. Follow these steps to set up your containers:

 

  1. Creating Storage Accounts
    Begin by following the instructions provided in Microsoft's documentation. Focus on filling out the Basics tab while setting up your storage accounts for each environment. This will ensure that you have the necessary foundation in place for the subsequent steps.
  2. Creating Containers
    For each storage account, create a container named "adftutorial". These containers will serve as the designated spaces for organizing and managing your data within the storage accounts.
  3. Uploading Sample Files
    To simulate real-world scenarios, upload some sample files to the "input" folder within each storage account. These files will serve as the input data for our data pipeline, enabling us to test and validate its functionality effectively.
  4. Creating Environment-Specific Storage Accounts
    To ensure proper separation and control of environments, create a storage account for each environment: development (dev), staging, and production. This segregation allows for independent management and deployment of resources, minimizing the risk of interference or accidental changes.

By following these steps, you'll establish the necessary storage infrastructure for your data pipeline.
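If you'd rather script the container setup than click through the portal, the sketch below uses the azure-storage-blob package. It assumes you've copied the connection string from the storage account's Access keys blade; the connection string value and the sample file name are placeholders for illustration:

```python
# Minimal sketch: create the "adftutorial" container and seed an input file.
# Assumes azure-storage-blob is installed; the connection string is a placeholder
# copied from the storage account's Access keys blade.
from azure.storage.blob import BlobServiceClient

connection_string = "<dev-storage-account-connection-string>"  # placeholder

blob_service = BlobServiceClient.from_connection_string(connection_string)

# Create the "adftutorial" container used throughout this series.
container = blob_service.get_container_client("adftutorial")
if not container.exists():
    container.create_container()

# Upload a sample file under the "input" folder. A "folder" in blob storage is
# just a name prefix; the pipeline will later move this blob to an output folder.
container.upload_blob(
    name="input/sample.txt",                 # hypothetical sample file name
    data=b"FirstName,LastName\nJane,Doe\n",  # placeholder content
)
```

Run the same snippet once per environment, pointing it at the dev, staging, and production storage accounts in turn.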

 

Create Azure Key Vaults

To ensure the security of your sensitive information, it's crucial to create Azure Key Vaults and securely store your secrets. Follow these steps to set up your key vaults:

 

  1. Creating Key Vaults
    Refer to Microsoft's documentation for detailed instructions on creating key vaults. Create a separate key vault for each environment: development (dev), staging, and production. This segregation allows for better management and control of secrets across different stages of your data pipeline.
  2. Creating Secrets
    Within each key vault, create a secret named "storageaccount-adf-demo". This secret will house your storage account's Access Key, specifically the Connection String extracted from the Access Keys section.

     

  3. Storing Connection Strings
    Retrieve the Connection String for each environment (dev, staging, prod) from their respective storage accounts. To do this, navigate to the desired storage account and select the Access keys menu, which can be found under the Security + networking tab. Copy the Connection String provided.

     

  4. Saving Secrets in Key Vaults
    In each corresponding key vault, save the Connection String as the value for the "storageaccount-adf-demo" secret. Ensure that you save the correct Connection String for each environment, maintaining the appropriate segregation and access controls.

By diligently following these steps, you'll establish a secure and controlled environment for storing your secrets. Later in this blog, we'll delve into the integration of Azure Key Vaults with Azure Data Factory, enabling seamless access to secrets and enhancing the security of your data pipeline managed identities.
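As with the other steps, the secret can also be written programmatically. The sketch below assumes the azure-identity and azure-keyvault-secrets packages, and uses placeholder vault and connection-string values; run it once per environment with that environment's values:

```python
# Minimal sketch: store the storage account connection string as the
# "storageaccount-adf-demo" secret. Assumes azure-identity and
# azure-keyvault-secrets are installed; vault name and connection string
# are placeholders for the dev environment.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()

vault_url = "https://<your-dev-key-vault-name>.vault.azure.net"  # placeholder
connection_string = "<dev-storage-account-connection-string>"    # placeholder

secret_client = SecretClient(vault_url=vault_url, credential=credential)

# Save the connection string as the secret that ADF will later reference.
secret_client.set_secret("storageaccount-adf-demo", connection_string)

# Verify it round-trips (your identity needs Get permission on secrets).
print(secret_client.get_secret("storageaccount-adf-demo").name)
```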

 

Create Azure Data Factory: With Key Vault Access

To finish creating our foundational resources, it's time to create Azure Data Factory instances. In addition, we’ll show you how to configure each data factory to access the secrets in their associated key vault. Follow these steps to set up your data factories:

 

  1. Creating Data Factories
    Refer to the Microsoft documentation for detailed instructions on creating data factories. Create a separate data factory for each environment: development (dev), staging, and production. When configuring the data factories, ensure that the "Configure Git later" option is selected in the Git Configuration tab. Leave all other settings at their default values.

     

  2. Configuring Access in Azure Key Vault
    Once you have created the Azure Key Vault and Azure Data Factory, the next step is to configure the Access settings in the Azure Key Vault. Return to the development environment key vault and click on "Access configuration" under the Settings menu.

     

  3. Setting up Key Vault Access Policy
    To grant data plane access to the data factory, we will utilize a key vault access policy. Select "Go to access policies" and click "Create". In the Secret permissions section, choose the "Get" and "List" permissions. Then, proceed to the next step.

     

  4. Selecting the Principal
    Select the development data factory as the Principal for this access policy. Continue clicking "Next" until you reach the "Create" button.

     

  5. Replicate the Configuration
    Repeat the same steps for each data factory and key vault in every environment: dev, staging, and production. This ensures consistent access control across all instances, maintaining security and integrity.
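For reference, here is a minimal Python sketch of this configuration for a single environment, assuming the azure-identity, azure-mgmt-datafactory, and azure-mgmt-keyvault packages; the subscription, tenant, factory, and vault names are placeholders. It creates a data factory with a system-assigned managed identity (and no Git configuration, matching the "Configure Git later" option) and then adds a key vault access policy granting that identity Get and List secret permissions:

```python
# Minimal sketch: create a data factory with a system-assigned managed identity,
# then grant that identity Get/List secret permissions on the matching key vault.
# All names and IDs below are placeholders for the dev environment.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory, FactoryIdentity
from azure.mgmt.keyvault import KeyVaultManagementClient
from azure.mgmt.keyvault.models import (
    AccessPolicyEntry,
    Permissions,
    VaultAccessPolicyParameters,
    VaultAccessPolicyProperties,
)

subscription_id = "<your-subscription-id>"      # placeholder
tenant_id = "<your-tenant-id>"                  # placeholder
resource_group = "rg-datafactory-demo"
factory_name = "<your-dev-data-factory-name>"   # placeholder
vault_name = "<your-dev-key-vault-name>"        # placeholder

credential = DefaultAzureCredential()

# 1. Create the data factory with a system-assigned managed identity
#    (no Git configuration yet, i.e., "Configure Git later").
adf_client = DataFactoryManagementClient(credential, subscription_id)
factory = adf_client.factories.create_or_update(
    resource_group,
    factory_name,
    Factory(location="eastus", identity=FactoryIdentity(type="SystemAssigned")),
)

# 2. Add a key vault access policy granting the factory's managed identity
#    Get and List permissions on secrets.
kv_client = KeyVaultManagementClient(credential, subscription_id)
kv_client.vaults.update_access_policy(
    resource_group,
    vault_name,
    "add",  # operation kind: add a new access policy entry
    VaultAccessPolicyParameters(
        properties=VaultAccessPolicyProperties(
            access_policies=[
                AccessPolicyEntry(
                    tenant_id=tenant_id,
                    object_id=factory.identity.principal_id,
                    permissions=Permissions(secrets=["get", "list"]),
                )
            ]
        )
    ),
)
```

Repeating this sketch (or the portal steps above) for staging and production keeps access control consistent across all three environments.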

Conclusion

In Part 1 of this series, we discussed what we are trying to accomplish with the architecture scenario, showed which Azure resources are needed, and walked through how to create them. The required resources are data factories, storage accounts, and key vaults for each environment (dev, staging, production). In addition, we showed you how to use the data factories' managed identities to access the appropriate secrets in the key vaults.

 

By following these steps, you successfully created and configured your Azure Data Factory instances, laying the groundwork for efficient orchestration of your data workflow. In Part 2 of our blog series, we'll guide you through configuring source control, constructing a data pipeline, and understanding the “publishing concept”, as well as creating an app registration and pairing it with a service connection that will later be used to deploy the data factory to “live mode”.

Updated Oct 17, 2023