Azure Data Components Network Architecture with secure configurations
Published Feb 09 2022 12:30 AM
Microsoft

Use Case:

• This blog is useful when the code for the data components (ADF, ADB and SQL Pool) needs to be promoted to higher environments securely, without public internet access.

• For this use case, we integrated the data components with VNETs and disabled public access.

• We built the deployment (CI-CD) pipelines so that they can deploy only through a self-hosted agent that has access to the VNET.

This blog guides you through setting up the data components securely, with a network diagram included.

 

Network Architecture for Azure Data Resources in V-NET

  • The Network Architecture diagram below shows the Azure Data Components (Azure Data Factory, Azure Databricks, Azure Synapse) in secure Virtual Networks.

  • None of these Azure Data Components is accessible from the public internet, and they are connected to each other securely.

  • Virtual Networks and components in the Network Architecture Diagram:

    1. Azure Synapse workspace and Azure Data Factory are provisioned with Managed Virtual Networks (Azure Data Factory Managed VNET, Synapse Managed VNET).
    2. Azure Databricks is within a custom Virtual Network (Data VNET).
    3. The Azure Storage Accounts, Azure Key Vault, Azure Synapse workspace and Azure Data Factory are connected to the Data VNET through Private Endpoints so that data transfer between these components is secure.
    4. A virtual machine (within the Data VNET, on a separate subnet) is configured as the ADF SHIR (Self Hosted Integration Runtime) to run Azure Databricks notebooks from Azure Data Factory.
    5. A virtual machine within the Data VNET is configured as the Azure DevOps Self Hosted Build Agent so that CI-CD (Continuous Integration, Continuous Deployment) pipelines can run, because these data components are not accessible from the public internet.
    6. A virtual machine used as a jumpbox with Bastion login is configured so that application code can be accessed securely in the DEV environments ONLY. (This machine is not present in any higher environment.)

Akshay_Attota_2-1644389187236.png

 

Azure Data Components Secure Network Setup

  • This section explains how the data components are configured so that only components within the virtual networks can access them and access from the public internet is blocked.

Synapse Secure Network setup:

  • Synapse Workspace is set up with a managed VNET
  • Synapse Workspace is configured using a private link hub
  • Synapse Workspace must be accessed via private endpoints through the private link hub.
    poornamishra_0-1644397549321.png

     

  • Public Network Access to workspace endpoints must be disabled
    Akshay_Attota_1-1644390563588.png

     

  • TDE (Transparent Data Encryption) needs to be enabled

Akshay_Attota_2-1644392359365.png

  • A managed identity needs to be provisioned during creation
    Akshay_Attota_4-1644392420713.png
  • The SQL Active Directory Admin should be set to a group, not to one specific user
  • Azure Resource Locks should be turned on to prevent accidental deletion by users
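
A checklist like the one above can be verified automatically before a deployment. The following is a minimal sketch in Python, assuming the workspace definition is available as a dictionary shaped like an ARM resource. The property names (managedVirtualNetwork, publicNetworkAccess, identity.type) reflect my reading of the Microsoft.Synapse/workspaces schema and should be verified against the current ARM reference; the function name is hypothetical. TDE, the Active Directory admin and resource locks are separate resources and are not covered here.

```python
from typing import Dict, List


def synapse_findings(workspace: Dict) -> List[str]:
    """Return the checklist items that a workspace definition fails.

    The keys checked here are assumptions based on the ARM schema for
    Microsoft.Synapse/workspaces; adjust them if your template differs.
    """
    props = workspace.get("properties", {})
    identity = workspace.get("identity", {})
    findings = []
    # A managed VNET is requested with the literal value "default".
    if props.get("managedVirtualNetwork") != "default":
        findings.append("managed VNET is not enabled")
    # Public access to the workspace endpoints must be switched off.
    if props.get("publicNetworkAccess") != "Disabled":
        findings.append("public network access is not disabled")
    # The identity type may also be "SystemAssigned,UserAssigned".
    if "SystemAssigned" not in identity.get("type", ""):
        findings.append("managed identity is not provisioned")
    return findings


if __name__ == "__main__":
    example = {
        "identity": {"type": "SystemAssigned"},
        "properties": {
            "managedVirtualNetwork": "default",
            "publicNetworkAccess": "Enabled",
        },
    }
    for finding in synapse_findings(example):
        print("FAIL:", finding)
```

An empty list means the three checked settings match the secure setup described in this section.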

Databricks Workspace Secure Network Setup:

  • Enable the Databricks IP access list feature (managed through the IP access list API) in order to:

    1. Restrict Portal access to Databricks workspace for specific IP addresses
    2. Restrict Databricks API calls to specific IP addresses
      Akshay_Attota_8-1644392578388.png
  • Configure the Databricks Workspace with VNET injection and with no public IP (NPIP) enabled

    Akshay_Attota_9-1644392612381.png

     

  • Encrypt communication between Databricks nodes using global init scripts

    1. Replace <init-script-folder> with the DBFS location where the init script should be stored.
    2. Run the notebook below to create the script enable-encryption.sh.
    3. Configure the Databricks workspace with enable-encryption.sh as a global init script, using the global init script REST API.

    The notebook that creates enable-encryption.sh:

 

 

 

 

 

 

 

dbutils.fs.put("dbfs:/<init-script-folder>/init/enable-encryption.sh", """
#!/bin/bash
 
keystore_file="$DB_HOME/keys/jetty_ssl_driver_keystore.jks"
keystore_password="gb1gQqZ9ZIHS"
sasl_secret=$(sha256sum $keystore_file | cut -d' ' -f1)
 
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  driver_conf=${DB_HOME}/driver/conf/spark-branch.conf
  echo "Configuring driver conf at $driver_conf"
  if [ ! -e $driver_conf ] ; then
    touch $driver_conf
  fi
    
  head -n 1 ${DB_HOME}/driver/conf/spark-branch.conf >> $driver_conf
 
  echo "  // Authenticate">> $driver_conf
  echo "  \\"spark.authenticate\\" = true" >> $driver_conf
  echo "  \\"spark.authenticate.secret\\" = \\"$sasl_secret\\"" >> $driver_conf
 
  echo "  // Configure AES encryption">> $driver_conf
  echo "  \\"spark.network.crypto.enabled\\" = true" >> $driver_conf
  echo "  \\"spark.network.crypto.saslFallback\\" = false" >> $driver_conf
 
  echo "  // Configure SSL">> $driver_conf
  echo "  \\"spark.ssl.enabled\\" = true" >> $driver_conf
  echo "  \\"spark.ssl.keyPassword\\" = \\"$keystore_password\\"" >> $driver_conf
  echo "  \\"spark.ssl.keyStore\\" = \\"$keystore_file\\"" >> $driver_conf
  echo "  \\"spark.ssl.keyStorePassword\\" = \\"$keystore_password\\"" >> $driver_conf
  echo "  \\"spark.ssl.protocol\\" = \\"TLSv1.2\\"" >> $driver_conf
  echo "  \\"spark.ssl.standalone.enabled\\" = true" >> $driver_conf
  echo "  \\"spark.ssl.ui.enabled\\" = true" >> $driver_conf
  echo " }"  >> $driver_conf
  echo "Successfully configured driver conf at $driver_conf"
fi  
 
spark_defaults_conf="$DB_HOME/spark/conf/spark-defaults.conf"
echo "Configuring spark defaults conf at $spark_defaults_conf"
if [ ! -e $spark_defaults_conf ] ; then
  touch $spark_defaults_conf
fi
echo "spark.authenticate true" >> $spark_defaults_conf
echo "spark.authenticate.secret $sasl_secret" >> $spark_defaults_conf
 
echo "spark.network.crypto.enabled true" >> $spark_defaults_conf
echo "spark.network.crypto.saslFallback false" >> $spark_defaults_conf
 
echo "spark.ssl.enabled true" >> $spark_defaults_conf
echo "spark.ssl.keyPassword $keystore_password" >> $spark_defaults_conf
echo "spark.ssl.keyStore $keystore_file" >> $spark_defaults_conf
echo "spark.ssl.keyStorePassword $keystore_password" >> $spark_defaults_conf
echo "spark.ssl.protocol TLSv1.2" >> $spark_defaults_conf
echo "spark.ssl.standalone.enabled true" >> $spark_defaults_conf
echo "spark.ssl.ui.enabled true" >> $spark_defaults_conf
echo "Successfully configured spark defaults conf at $spark_defaults_conf"
""", True)

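The REST calls mentioned in this section (switching on IP access lists, adding an allow list, and registering the global init script) can be sketched as request builders. This is a minimal sketch, not the exact procedure we ran: it only builds the method, path and JSON body for each call, so sending them with a bearer token against your workspace URL is left to you. The endpoint paths and field names follow the Databricks REST API 2.0 as I understand it and should be checked against the current API reference; the function names, the label and the CIDR range are hypothetical.

```python
import base64
import json
from typing import Dict, List, Tuple

# (HTTP method, API path, JSON body)
Request = Tuple[str, str, Dict]


def enable_ip_access_lists() -> Request:
    # Workspace-level switch; the value is the string "true", not a boolean.
    return ("PATCH", "/api/2.0/workspace-conf", {"enableIpAccessLists": "true"})


def allow_list(label: str, cidrs: List[str]) -> Request:
    # An ALLOW list restricts both portal access and API calls to these ranges.
    body = {"label": label, "list_type": "ALLOW", "ip_addresses": cidrs}
    return ("POST", "/api/2.0/ip-access-lists", body)


def global_init_script(name: str, script_text: str, position: int = 0) -> Request:
    # The global init scripts API expects the script content base64-encoded.
    encoded = base64.b64encode(script_text.encode("utf-8")).decode("ascii")
    body = {"name": name, "script": encoded, "enabled": True, "position": position}
    return ("POST", "/api/2.0/global-init-scripts", body)


if __name__ == "__main__":
    requests_to_send = [
        enable_ip_access_lists(),
        allow_list("data-vnet", ["10.0.0.0/24"]),  # hypothetical range
        global_init_script("enable-encryption", "#!/bin/bash\necho placeholder\n"),
    ]
    for method, path, body in requests_to_send:
        print(method, path, json.dumps(body))
```

Keeping the builders free of network code makes them easy to review and to reuse from a pipeline step that runs on the self-hosted agent.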
 

 

 

 

 

 

 

 

Azure Data Factory Secure Network Setup:

  • ADF is provisioned with a managed VNET
  • The network access of ADF is set to connect via private endpoints to the Data VNET (custom VNET)
    Akshay_Attota_13-1644393123383.png
  • Create a SHIR (Self Hosted Integration Runtime) for the Data Factory to access resources within the Data VNET.
    Akshay_Attota_14-1644393218728.png

     

  • SHIR in Linked Services
    • Data Factory connects to Databricks via a SHIR that sits in the same VNET as Databricks, but on a separate subnet. The connection is authenticated via managed identity, which must have Contributor RBAC permissions on that subnet.
  • Example of a Databricks linked service
{
   "name": "ls_databricks",
   "properties": {
       "description": "Linked Service for connecting to Databricks",
       "annotations": [],
       "type": "AzureDatabricks",
       "typeProperties": {
           "domain": "https://adb-XXXXX.net",
           "authentication": "MSI",
           "workspaceResourceId": "/subscriptions/XXXXXX/resourceGroups/rg-dev/providers/Microsoft.Databricks/workspaces/XXXXX",
           "newClusterNodeType": "Standard_DS4_v2",
           "newClusterNumOfWorker": "2:10",
           "newClusterSparkEnvVars": {
               "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
           },
           "newClusterVersion": "8.2.x-scala2.12",
           "newClusterInitScripts": []
       },
       "connectVia": {
           "referenceName": "selfHostedIr",
           "type": "IntegrationRuntimeReference"
       }
   }
}
  • Create Managed Private Endpoints for accessing resources that sit outside the ADF managed VNET and have no public internet access. For example, a Synapse SQL pool cannot be reached from the public internet and is outside the ADF managed VNET. Therefore a Managed Private Endpoint needs to be created for Data Factory to access the Synapse SQL pool.

Akshay_Attota_15-1644393303919.png
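
If the managed private endpoint is deployed from a template rather than through the portal, its definition is small. The following Python sketch builds the properties I would expect for a Microsoft.DataFactory/factories/managedVirtualNetworks/managedPrivateEndpoints resource. The function name, the endpoint name and the "sql" sub-resource value are assumptions for illustration; confirm the group ID for your target in the portal or the ARM reference.

```python
from typing import Dict


def managed_private_endpoint(name: str, target_resource_id: str, group_id: str) -> Dict:
    """Build the body of an ADF managed private endpoint definition."""
    if not target_resource_id.startswith("/subscriptions/"):
        raise ValueError("target_resource_id must be a full Azure resource ID")
    return {
        "name": name,
        "properties": {
            # Resource the endpoint connects to, e.g. the Synapse workspace.
            "privateLinkResourceId": target_resource_id,
            # Sub-resource of the target; "sql" is assumed here for the SQL pool.
            "groupId": group_id,
        },
    }


if __name__ == "__main__":
    endpoint = managed_private_endpoint(
        "mpe-synapse-sql",  # hypothetical name
        "/subscriptions/XXXXXX/resourceGroups/rg-dev/providers/"
        "Microsoft.Synapse/workspaces/XXXXX",
        "sql",
    )
    print(endpoint)
```

After the endpoint is created, the connection still has to be approved on the target resource before Data Factory can use it.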

Azure Key Vault

  • Azure Key Vault should be configured with a private endpoint to prevent access from the public internet.
  • In addition to general secret management, Azure Key Vault can be integrated with Azure Databricks as an Azure Key Vault-backed secret scope.
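
A Key Vault-backed secret scope can be created through the Databricks UI or the Secrets API. As a minimal sketch, the function below builds the JSON body for POST /api/2.0/secrets/scopes/create. The field names follow the Databricks Secrets API as I understand it, the DNS name assumes the Azure public cloud, and the scope and vault names are hypothetical. Inside a notebook, a secret would then be read with dbutils.secrets.get(scope=..., key=...).

```python
from typing import Dict


def keyvault_scope_payload(scope: str, vault_name: str, vault_resource_id: str) -> Dict:
    """Build the request body for creating a Key Vault-backed secret scope."""
    return {
        "scope": scope,
        "scope_backend_type": "AZURE_KEYVAULT",
        "backend_azure_keyvault": {
            "resource_id": vault_resource_id,
            # Assumes the Azure public cloud DNS suffix.
            "dns_name": f"https://{vault_name}.vault.azure.net/",
        },
    }


if __name__ == "__main__":
    payload = keyvault_scope_payload(
        "kv-scope",    # hypothetical scope name
        "kv-data-dev", # hypothetical vault name
        "/subscriptions/XXXXXX/resourceGroups/rg-dev/providers/"
        "Microsoft.KeyVault/vaults/kv-data-dev",
    )
    print(payload)
```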

Azure Datalake Storage Accounts

  • Public Access to all Data Lakes should be disabled.
  • Private Endpoint Access should be configured for all Data Lakes
  • VNET access is configured where necessary for Azure Storage Explorer on VMs located in the custom VNET.
  • ACL permissions to containers are programmatically handled via PowerShell code
  • RBAC roles are restricted to Azure resource managed identities when specifically required, e.g. the Storage Blob Data Contributor role for Azure Data Factory.
  • In addition to the points above, the screenshot below shows the storage exceptions and network routing settings

Akshay_Attota_16-1644393414477.png
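
We handle container ACLs with PowerShell; the sketch below is not that code. It only illustrates, in Python, the POSIX-style ACL string format that Azure Data Lake Storage Gen2 uses (entries such as user:<object-id>:r-x, with a default: prefix for entries inherited by new children). The function names and the object ID are hypothetical.

```python
from typing import Iterable


def acl_entry(principal_id: str, perms: str, kind: str = "user", default: bool = False) -> str:
    """Format one ACL entry, e.g. 'user:<object-id>:r-x'."""
    if kind not in ("user", "group"):
        raise ValueError("kind must be 'user' or 'group'")
    # Each position may hold its permission letter or '-'.
    allowed = ("r-", "w-", "x-")
    if len(perms) != 3 or any(c not in ok for c, ok in zip(perms, allowed)):
        raise ValueError("perms must look like 'rwx', 'r-x', '---', ...")
    prefix = "default:" if default else ""
    return f"{prefix}{kind}:{principal_id}:{perms}"


def acl_string(entries: Iterable[str]) -> str:
    """Join entries into the comma-separated form used when setting an ACL."""
    return ",".join(entries)


if __name__ == "__main__":
    object_id = "00000000-0000-0000-0000-000000000000"  # hypothetical identity
    print(acl_string([
        acl_entry(object_id, "r-x"),
        acl_entry(object_id, "r-x", default=True),
    ]))
```

Granting both an access entry and a default entry is a common pattern so that files created later inherit the same permissions.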

Self Hosted Agent Installation Procedure:

Purpose:

To run CI and CD pipelines through a secure VNET, we need to set up a VM (connected to a VNET/subnet) as a self hosted agent in Azure DevOps.

 

Installation Procedure:

  1. Log on via Bastion to the VM on which you want to install the self hosted agent
    • Akshay_Attota_6-1644394159919.png

       

  2. Within the Virtual Machine, navigate to the Azure DevOps website using a web browser and log in.
  3. Create a new agent pool or use an existing pool
    • Navigate to Project settings > Agent pools within Azure DevOps to create a new pool or view the existing pool that you want to use
    • Akshay_Attota_7-1644394268375.png

       

    • Akshay_Attota_8-1644394317918.png

       

    • Follow the add pool dialogue to create a new pool if needed
    • Akshay_Attota_9-1644394373079.png

       

  4. Within the pool you want to use, navigate to the agents tab and select new agent.
    • Akshay_Attota_10-1644394420466.png

       

  5. Follow the instructions outlined here: https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/v2-windows?view=azure-devops 
    • This will show you how to create a personal access token (PAT) to authenticate the agent (the token is used only once, when the agent is registered, and never again)
    • Make sure to install the agent to run as a service

 

Self Hosted Agent Dependencies for Pipelines:

Because the agent is installed on a brand-new, blank Windows image, other dependencies/packages need to be installed on the virtual machine for our CI/CD pipelines to run.

Example:

Let's consider an example scenario that we need to install some modules like below 

Here is a list of packages to install and where to install them.

Adding Self Hosted Agents in Our CI-CD YAML Deployment Pipelines:

• The code below shows how to run your release stage on a specific self hosted agent (connected to the VNET):

- Take note of the pool and demands configuration.

In the case below, we are deploying Databricks notebooks.

 

 

 

 

 

stages:
- stage: Release
  displayName: Release stage

  jobs:
  - deployment: DeployDatabricks
    displayName: Deploy Databricks Notebooks
    pool:
      name: DataPool
      demands:
        - agent.name -equals vm-ado
    environment: Data-SANDBOX

 

 

 

 

 

 

• In the continuous deployment pipeline, we must deploy the build artifacts to the Dev, QA, UAT and Prod environments. Approval gates are set up so that the deployment to each environment (stage) starts only after approval.
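
To make the stage chaining concrete, here is a minimal Python sketch that generates one release stage per environment, each pinned to the same self hosted agent and each depending on the previous stage. The dictionary keys mirror the YAML keys of the release stage shown earlier. The stage names, the Data-<environment> naming and the placeholder echo step are assumptions for illustration. The approvals themselves are configured as checks on each Azure DevOps environment, not in the YAML.

```python
import json
from typing import Dict, List


def release_stages(environments: List[str], pool: str, agent: str) -> List[Dict]:
    """Build one deployment stage per environment, chained with dependsOn."""
    stages: List[Dict] = []
    previous = None
    for env in environments:
        stage = {
            "stage": f"Release_{env}",
            "displayName": f"Release to {env}",
            # The first stage has no dependency; later ones wait for the previous stage.
            "dependsOn": [previous] if previous else [],
            "jobs": [{
                "deployment": f"Deploy_{env}",
                # Pin the job to the self hosted agent inside the VNET.
                "pool": {"name": pool, "demands": [f"agent.name -equals {agent}"]},
                # Approval checks attached to this environment gate the deployment.
                "environment": f"Data-{env}",
                "strategy": {"runOnce": {"deploy": {"steps": [
                    {"script": f"echo Deploying to {env}"},  # placeholder step
                ]}}},
            }],
        }
        stages.append(stage)
        previous = stage["stage"]
    return stages


if __name__ == "__main__":
    pipeline = {"stages": release_stages(["Dev", "QA", "UAT", "Prod"], "DataPool", "vm-ado")}
    print(json.dumps(pipeline, indent=2))
```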

Last update: Feb 11 2022 04:04 AM