•This blog is useful when data component code – Azure Data Factory (ADF), Azure Databricks (ADB), and SQL pool – needs to be promoted to higher environments securely, without public internet access.
•For this use case, we have integrated the data components into virtual networks (VNets) and disabled public access.
•We have built the CI/CD deployment pipelines so that they can deploy only through a self-hosted agent that has access to the VNet.
This blog will guide you through setting up the data components securely, with a network diagram included.
The network architecture diagram below shows the Azure data components (Azure Data Factory, Azure Databricks, Azure Synapse) in secure virtual networks.
None of these components is accessible from the public internet, and they are connected to each other securely.
Virtual Networks and components in the Network Architecture Diagram:
•Enable the Databricks IP access list API so that the workspace accepts connections only from approved network ranges
•Configure the Databricks workspace with VNet injection and no public IP (NPIP) enabled
•Encrypt communication between Databricks nodes using a global init script:
dbutils.fs.put("dbfs:/<init-script-folder>/init/enable-encryption.sh", """
#!/bin/bash

keystore_file="$DB_HOME/keys/jetty_ssl_driver_keystore.jks"
keystore_password="gb1gQqZ9ZIHS"

# Use the SHA256 of the JKS keystore file as the SASL authentication secret
sasl_secret=$(sha256sum $keystore_file | cut -d' ' -f1)

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  driver_conf=${DB_HOME}/driver/conf/spark-branch.conf
  echo "Configuring driver conf at $driver_conf"
  if [ ! -e $driver_conf ] ; then
    touch $driver_conf
  fi
  # Open a [driver] block for the settings below; it is closed with "}" at the end
  echo " [driver] {" >> $driver_conf
  echo " // Authenticate" >> $driver_conf
  echo " \\"spark.authenticate\\" = true" >> $driver_conf
  echo " \\"spark.authenticate.secret\\" = \\"$sasl_secret\\"" >> $driver_conf
  echo " // Configure AES encryption" >> $driver_conf
  echo " \\"spark.network.crypto.enabled\\" = true" >> $driver_conf
  echo " \\"spark.network.crypto.saslFallback\\" = false" >> $driver_conf
  echo " // Configure SSL" >> $driver_conf
  echo " \\"spark.ssl.enabled\\" = true" >> $driver_conf
  echo " \\"spark.ssl.keyPassword\\" = \\"$keystore_password\\"" >> $driver_conf
  echo " \\"spark.ssl.keyStore\\" = \\"$keystore_file\\"" >> $driver_conf
  echo " \\"spark.ssl.keyStorePassword\\" = \\"$keystore_password\\"" >> $driver_conf
  echo " \\"spark.ssl.protocol\\" = \\"TLSv1.2\\"" >> $driver_conf
  echo " \\"spark.ssl.standalone.enabled\\" = true" >> $driver_conf
  echo " \\"spark.ssl.ui.enabled\\" = true" >> $driver_conf
  echo " }" >> $driver_conf
  echo "Successfully configured driver conf at $driver_conf"
fi

# Apply the same settings to all nodes via spark-defaults.conf
spark_defaults_conf="$DB_HOME/spark/conf/spark-defaults.conf"
echo "Configuring spark defaults conf at $spark_defaults_conf"
if [ ! -e $spark_defaults_conf ] ; then
  touch $spark_defaults_conf
fi
echo "spark.authenticate true" >> $spark_defaults_conf
echo "spark.authenticate.secret $sasl_secret" >> $spark_defaults_conf
echo "spark.network.crypto.enabled true" >> $spark_defaults_conf
echo "spark.network.crypto.saslFallback false" >> $spark_defaults_conf
echo "spark.ssl.enabled true" >> $spark_defaults_conf
echo "spark.ssl.keyPassword $keystore_password" >> $spark_defaults_conf
echo "spark.ssl.keyStore $keystore_file" >> $spark_defaults_conf
echo "spark.ssl.keyStorePassword $keystore_password" >> $spark_defaults_conf
echo "spark.ssl.protocol TLSv1.2" >> $spark_defaults_conf
echo "spark.ssl.standalone.enabled true" >> $spark_defaults_conf
echo "spark.ssl.ui.enabled true" >> $spark_defaults_conf
echo "Successfully configured spark defaults conf at $spark_defaults_conf"
""", True)
{
    "name": "ls_databricks",
    "properties": {
        "description": "Linked Service for connecting to Databricks",
        "annotations": [],
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://adb-XXXXX.net",
            "authentication": "MSI",
            "workspaceResourceId": "/subscriptions/XXXXXX/resourceGroups/rg-dev/providers/Microsoft.Databricks/workspaces/XXXXX",
            "newClusterNodeType": "Standard_DS4_v2",
            "newClusterNumOfWorker": "2:10",
            "newClusterSparkEnvVars": {
                "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
            },
            "newClusterVersion": "8.2.x-scala2.12",
            "newClusterInitScripts": []
        },
        "connectVia": {
            "referenceName": "selfHostedIr",
            "type": "IntegrationRuntimeReference"
        }
    }
}
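For reference, a Databricks Notebook activity in an ADF pipeline would point at this linked service as sketched below (the activity name and notebook path are placeholders, not part of the original setup):

```json
{
    "name": "RunNotebook",
    "type": "DatabricksNotebook",
    "typeProperties": {
        "notebookPath": "/Repos/<project>/notebooks/example"
    },
    "linkedServiceName": {
        "referenceName": "ls_databricks",
        "type": "LinkedServiceReference"
    }
}
```

Because the linked service is bound to the self-hosted integration runtime via connectVia, this activity executes against the workspace over the VNet rather than the public endpoint.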
Purpose:
To run CI and CD pipelines through a secure VNet, we need to install a VM (connected to a VNet/subnet) as a self-hosted agent in Azure DevOps.
Installation Procedure:
Because the agent will be installed on a brand-new, blank Windows image, other dependencies/packages need to be installed on the virtual machine for our CI/CD pipelines to run.
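Registering the VM as a self-hosted agent follows the standard Azure DevOps agent setup. A sketch in PowerShell, assuming an organization URL and PAT of your own; the agent package version is an assumption, and the pool/agent names match those used in this post:

```powershell
# Download and extract the Windows agent package (version is an assumption;
# check the Azure DevOps agent pool page for the current release)
Invoke-WebRequest -Uri https://vstsagentpackage.azureedge.net/agent/3.232.0/vsts-agent-win-x64-3.232.0.zip -OutFile agent.zip
Expand-Archive agent.zip -DestinationPath C:\agent
Set-Location C:\agent

# Register against the organization and run as a Windows service
.\config.cmd --unattended --url https://dev.azure.com/<your-org> --auth pat --token <your-pat> --pool DataPool --agent vm-ado --runAsService
```

Because the VM sits inside the VNet, the agent reaches Azure DevOps outbound only; no inbound public access to the VM is required.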
Example:
Let's consider a scenario where our pipelines need the following packages installed:
•sqlpackage – install it and add it to the system PATH so that Azure DevOps recognizes it as a capability of this machine
•PowerShell modules – install them with Install-Module 'name-of-module'
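As a sketch, the installation steps above might look like this in an elevated PowerShell session (the sqlpackage download link is Microsoft's redirect to the latest Windows build; the module name is a placeholder):

```powershell
# Install sqlpackage and put it on the system PATH so Azure DevOps
# picks it up as a machine capability
Invoke-WebRequest -Uri https://aka.ms/sqlpackage-windows -OutFile sqlpackage.zip
Expand-Archive sqlpackage.zip -DestinationPath C:\sqlpackage
[Environment]::SetEnvironmentVariable('Path', $env:Path + ';C:\sqlpackage', 'Machine')

# Install any PowerShell modules the pipelines need (placeholder name)
Install-Module 'name-of-module' -Scope AllUsers -Force
```

Restart the agent service after changing the PATH so the new capability is reported to Azure DevOps.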
Adding self-hosted agents to our CI/CD YAML deployment pipelines:
•The code below shows how to run your release on a specific self-hosted agent (connected to the VNet):
- Take note of the pool and demands configuration
In the case below, we are deploying Databricks notebooks.
stages:
- stage: Release
  displayName: Release stage
  jobs:
  - deployment: DeployDatabricks
    displayName: Deploy Databricks Notebooks
    pool:
      name: DataPool
      demands:
      - agent.name -equals vm-ado
    environment: Data-SANDBOX
•In the continuous deployment pipeline, we deploy the build artifacts to the Dev, QA, UAT, and Prod environments. We set up approval gates before the deployment to each environment (stage) starts.
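The multi-stage layout can be sketched as follows; approvals are configured as checks on each Azure DevOps environment (in the portal, not in YAML), and the stage and environment names here are assumptions:

```yaml
stages:
- stage: Dev
  jobs:
  - deployment: DeployDev
    pool:
      name: DataPool
    environment: Data-DEV   # approval checks configured on this environment
    strategy:
      runOnce:
        deploy:
          steps:
          - script: echo "deploy artifacts to Dev"
- stage: QA
  dependsOn: Dev
  jobs:
  - deployment: DeployQA
    pool:
      name: DataPool
    environment: Data-QA
    strategy:
      runOnce:
        deploy:
          steps:
          - script: echo "deploy artifacts to QA"
# UAT and Prod stages follow the same pattern
```

Because each stage targets a distinct environment, the run pauses at every stage boundary until the environment's approvers sign off.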