Blog Post

Azure Integration Services Blog
7 MIN READ

Automated Secure Infrastructure for Self Hosted Integration Runtime in Azure Data Factory Terraform

joshuamurari's avatar
joshuamurari
Icon for Microsoft rankMicrosoft
Mar 27, 2023

Purpose/Summary

This purpose of this project is to document how to securly automate a self hosted integration runtime (SHIR) for Azure Data Factory using terraform.

 

According to Microsoft Documentation , a self-hosted integration runtime can run copy activities between a cloud data store and a data store in a private network. It also can dispatch transform activities against compute resources in an on-premises network or an Azure virtual network. The installation of a self-hosted integration runtime needs an on-premises machine or a virtual machine inside a private network.

 

Automating this process requires an installtion script that will be used on a preferred virtual machine. Storing this script in a secure location and allowing your deployment to access this script becomes difficult in a completely private and secure Azure Data Factory environment. The technique used here is to upload the script to a secure storage account and allow the virtual machine private access to download the script from that same storage account


Pre-Requesites

  • Azure subscription and contributor rights
  • You should already have a remote terraform backend set up. Follow this process  if you need help.
  • Azure DevOps Pipelines with a corresponding service connection set up to your Azure Subscription. Follow this  for help.
  • Self hosted Azure DevOps agent that is connected to your secure environment. NOTE: this is only required if you are using a fully private VNET and locked down enviornment/storage account.
    • These agents are used to run the terraform deployment to your secure enviornment. Because they can access your private network, they can safely run the deployment to your resources. Running an initial deployment on an Azure Public agent could work for the inital setup, but will not work for any subsequent runs as the Public agent will not have access to your secure enviornment. 

 

Workflow Diagram

Here you can see that an Azure DevOps Pipeline is run on a Self Hosted DevOps Agent that is connected to the secure VNET of the target enviornment. This pipeline runs a Terraform deployment to set up a secure VNET, Private Storage Account, Virtual Machine (for the SHIR), and an Azure Data Factory.

Walkthrough and Set Up Video

 

Infrastructure Instructions

Sample powershell installation script

This script will install the SHIR gateway from a virtual machine to a specific Azure Data Factory. The only prarmeter needed here is a authentication key for the gateway.

 

param(
 [string]
 $gatewayKey
)

# init log setting
$logLoc = "$env:SystemDrive\WindowsAzure\Logs\Plugins\Microsoft.Compute.CustomScriptExtension\"
if (! (Test-Path($logLoc))) {
	New-Item -path $logLoc -type directory -Force
}
$logPath = "$logLoc\tracelog.log"
"Start to excute gatewayInstall.ps1. `n" | Out-File $logPath

function Now-Value() {
	return (Get-Date -Format "yyyy-MM-dd HH:mm:ss")
}

function Throw-Error([string] $msg) {
	try {
		throw $msg
	} 
	catch {
		$stack = $_.ScriptStackTrace
		Trace-Log "DMDTTP is failed: $msg`nStack:`n$stack"
	}

	throw $msg
}

function Trace-Log([string] $msg) {
	$now = Now-Value
	try {
		"${now} $msg`n" | Out-File $logPath -Append
	}
	catch {
		#ignore any exception during trace
	}

}

function Run-Process([string] $process, [string] $arguments) {
	Write-Verbose "Run-Process: $process $arguments"
	
	$errorFile = "$env:tmp\tmp$pid.err"
	$outFile = "$env:tmp\tmp$pid.out"
	"" | Out-File $outFile
	"" | Out-File $errorFile	

	$errVariable = ""

	if ([string]::IsNullOrEmpty($arguments)) {
		$proc = Start-Process -FilePath $process -Wait -Passthru -NoNewWindow `
			-RedirectStandardError $errorFile -RedirectStandardOutput $outFile -ErrorVariable errVariable
	}
	else {
		$proc = Start-Process -FilePath $process -ArgumentList $arguments -Wait -Passthru -NoNewWindow `
			-RedirectStandardError $errorFile -RedirectStandardOutput $outFile -ErrorVariable errVariable
	}
	
	$errContent = [string] (Get-Content -Path $errorFile -Delimiter "!!!DoesNotExist!!!")
	$outContent = [string] (Get-Content -Path $outFile -Delimiter "!!!DoesNotExist!!!")

	Remove-Item $errorFile
	Remove-Item $outFile

	if ($proc.ExitCode -ne 0 -or $errVariable -ne "") {		
		Throw-Error "Failed to run process: exitCode=$($proc.ExitCode), errVariable=$errVariable, errContent=$errContent, outContent=$outContent."
	}

	Trace-Log "Run-Process: ExitCode=$($proc.ExitCode), output=$outContent"

	if ([string]::IsNullOrEmpty($outContent)) {
		return $outContent
	}

	return $outContent.Trim()
}

function Download-Gateway([string] $url, [string] $gwPath) {
	try {
		$ErrorActionPreference = "Stop";
		$client = New-Object System.Net.WebClient
		$client.DownloadFile($url, $gwPath)
		Trace-Log "Download gateway successfully. Gateway loc: $gwPath"
	}
	catch {
		Trace-Log "Fail to download gateway msi"
		Trace-Log $_.Exception.ToString()
		throw
	}
}

function Install-Gateway([string] $gwPath) {
	if ([string]::IsNullOrEmpty($gwPath)) {
		Throw-Error "Gateway path is not specified"
	}

	if (!(Test-Path -Path $gwPath)) {
		Throw-Error "Invalid gateway path: $gwPath"
	}
	
	Trace-Log "Start Gateway installation"
	Run-Process "msiexec.exe" "/i gateway.msi INSTALLTYPE=AzureTemplate /quiet /norestart"		
	
	Start-Sleep -Seconds 30	

	Trace-Log "Installation of gateway is successful"
}

function Get-RegistryProperty([string] $keyPath, [string] $property) {
	Trace-Log "Get-RegistryProperty: Get $property from $keyPath"
	if (! (Test-Path $keyPath)) {
		Trace-Log "Get-RegistryProperty: $keyPath does not exist"
	}

	$keyReg = Get-Item $keyPath
	if (! ($keyReg.Property -contains $property)) {
		Trace-Log "Get-RegistryProperty: $property does not exist"
		return ""
	}

	return $keyReg.GetValue($property)
}

function Get-InstalledFilePath() {
	$filePath = Get-RegistryProperty "hklm:\Software\Microsoft\DataTransfer\DataManagementGateway\ConfigurationManager" "DiacmdPath"
	if ([string]::IsNullOrEmpty($filePath)) {
		Throw-Error "Get-InstalledFilePath: Cannot find installed File Path"
	}
	Trace-Log "Gateway installation file: $filePath"

	return $filePath
}

function Register-Gateway([string] $instanceKey) {
	Trace-Log "Register Agent"
	$filePath = Get-InstalledFilePath
	Run-Process $filePath "-era 8060"
	Run-Process $filePath "-k $instanceKey"
	Trace-Log "Agent registration is successful!"
}



if ((Get-Process "diahost" -ea SilentlyContinue) -eq $Null) {
	Trace-Log "Integration Runtime is not running. Initiating Download - Install - Register sequence.";
	
	Trace-Log "Log file: $logLoc"
	$uri = "https://go.microsoft.com/fwlink/?linkid=839822"
	Trace-Log "Gateway download fw link: $uri"
	$gwPath = "$PWD\gateway.msi"
	Trace-Log "Gateway download location: $gwPath"
	
	
	Download-Gateway $uri $gwPath
	Install-Gateway $gwPath
	
	Register-Gateway $gatewayKey

}
else {
	Trace-Log "Integration Runtime is already running. Skipping installation & configuration.";
};

 


Walthrough of Terraform Code

Set up Azure Data Factory and Self Hosted Integration Runtime
resource "azurerm_data_factory" "adf" {
  name                = "adf-poc-${random_string.random.result}"
  location            = "East US"
  resource_group_name = azurerm_resource_group.rg.name
}

resource "azurerm_data_factory_integration_runtime_self_hosted" "shir" {
  name                = "adf-poc-shir"
  resource_group_name = azurerm_resource_group.rg.name
  data_factory_id     = azurerm_data_factory.adf.id
}

The second resource here is needed to create the self hosted integration runtime within Azure data factory and is also used because this resource exposes the authentication key  to create a gateway and connect a virtual machine to this runtime. This is the gateway key that we will pass into our powershell script.

 

Set up secure storage account to store script
  • The following code shows setting up the storage account. Then setting up a private container within the storage account. And lastly uploading the installtion script as a blob object to the container.
#Storage account
resource "azurerm_storage_account" "storageaccount" {
  name                     = "shirst${random_string.random.result}"
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  account_kind             = "StorageV2"
  min_tls_version          = "TLS1_2"

  blob_properties {
    cors_rule {
      allowed_headers    = ["*"]
      allowed_methods    = ["DELETE", "GET", "HEAD", "MERGE", "POST", "OPTIONS", "PUT", "PATCH"]
      allowed_origins    = ["*"]
      exposed_headers    = ["*"]
      max_age_in_seconds = 200
    }
  }
}

#Storage container and blob
resource "azurerm_storage_container" "newcontainer" {
  name                  = "shir-script"
  storage_account_name  = azurerm_storage_account.storageaccount.name
  container_access_type = "private"
}
resource "azurerm_storage_blob" "newblob" {
  name                   = "adf-shir.ps1"
  storage_account_name   = azurerm_storage_account.storageaccount.name
  storage_container_name = azurerm_storage_container.newcontainer.name
  type                   = "Block"
  access_tier            = "Cool"
  source                 = "../gatewayinstall.ps1"
}
Securely Lock down storage account network to make it private
  • This only allows resources within the same private network of your environment to access this storage account
#Storage network rules
resource "azurerm_storage_account_network_rules" "storageaccountnetworkrules" {
  resource_group_name  = azurerm_resource_group.rg.name
  storage_account_name = azurerm_storage_account.storageaccount.name

  default_action             = "Deny"
  ip_rules                   = []
  virtual_network_subnet_ids = []
  bypass                     = ["Metrics", "Logging", "AzureServices"]
  depends_on = [
    azurerm_storage_blob.newblob
  ]
}

 

Allow network access to storage account via private endpoints
  • This allows a secure private ip address for communication to and from this storage account
#--------Storage Account Private Endpoints and DNS A Records--------#
#DFS
resource "azurerm_private_endpoint" "pe_000" {
  name                = "${azurerm_storage_account.storageaccount.name}-dfs"
  location            = var.location
  resource_group_name = azurerm_resource_group.rg.name
  subnet_id           = azurerm_subnet.pe.id

  private_service_connection {
    name                           = "${azurerm_storage_account.storageaccount.name}-connection"
    private_connection_resource_id = azurerm_storage_account.storageaccount.id
    is_manual_connection           = false
    subresource_names              = ["dfs"]
  }

  private_dns_zone_group {
    name                 = azurerm_private_dns_zone.dfs_privatednszone.name
    private_dns_zone_ids = [azurerm_private_dns_zone.dfs_privatednszone.id]
  }
}

resource "azurerm_private_dns_a_record" "privatednsarecord-000" {
  name                = azurerm_private_endpoint.pe_000.name
  zone_name           = azurerm_private_dns_zone.dfs_privatednszone.name
  resource_group_name = azurerm_resource_group.rg.name
  ttl                 = "300"
  records             = [azurerm_private_endpoint.pe_000.private_service_connection.0.private_ip_address]
  depends_on          = [azurerm_private_endpoint.pe_000]
}

#Blob
resource "azurerm_private_endpoint" "pe_001" {
  name                = "${azurerm_storage_account.storageaccount.name}-blob"
  location            = var.location
  resource_group_name = azurerm_resource_group.rg.name
  subnet_id           = azurerm_subnet.pe.id

  private_service_connection {
    name                           = "${azurerm_storage_account.storageaccount.name}-connection"
    private_connection_resource_id = azurerm_storage_account.storageaccount.id
    is_manual_connection           = false
    subresource_names              = ["blob"]
  }

  private_dns_zone_group {
    name                 = azurerm_private_dns_zone.blob_privatednszone.name
    private_dns_zone_ids = [azurerm_private_dns_zone.blob_privatednszone.id]
  }
}

resource "azurerm_private_dns_a_record" "privatednsarecord-001" {
  name                = azurerm_private_endpoint.pe_001.name
  zone_name           = azurerm_private_dns_zone.blob_privatednszone.name
  resource_group_name = azurerm_resource_group.rg.name
  ttl                 = "300"
  records             = [azurerm_private_endpoint.pe_001.private_service_connection.0.private_ip_address]
  depends_on          = [azurerm_private_endpoint.pe_001]
}

 

Create virtual machine for self hosted integration runtime
  • This windows machine uses Microsoft's default 2019 image.
    NOTE: The admin password is exposed below for testing and deployment purposes. For a secure production ready deployment, this password should only be referenced via an Azure Key Vault injection.
#Windows VM
resource "azurerm_windows_virtual_machine" "main" {
  name                  = "shir-vm-${random_string.random.result}"
  location              = var.location
  resource_group_name   = azurerm_resource_group.rg.name
  network_interface_ids = [azurerm_network_interface.nic.id]
  size               = "Standard_B1s"
  admin_username = "testadmin"
  admin_password = "Password1234!"
  source_image_reference {
    publisher = "MicrosoftWindowsServer"
    offer     = "WindowsServer"
    sku       = "2019-Datacenter"
    version   = "latest"
  }
  os_disk {
    name              = "myosdisk1"
    caching           = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  identity {
    type = "SystemAssigned"
  }
}
Create Virtual Machine extenstion to install Self Hosted Integration runtime Gateway Script
  • Here the VM extension will reference the script stored in our secure storage account to install on the machine.
  • This connection is secure because the VM is communicating to the storage account via our private endpoints in the same secure enviornment as our other resources.
#VM Custom Script Extension
resource "azurerm_virtual_machine_extension" "vmextension-0000" {
  name                       = "ADF-SHIR"
  virtual_machine_id         = azurerm_windows_virtual_machine.main.id
  publisher                  = "Microsoft.Compute"
  type                       = "CustomScriptExtension"
  type_handler_version       = "1.10"
  auto_upgrade_minor_version = true

  protected_settings = <<PROTECTED_SETTINGS
      {
          "fileUris": ["${format("https://%s.blob.core.windows.net/%s/%s", azurerm_storage_account.storageaccount.name, azurerm_storage_container.newcontainer.name, azurerm_storage_blob.newblob.name)}"],
          "commandToExecute": "${join(" ", ["powershell.exe -ExecutionPolicy Unrestricted -File",azurerm_storage_blob.newblob.name,"-gatewayKey ${azurerm_data_factory_integration_runtime_self_hosted.shir.auth_key_1}"])}",
          "storageAccountName": "${azurerm_storage_account.storageaccount.name}",
          "storageAccountKey": "${azurerm_storage_account.storageaccount.primary_access_key}"
      }
  PROTECTED_SETTINGS
}

 

Updated Sep 23, 2022
Version 1.0