@description('Name of the Azure storage account that will contain the file we will split.')
param storageAccountName string = 'storage${uniqueString(resourceGroup().id)}'
resource storageAccount 'Microsoft.Storage/storageAccounts@2021-08-01' existing = {
  name: storageAccountName
}
resource dataFactoryLinkedService 'Microsoft.DataFactory/factories/linkedservices@2018-06-01' = {
  parent: dataFactory
  name: dataFactoryLinkedServiceName
  properties: {
    type: 'AzureBlobStorage'
    typeProperties: {
      connectionString: 'DefaultEndpointsProtocol=https;AccountName=${storageAccount.name};AccountKey=${storageAccount.listKeys().keys[0].value}'
    }
  }
}
Bicep code to create our Azure Data Factory Data Flow
Based on the following reference, “Microsoft.DataFactory factories/dataflows”, we will create the Azure Data Factory Data Flow that will split our file into multiple files.
@description('Name of the blob that will be split.')
param blobNameToSplit string = 'file.csv'
@description('Folder path of the blob that will be split.')
param blobFolderToSplit string = 'input'
@description('Folder path where the split files will be written.')
param blobOutputFolder string = 'output'
resource dataFactoryDataFlow 'Microsoft.DataFactory/factories/dataflows@2018-06-01' = {
  parent: dataFactory
  name: dataFactoryDataFlowName
  properties: {
    type: 'MappingDataFlow'
    typeProperties: {
      sources: [
        {
          linkedService: {
            referenceName: dataFactoryLinkedService.name
            type: 'LinkedServiceReference'
          }
          name: 'source'
          description: 'File to split'
        }
      ]
      sinks: [
        {
          linkedService: {
            referenceName: dataFactoryLinkedService.name
            type: 'LinkedServiceReference'
          }
          name: 'sink'
          description: 'Split data'
        }
      ]
      transformations: []
      scriptLines: [
        'source(useSchema: false,'
        ' allowSchemaDrift: true,'
        ' validateSchema: false,'
        ' ignoreNoFilesFound: false,'
        ' format: \'delimited\','
        ' container: \'${blobContainerName}\','
        ' folderPath: \'${blobFolderToSplit}\','
        ' fileName: \'${blobNameToSplit}\','
        ' columnDelimiter: \',\','
        ' escapeChar: \'\\\\\','
        ' quoteChar: \'\\\'\','
        ' columnNamesAsHeader: true) ~> source'
        'source sink(allowSchemaDrift: true,'
        ' validateSchema: false,'
        ' format: \'delimited\','
        ' container: \'${blobContainerName}\','
        ' folderPath: \'${blobOutputFolder}\','
        ' columnDelimiter: \',\','
        ' escapeChar: \'\\\\\','
        ' quoteChar: \'\\\'\','
        ' columnNamesAsHeader: true,'
        ' filePattern:(concat(\'${blobNameToSplit}\', toString(currentTimestamp(),\'yyyyMMddHHmmss\'),\'-[n].csv\')),'
        ' skipDuplicateMapInputs: true,'
        ' skipDuplicateMapOutputs: true,'
        ' partitionBy(\'${partitionType}\', ${numberOfPartition})) ~> sink'
      ]
    }
  }
}
Using the az deployment group what-if option, we can preview the following changes. This is really convenient to review the requested changes before applying them.
numberOfSplittedFiles=3
blobFolderToSplit="input"
blobNameToSplit="file.csv"
blobOutputFolder="output"
resourceGroupName=myDataFactoryResourceGroup
dataFactoryName=myDataFactoryName
storageAccountName=myStorageAccountName
blobContainerName=myStorageAccountContainerName
az deployment group what-if \
--resource-group $resourceGroupName \
--template-file data-factory-data-flow-split-file.bicep \
--parameters dataFactoryName=$dataFactoryName \
storageAccountName=$storageAccountName \
blobContainerName=$blobContainerName \
numberOfPartition=$numberOfSplittedFiles \
blobFolderToSplit=$blobFolderToSplit \
blobNameToSplit=$blobNameToSplit \
blobOutputFolder=$blobOutputFolder
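Once the preview looks good, the changes can be applied with az deployment group create. The following is a minimal sketch that simply reuses the same template and parameters as the what-if command above.
az deployment group create \
--resource-group $resourceGroupName \
--template-file data-factory-data-flow-split-file.bicep \
--parameters dataFactoryName=$dataFactoryName \
storageAccountName=$storageAccountName \
blobContainerName=$blobContainerName \
numberOfPartition=$numberOfSplittedFiles \
blobFolderToSplit=$blobFolderToSplit \
blobNameToSplit=$blobNameToSplit \
blobOutputFolder=$blobOutputFolder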
The Data Flow looks like the following screenshot, where we can see the number of partitions that will be created. In our context it corresponds to the number of CSV files that will be generated from our input CSV file.
The other trick here is to use a file name pattern to control the target file names.
In this sample, the output file names are built from the input file name, the current timestamp and the output file iteration.
concat('file.csv', toString(currentTimestamp(),'yyyyMMddHHmmss'),'-[n].csv')
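For example, assuming a hypothetical run timestamp of 20240101120000 and three partitions, this pattern would produce files named like file.csv20240101120000-1.csv, file.csv20240101120000-2.csv and file.csv20240101120000-3.csv in the output folder.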
Split the file through the Pipeline
Through the procedure located at https://github.com/JamesDLD/bicep-data-factory-data-flow-split-file, we have created an Azure Data Factory pipeline named “ArmtemplateSampleSplitFilePipeline”; you can trigger it to launch the Data Flow that will split the file. You can trigger it from the Azure Data Factory studio, or from the command line as sketched below.
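As a sketch, assuming the Azure CLI datafactory extension is installed (az extension add --name datafactory) and reusing the variables defined earlier, the pipeline run can be started like this:
# sketch: requires the Azure CLI datafactory extension
az datafactory pipeline create-run \
--resource-group $resourceGroupName \
--factory-name $dataFactoryName \
--name ArmtemplateSampleSplitFilePipeline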
The following screenshot illustrates the split result produced by the Azure Data Factory Data Flow.
Conclusion
Adopting Bicep, or any other Infrastructure as Code (IaC) tool, brings efficiency and agility: it is a real accelerator when designing infrastructures, and it makes them reproducible and testable.
See You in the Cloud
Jamesdld