Forum Discussion
AjayGopu
May 10, 2024
Copper Contributor
How to Achieve a Couple of Functionalities in Azure Data Factory Dynamically
Hi Team, I have the below scenarios as part of my business requirement. These requirements have to be achieved dynamically using Azure Data Factory Data Flows or Pipelines. Note: Requirement is not...
Kidd_Ip
Feb 04, 2025
MVP
Here's how you could approach this in Azure Data Factory (a JSON sketch of the pipeline skeleton follows the outline below):
- Create a Pipeline:
- Create a new pipeline in Azure Data Factory.
- Add Activities to Pipeline:
- Get Metadata Activity: Use this activity to list all the files in the source blob folder.
- ForEach Activity: Loop through each file using the output of the Get Metadata activity.
- Inside the ForEach Activity:
- Copy Activity: Copy the file content from the source blob to a staging area in Azure Data Lake Storage Gen2.
- Data Flow Activity: Add a Data Flow activity to process the CSV files and perform validations.
- Create a Data Flow:
- Source Transformation: Add a source transformation to read the CSV file from the staging area.
- Surrogate Key Transformation: Add a surrogate key transformation to generate a row number for each row (a derived column cannot generate row numbers on its own).
- Conditional Split Transformation: Add a conditional split transformation to separate valid and invalid rows.
- Valid Rows: Rows that meet the criteria.
- Invalid Rows: Rows that contain invalid data.
- Sink Transformation: Add two sink transformations:
- Valid Rows Sink: Write the valid rows to a Parquet file in the destination folder.
- Invalid Rows Sink: Write the invalid rows to a separate CSV file in another blob storage folder.
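A minimal JSON sketch of the pipeline skeleton, assuming hypothetical names (GetFileList stands in for the Get Metadata activity, ForEachFile for the ForEach) and a SourceBlobFolder dataset pointing at the source container; adjust these to match your own datasets:

```json
{
  "name": "ProcessCsvFiles",
  "properties": {
    "activities": [
      {
        "name": "GetFileList",
        "type": "GetMetadata",
        "typeProperties": {
          "dataset": { "referenceName": "SourceBlobFolder", "type": "DatasetReference" },
          "fieldList": [ "childItems" ]
        }
      },
      {
        "name": "ForEachFile",
        "type": "ForEach",
        "dependsOn": [ { "activity": "GetFileList", "dependencyConditions": [ "Succeeded" ] } ],
        "typeProperties": {
          "items": { "value": "@activity('GetFileList').output.childItems", "type": "Expression" },
          "activities": [ ]
        }
      }
    ]
  }
}
```

The Copy and Execute Data Flow activities from the outline go inside the ForEach activity's activities array; a sketch of those inner activities follows the pipeline example below.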
Example Pipeline and Data Flow:
Pipeline
- Get Metadata Activity:
- Configure the activity to list all the files in the source blob folder.
- Output: List of file names.
- ForEach Activity:
- Items: @activity('Get Metadata Activity').output.childItems
- Inside the ForEach activity, add the following activities:
- Copy Activity:
- Source: Source blob storage.
- Sink: Staging area in Azure Data Lake Storage Gen2.
- Data Flow Activity:
- Parameters: Pass the file name to the data flow.
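Inside the ForEach activity, the two inner activities could look roughly like the sketch below. The dataset names (SourceBlobCsv, StagingAdlsCsv), the fileName dataset parameter, and the data flow name ValidateCsvDataFlow are all placeholders; the datasetParameters block assumes the data flow's staging dataset exposes a fileName parameter:

```json
[
  {
    "name": "CopyToStaging",
    "type": "Copy",
    "inputs": [
      {
        "referenceName": "SourceBlobCsv",
        "type": "DatasetReference",
        "parameters": { "fileName": { "value": "@item().name", "type": "Expression" } }
      }
    ],
    "outputs": [
      {
        "referenceName": "StagingAdlsCsv",
        "type": "DatasetReference",
        "parameters": { "fileName": { "value": "@item().name", "type": "Expression" } }
      }
    ],
    "typeProperties": {
      "source": { "type": "DelimitedTextSource" },
      "sink": { "type": "DelimitedTextSink" }
    }
  },
  {
    "name": "ValidateAndSplit",
    "type": "ExecuteDataFlow",
    "dependsOn": [ { "activity": "CopyToStaging", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
      "dataflow": {
        "referenceName": "ValidateCsvDataFlow",
        "type": "DataFlowReference",
        "datasetParameters": {
          "stagedCsv": { "fileName": { "value": "@item().name", "type": "Expression" } }
        }
      }
    }
  }
]
```

If you prefer, the file name can instead be passed as a data flow parameter on the activity's Parameters tab and referenced in the source transformation.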
Data Flow
- Source Transformation:
- Source: Staging area in Azure Data Lake Storage Gen2.
- Options: Enable "First row as header" and set the column delimiter to a comma.
- Surrogate Key Transformation:
- Add a row number column, e.g. RowNumber starting at 1. (There is no rownum() function in the derived column expression language; a Window transformation with rowNumber() is an alternative if you need a specific ordering.)
- Conditional Split Transformation:
- Valid Rows Condition: length(toString(TestColumn1)) > 0 && length(toString(TestColumn2)) > 0 && length(toString(TestColumn3)) > 0
- Invalid Rows: rows that do not match the condition above fall through to the split's default output stream, so no separate negated condition is needed.
- Sink Transformation:
- Valid Rows Sink: Write to Parquet file.
- Invalid Rows Sink: Write to CSV file in another blob storage folder.
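For reference, the whole data flow can be expressed as JSON with a data flow script. This is only a rough sketch under a few assumptions: placeholder dataset names (StagingAdlsCsv, ValidParquet, InvalidCsv), a Surrogate Key transformation generating the row number, and the three TestColumn fields from the split condition:

```json
{
  "name": "ValidateCsvDataFlow",
  "properties": {
    "type": "MappingDataFlow",
    "typeProperties": {
      "sources": [
        { "name": "stagedCsv", "dataset": { "referenceName": "StagingAdlsCsv", "type": "DatasetReference" } }
      ],
      "sinks": [
        { "name": "sinkValid", "dataset": { "referenceName": "ValidParquet", "type": "DatasetReference" } },
        { "name": "sinkInvalid", "dataset": { "referenceName": "InvalidCsv", "type": "DatasetReference" } }
      ],
      "transformations": [
        { "name": "addRowNumber" },
        { "name": "splitValidity" }
      ],
      "scriptLines": [
        "source(allowSchemaDrift: true,",
        "     validateSchema: false) ~> stagedCsv",
        "stagedCsv keyGenerate(output(RowNumber as long),",
        "     startAt: 1L) ~> addRowNumber",
        "addRowNumber split(length(toString(TestColumn1)) > 0 && length(toString(TestColumn2)) > 0 && length(toString(TestColumn3)) > 0,",
        "     disjoint: false) ~> splitValidity@(validRows, invalidRows)",
        "validRows sink(allowSchemaDrift: true,",
        "     validateSchema: false) ~> sinkValid",
        "invalidRows sink(allowSchemaDrift: true,",
        "     validateSchema: false) ~> sinkInvalid"
      ]
    }
  }
}
```

Rows that fail the split condition fall through to the invalidRows stream and land in the CSV sink; the Parquet and CSV formats themselves come from the two sink datasets.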