Introduction: Delta files are a way to efficiently store data in data lake by storing only the changes made to the data. Its built-on parquet files, it enables you to update, insert and delete your data easily in delta lake. the updateability is the main advantage of delta format. When we create delta files in data lake, you will notice that we will get _delta_logfolder which indicates which files are valid and updated, this enables versioning which is a powerful feature of delta files also we will get parquet files as mentioned above, because delta files are built on parquet like so:
Main Idea: The main idea of this blog is to create a simple ETL using delta files. 1. Convert raw data (csv format) into delta format using Azure Synapse Pipelines.
2. Query delta files using SQL serverless pool. 3. Build Dashboard using PowerBI.
A Storage account with hierarchical structure (ADLS)
Azure Synapse workspace.
Metadata - A sample CSV file.
A basic knowledge in Azure.
Unfortunately, we can't use the Copy activity to transform .csv file to delta file format it is only supported in Dataflows at the moment.
Convert .csv file into Delta file using Azure Synapse:
Open Synapse studio, click on Integrate tab, add new pipeline with a name "ConvertCSVToDeltaFile"
In Activities, drag Dataflow activity to the pipeline ConvertCSVToDeltaFile , in dataflow activity, in General tab, change the name to dataflow to "ConvertCsvToDeltaFileDF".
in Dataflow activity, in settings, click on "+New" sign to create a new Dataflow, name the dataflow: "ConvertCsvToDeltaFileDF".
in Dataflow, add your source (CSV file) and in source settings point to your source (if it's in a blob storage you need to specify where it is located which container/folder etc.)
add a sink to your Dataflow, sink type: inline, dataset type: Delta, Linked service: your ADLS account , under sink settings tab, specify where do you want to save your delta files in your ADLS. PS: for the sake of this blog, I created a folder with a name "DeltaFiles" in my ADLS, we will save our delta files in it and use it later. Sink type:
Now in delta lake, you should see delta files as mentioned above.
Step 2: Query delta files using SQL serverless pool, in order to do it, you need to follow these steps:
Add your Storage account (ADLS) to Synapse azure workspace: on the left side, click on Data tab -> plus sign -> connect to external data -> add your ADLS account like so:
my container name is "testsally" and delta files are located under folder "DeltaFiles", click on DeltaFiles folder and select one of the parquet files, clok on New SQL script -> select Top 100 rows-> run script:
PS: you can also load the data into DF and query it in Spark Pool.
Build Dashboard using Power BI Integration in Azure Synapse, for now, we have in Synapse linked ADLS which includes metadata (delta files), and we would like to build a dashboard.
In order to connect Power BI to Synapse, we need to connect to a specific workspace in Power BI, in Power BI website-> click on Workspaces -> create new workspace -> name it "synapseDemoDeltaFiles"
In Synapse -> click on Manage -> Linked services -> PowerBI -> connect to workspace synapseDemoDeltaFiles and name the PowerBI linked service: "powerBIDemoDeltaFiles"
in Power BI desktop -> Get Data -> click on Azure Synapse Analytics SQL -> connect to your SQL data -> Load data, after this is done you should see in Model tab on the left side our table "metadataDeltaFiles"
Click on Report tab on the left side -> add a Table chart -> on the right side chose columns from your metadata table, you should now see loaded data into a chart in Power BI.