Efficient Data Partitioning with Microsoft Fabric: Best Practices and Implementation Guide
Published Aug 15 2023

Introduction:
Data partitioning is a data management technique used to divide a large dataset into smaller, more manageable subsets called partitions or shards. Each partition contains a portion of the data, and these partitions can be stored and processed independently. The primary goal of data partitioning is to improve performance, scalability, and manageability in large-scale data systems.
Imagine having a vast encyclopedia, and your task is to find a specific page containing a particular word. Instead of searching through the entire encyclopedia, you opt to navigate directly to the relevant section that might contain the desired information. This approach allows you to narrow down your search scope and quickly find the page you need. This concept of narrowing down the search space by dividing data into manageable sections is precisely what partitioning entails.
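To make the analogy concrete, here is a minimal PySpark sketch (the toy data and the /tmp output path are illustrative, not part of this post's dataset): writing with partitionBy creates one folder per distinct value, and a query that filters on the partition column only scans the matching folder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A tiny DataFrame with a column we will partition by.
df = spark.createDataFrame(
    [("trip-1", 2015), ("trip-2", 2016), ("trip-3", 2016)],
    ["trip_id", "year"],
)

# Produces one folder per year: .../partition_demo/year=2015, .../partition_demo/year=2016
df.write.mode("overwrite").partitionBy("year").parquet("/tmp/partition_demo")

# This read only scans the year=2016 folder (partition pruning).
spark.read.parquet("/tmp/partition_demo").filter("year = 2016").show()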

[Image: Sally_Dabbah_0-1692091014080.png]



In this blog, I will present a proposed method for data partitioning using Fabric notebooks.

[Image: Sally_Dabbah_0-1692091479642.png]

 


Part 1: Prepare Data for Partitioning
In this blog, I'm using the taxi data provided at this link: taxi_data | Kaggle.
I downloaded it and am using the file yellow_tripdata_2016-03.csv.
Since Fabric does not support uploading on-premises files larger than 1 GB, and my file is 2 GB, I uploaded the file to my ADLS account and linked my ADLS container as a shortcut in my Fabric lakehouse. Check the documentation here:
Options to get data into the Lakehouse - Microsoft Fabric | Microsoft Learn
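Once the shortcut is in place, a quick sanity check from a notebook helps (a sketch, assuming the shortcut appears under the same Tables/ADLS-TestContainer path used later in this post):

from notebookutils import mssparkutils

# List the shortcut's contents to confirm the CSV is visible from the lakehouse.
for f in mssparkutils.fs.ls("Tables/ADLS-TestContainer"):
    print(f.name, f.size)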

Part 2: Create a new notebook and load data into it

Navigate through the following steps:

1. Click on your workspace.
2. Select "create," then choose "notebook."
3. Inside your notebook, locate your lakehouse on the left side.
4. Navigate to your desired file within the lakehouse.
5. Load the chosen file into your notebook using the provided options (a sketch of the resulting load code is shown below).
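For step 5, Fabric notebooks provide a ready-made spark session and can generate the load snippet for you; a minimal equivalent (a sketch, assuming the shortcut path used in this post) looks like this:

# Load the CSV from the lakehouse shortcut and peek at the schema.
df = spark.read.format("csv").option("header", "true").load(
    "Tables/ADLS-TestContainer/yellow_tripdata_2016-03.csv"
)
df.printSchema()  # confirm columns such as tpep_pickup_datetime are present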

 

[Screenshot: Sally_Dabbah_0-1692087687839.png]

 

Part 3: Transform and partition data.
Our task now is to extract the year, month, and day from the pickup timestamp. Afterward, we'll save the data as Parquet files, partitioned by those columns, in the designated destination.

from pyspark.sql.functions import col, to_timestamp, date_format
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load data from the lakehouse into the notebook
df = spark.read.format("csv").option("header", "true").load("Tables/ADLS-TestContainer/yellow_tripdata_2016-03.csv")

output_folder = 'output_partitioning'
# Click on your lakehouse -> Copy ABFS path -> append the output folder to it.
output_path = f'abfss://31e03aae-13c0-4c9b-a9a7-8941b7b73c98@msit-onelake.dfs.fabric.microsoft.com/8384d9c6-f10c-413e-a3d6-21b90a7f76f2/Tables/ADLS-TestContainer/{output_folder}'

# Add new columns to the dataframe - year, month, and day - derived from the pickup timestamp
df_output = (
    df.withColumn("tpep_pickup_datetime", to_timestamp(col("tpep_pickup_datetime")))
      .withColumn("year", date_format(col("tpep_pickup_datetime"), "yyyy"))
      .withColumn("month", date_format(col("tpep_pickup_datetime"), "MM"))
      .withColumn("day", date_format(col("tpep_pickup_datetime"), "dd"))
)
df_output.select('year', 'month', 'day').show()

# Write data into the lakehouse using partitioning
df_output.write.mode("overwrite").partitionBy('year', 'month', 'day').parquet(output_path)
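
To see the benefit, you can read the partitioned output back with a filter on the partition columns (a sketch, reusing the spark session and output_path from above); Spark prunes to the matching year=/month=/day= folders instead of scanning everything, which shows up under PartitionFilters in the physical plan.

# Read back only one day's worth of data via partition pruning.
df_day = spark.read.parquet(output_path).filter("year = 2016 AND month = 3 AND day = 1")
df_day.explain()  # the plan lists the partition filters that were pushed down
print(df_day.count())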


Part 4: View the output within the lakehouse.
1. Refresh your lakehouse - in the explorer on the left side of your notebook.
2. Click on "Refresh".
3. View the files in your destination; in my case, this was the output:

[Screenshot: Sally_Dabbah_0-1692090395759.png]
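You can also verify the layout from code (a sketch, reusing the output_path variable from Part 3):

from notebookutils import mssparkutils

# List the top-level partition folders created by the write.
for entry in mssparkutils.fs.ls(output_path):
    print(entry.name)  # expect year=2016, plus a _SUCCESS marker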

 

Since we linked our ADLS container to our lakehouse as a shortcut, if you open your ADLS storage account, you can see the files there as well:

 

[Screenshot: Sally_Dabbah_0-1692090597451.png]

Links:
Microsoft Fabric documentation - Microsoft Fabric | Microsoft Learn 
Create a workspace - Microsoft Fabric | Microsoft Learn 
Develop, execute, and manage notebooks - Microsoft Fabric | Microsoft Learn


Call-To-Action:

- Make sure to establish all connections before starting to work in Fabric.
- Check the links mentioned above.
- Follow me for more tutorials on LinkedIn: Sally Dabbah | LinkedIn

