Authors: @KranthiMedam, @InnovatorsClub, @dipakshaw
In every data engineering program, there is a need for regular upkeep of a Delta Lake. This blog presents a way to automate that maintenance process in Synapse Analytics. We introduce a utility that automatically detects all Delta Lake tables in a Synapse workspace and then runs the "Optimize" and "Vacuum" operations on them. This process ensures that the performance and storage size of a Delta Lake remain intact even with regular usage over time.
Delta Lake is an open-source storage framework that enables building a Lakehouse Architecture. Delta Lake provides several features such as ACID transactions, schema enforcement, upsert and delete operations, unified stream and batch data processing, and time travel (data versioning) that are incredibly useful for analytics on Big Data. Check out the official documentation to learn more on Delta Lake.
Delta Lake has become increasingly popular and is available to use inside of a Synapse workspace. In Synapse, Delta Lake can not only be read and written in Apache Spark, but also read through a Serverless SQL pool.
The useful features of Delta Lake come at the cost of requiring regular maintenance. With each upsert, Delta Lake accumulates files and retains previous snapshots of the data, so it needs periodic cleanup. These retained snapshots can quickly grow a small dataset (a few MB) into several GB of storage, because deleted data is not actually removed but kept as part of an older snapshot of the Delta Lake dataset.
As the number of accumulated snapshot files grows, read performance also degrades, since each read has to access and scan many files in the Delta Lake. To preserve the performance and size of a dataset, Delta Lake provides two built-in commands: "Optimize" and "Vacuum".
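As a minimal sketch of what these two commands look like, the snippet below builds them as Spark SQL strings; the table name and retention window are illustrative examples, not values taken from the utility:

```python
# Illustrative sketch of the two Delta Lake maintenance commands as
# Spark SQL strings; the table name and retention window are examples.
table = "lakehouse_db.sales"   # hypothetical Delta table
retention_hours = 7 * 24       # keep one week of history for time travel

optimize_sql = f"OPTIMIZE {table}"  # compacts many small files into fewer larger ones
vacuum_sql = f"VACUUM {table} RETAIN {retention_hours} HOURS"  # removes older snapshots

# In a Synapse Spark notebook these would be executed with, e.g.:
#   spark.sql(optimize_sql)
#   spark.sql(vacuum_sql)
print(optimize_sql)
print(vacuum_sql)
```

Note that Vacuum only removes snapshot files older than the retention window, so time travel within that window remains available after cleanup.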
Consider a Delta Lake dataset of 10 GB, if overwritten once daily by an ETL process, within a month it accumulates to a size of 300 GB. This is because Delta Lake creates new versions of files with updated data and does not delete old files, leading to quick growth in data storage size. If we run the Vacuum process on this dataset daily, we can restrict the Delta Lake size to 70 GB. This helps restrict the Delta Lake size to a reasonable degree while allowing time travel of one week for analysis and troubleshooting purposes.
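The storage arithmetic in the scenario above can be verified with a quick back-of-the-envelope calculation (a 10 GB dataset fully overwritten once per day, with and without a daily Vacuum using a 7-day retention window):

```python
# Back-of-the-envelope check of the storage-growth scenario above:
# a 10 GB dataset fully overwritten once per day.
snapshot_gb = 10
days = 30

# Without Vacuum, every daily snapshot is retained.
no_vacuum_gb = snapshot_gb * days  # 300 GB after a month

# With a daily Vacuum and a 7-day retention window, only roughly
# one week of snapshots survives at any time.
retained_days = 7
with_vacuum_gb = snapshot_gb * retained_days  # ~70 GB

print(no_vacuum_gb, with_vacuum_gb)  # 300 70
```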
For these reasons, Delta Lake needs regular upkeep by running Optimize and Vacuum periodically. Typically, this is done by listing the various Delta Lake tables in a manually written script and scheduling it for execution. Such a script needs to be updated regularly to reflect any addition or removal of Delta datasets in the Synapse workspace, and any Delta Lake table missed from the list will suffer degraded performance and gradual growth in storage size.
To avoid the above-mentioned hazard, we have developed the Delta Lake Auto Maintenance process for Synapse. This script scans a Synapse workspace and detects all the Delta Lake tables in it, then runs the Optimize and Vacuum commands on each detected table. The script considers the executor cores available on the Spark pool and runs the Optimize process across multiple threads in parallel. This ensures the Auto Maintenance process runs efficiently with the provided Spark pool resources.
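The overall shape of such a process, detecting Delta tables and then running maintenance across them in parallel, can be sketched in plain Python. The directory walk and the worker function below are simplified stand-ins for illustration, not the utility's actual code:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def find_delta_tables(root: str) -> list[str]:
    """Detect Delta tables by looking for their _delta_log directory."""
    tables = []
    for dirpath, dirnames, _ in os.walk(root):
        if "_delta_log" in dirnames:
            tables.append(dirpath)
            dirnames.clear()  # do not descend further into a detected table
    return tables

def maintain(table_path: str) -> str:
    # Stand-in for running Optimize and Vacuum on the table, e.g. via
    # spark.sql(f"OPTIMIZE delta.`{table_path}`") in a Synapse notebook.
    return f"maintained {table_path}"

def run_auto_maintenance(root: str, executor_cores: int) -> list[str]:
    tables = find_delta_tables(root)
    # Size the thread pool to the available executor cores so maintenance
    # runs in parallel without oversubscribing the Spark pool.
    with ThreadPoolExecutor(max_workers=executor_cores) as pool:
        return list(pool.map(maintain, tables))
```

Detecting tables by their `_delta_log` directory means newly added datasets are picked up automatically on the next scheduled run, with no script changes.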
The script provides several configuration options to control the maintenance run; see the code repository for the full list.
The whole auto maintenance process is available as a notebook script in the Synapse Genie GitHub repository.
Note: run time varies with the number and size of the Delta tables and with the Spark pool used.
This utility is developed by the HR Data Insights team within Microsoft’s Digital Employee Experience organization. It provides a reusable and effortless way to clean up older versions of your datasets and maintain the size and performance of your Data Lake processing.
Delta Lake is quite useful for persisting data to storage and provides ACID properties that are helpful for analytics and reporting scenarios. However, this comes at the cost of requiring regular maintenance through Optimize and Vacuum scripts.
The Genie Delta Lake Auto Maintenance scripts automate this overhead so it is handled with minimal effort and executed efficiently. Once the Auto Maintenance job is scheduled in a Synapse workspace, it removes the need for manual script updates and keeps the Delta Lake healthy.
Code Repository: Genie Delta Lake Auto Maintain
Synapse Resource Optimization: Genie Execution Framework
Spark Optimization: Optimize Spark in Synapse