Strengthen Delta Lake in Synapse with auto maintenance job
Published Feb 10 2023

Authors: @KranthiMedam, @InnovatorsClub, @dipakshaw

 

In every data engineering program, there is a need for upkeep on a Delta Lake. This blog presents a way to automate that maintenance process in Synapse Analytics. We introduce a utility that automatically detects all Delta Lake tables in a Synapse workspace and then runs “Optimize” and “Vacuum” operations on them. This process ensures the performance and size of a Delta Lake remain intact even with regular usage over time.

 


 

What is Delta Lake?

Delta Lake is an open-source storage framework that enables building a Lakehouse architecture. It provides several features, such as ACID transactions, schema enforcement, upsert and delete operations, unified stream and batch data processing, and time travel (data versioning), that are incredibly useful for analytics on Big Data. Check out the official documentation to learn more about Delta Lake.

 

Delta Lake has become increasingly popular and is available to use inside of a Synapse workspace. In Synapse, Delta Lake can not only be read and written in Apache Spark, but also read through a Serverless SQL pool.

 

Maintenance Needed for Delta Lakes

The useful features of Delta Lake come at the cost of requiring regular maintenance. Delta Lake needs periodic cleanup because it accumulates files over time with each upsert and retains previous snapshots of the data. These accumulated snapshots can quickly grow a small dataset (in MBs) to several GBs of storage, since deleted data is not actually removed but retained as an older snapshot of the Delta Lake dataset.

 

As the accumulated snapshot files grow in number, performance also degrades, since each read has to access and skim through many files in the Delta Lake. Delta Lake provides two built-in methods to preserve the performance and size of a dataset: the “Optimize” and “Vacuum” commands.

 

  • The “Optimize <table name>” command re-organizes active data in storage into a few files of optimal size. It compacts multiple smaller parquet files in a Delta Lake into larger parquet files, which improves the read speed of the dataset. While Optimize improves performance, it does not reduce the size of a Delta Lake dataset.
  • Reducing size is the job of the Vacuum command. The “Vacuum <table name>” command removes previous versions of Delta Lake files and retains recent data history up to a specified period. It helps reduce the overall size by removing older, unnecessary copies of data (see the example after this list).
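
A minimal sketch of running these two commands from a Synapse Spark notebook, assuming the built-in spark session and a runtime with Delta Lake 2.0 or later; the table name "sales_orders" and the 168-hour retention are illustrative:

    from delta.tables import DeltaTable

    # Compact many small parquet files into fewer, larger ones for faster reads.
    # The optimize() API requires Delta Lake 2.0 or later.
    table = DeltaTable.forName(spark, "sales_orders")  # illustrative table name
    table.optimize().executeCompaction()

    # Remove snapshot files older than the retention window (168 hours = 7 days).
    table.vacuum(168)

On runtimes where the Delta SQL commands are enabled, the same operations can also be issued as Spark SQL, e.g. OPTIMIZE sales_orders and VACUUM sales_orders RETAIN 168 HOURS.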

 

Scenario

Consider a 10 GB Delta Lake dataset that is overwritten once daily by an ETL process; within a month it accumulates to a size of about 300 GB. This is because Delta Lake creates new versions of files with updated data and does not delete the old files, leading to quick growth in storage size. If we run the Vacuum process on this dataset daily with a one-week retention period, we can restrict the Delta Lake size to about 70 GB while still allowing one week of time travel for analysis and troubleshooting purposes.
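
Spelled out, the arithmetic behind these numbers is simple back-of-the-envelope math:

    # Back-of-the-envelope storage growth for the scenario above
    snapshot_gb = 10                                   # size of one full daily overwrite
    days_in_month = 30
    without_vacuum = snapshot_gb * days_in_month       # 300 GB: every snapshot retained

    retention_days = 7                                 # one week of time travel
    with_daily_vacuum = snapshot_gb * retention_days   # 70 GB: older snapshots removed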

 

For these reasons, Delta Lake needs regular upkeep by running Optimize and Vacuum periodically. Typically, this is done by listing the various Delta Lake tables in a manually written script and scheduling it for execution. That script must be updated regularly to reflect any addition or removal of Delta datasets in the Synapse workspace, and any Delta Lake table missed from the list is bound to suffer reduced performance and gradual swelling in storage size.

 

Genie auto maintenance of Delta Lake

To avoid the above-mentioned hazard, we have developed the Delta Lake Auto Maintenance process for Synapse. This script scans a Synapse workspace, detects all the Delta Lake tables in it, and then automatically runs the Optimize and Vacuum commands on each detected table. The script considers the executor cores available on the Spark pool and runs the maintenance through multiple threads in parallel, ensuring the Auto Maintenance process runs efficiently on the provided Spark pool resources.
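
The actual script lives in the repository linked below; the following is only a simplified sketch of the idea, assuming the built-in spark session of a Synapse notebook and a runtime with Delta Lake 2.0 or later (the function names, retention window, and thread count are illustrative):

    from concurrent.futures import ThreadPoolExecutor
    from delta.tables import DeltaTable

    RETENTION_HOURS = 168   # one week of history; illustrative default
    MAX_WORKERS = 4         # tune to the executor cores of the Spark pool

    def find_delta_tables():
        """Scan every database in the workspace metastore, keep Delta tables."""
        found = []
        for db in spark.catalog.listDatabases():
            for tbl in spark.catalog.listTables(db.name):
                name = f"{db.name}.{tbl.name}"
                try:
                    DeltaTable.forName(spark, name)  # raises if not a Delta table
                    found.append(name)
                except Exception:
                    pass
        return found

    def maintain(name):
        """Compact small files, then drop snapshots past the retention window."""
        tbl = DeltaTable.forName(spark, name)
        tbl.optimize().executeCompaction()
        tbl.vacuum(RETENTION_HOURS)

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        list(pool.map(maintain, find_delta_tables()))

Running the per-table work on separate threads lets several Optimize and Vacuum jobs be queued against the Spark pool at once, which is what keeps the process efficient on larger workspaces.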

 

The script provides the following configuration options to set (sketched after this list):

  • Data retention time for Vacuum
  • Blocklist databases and tables to avoid maintenance
  • Criteria to select only Delta Lake tables with a minimum number of updates
  • Degree of parallelism in executing tasks
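
As an illustration only, such options could be surfaced as a simple parameter cell at the top of the notebook; the names below are hypothetical and may differ from those in the actual Genie notebook:

    # Hypothetical parameter cell; actual names in the Genie notebook may differ
    config = {
        "retention_hours": 168,                  # Vacuum retention window (7 days)
        "blocklist_databases": ["staging_db"],   # databases to skip entirely
        "blocklist_tables": ["audit.events"],    # individual tables to skip
        "min_updates": 5,                        # maintain only tables updated at least this often
        "parallelism": 4,                        # concurrent maintenance threads
    }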

 

Benefits of auto maintenance on Delta Lake

  • Auto Optimize operation ensures the performance of Delta Lake stays healthy and efficient
  • Auto Vacuum operation prevents bloating of storage size and its associated cost
  • Removes manual effort to create and update scripts for Delta Lake maintenance
  • Efficiently executes Optimize and Vacuum operations based on the provided Spark pool resources
  • Provides a straightforward way to configure parameters such as data retention time, data versions, etc.
  • Options to blocklist certain tables from auto maintenance so they can keep a different audit/time-travel window as required

 

Setup and run Delta Lake auto maintenance process

The whole auto maintenance process is available as a script in a notebook in the Synapse Genie GitHub repository.

  • The notebook can be directly uploaded to any Synapse workspace and run
  • It is recommended to embed the notebook in a Synapse pipeline and set up a trigger to run it weekly during low usage periods
  • Expected runtime on a sample workspace with about 430 Delta tables, a combined data size of 1 TB, and a Spark pool of three small nodes is about 50 minutes

Note: runtime varies with the number and size of Delta tables and the Spark pool used.

 

Genie Delta Lake Auto Maintenance scripts

This utility was developed by the HR Data Insights team within Microsoft’s Digital Employee Experience organization. It provides a reusable and effortless way to clean up older versions of your datasets and maintain the size and performance of your Delta Lake.

 

Conclusion

Delta Lake is quite useful for persisting data to storage and provides ACID properties that are helpful for analytics and reporting scenarios. However, this comes at the cost of requiring regular maintenance through Optimize and Vacuum operations.

 

The Genie Delta Lake Auto Maintenance scripts provide an automated process that handles this overhead with the least effort and ensures it is executed efficiently. Once the Auto Maintenance job is scheduled in a Synapse workspace, it removes the need for manual updates and keeps the Delta Lake healthy.

 

Appendix

Code Repository: Genie Delta Lake Auto Maintain

Synapse Resource Optimization: Genie Execution Framework

Spark Optimization: Optimize Spark in Synapse

 
