Save money and increase performance with intelligent cache for Apache Spark in Azure Synapse
Published Mar 24, 2022

Data professionals can now save money and increase the overall performance of repeat queries in their Apache Spark workloads in Azure Synapse using the new intelligent cache, now in public preview. This feature lowers the total cost of ownership by improving performance on subsequent reads of cached files by up to 65% for Parquet files and up to 50% for CSV files.

How does this compare to the native Apache Spark cache?

Traditionally, when querying a file or table from your data lake, the Apache Spark engine in Synapse makes a call to your remote ADLS Gen2 storage for each read of the data. For workloads with frequent repeat queries, this process can be redundant and add latency to the overall processing time. Although Apache Spark provides a great caching feature, the cache must be set and released manually to minimize latency and improve overall performance, and it can serve stale data if the underlying files change. This is where the intelligent cache in Azure Synapse simplifies the process: it automatically detects changes to the underlying files and refreshes them in the cache, so you always have access to the most recent data. When the cache reaches its size limit, it automatically evicts the least-read data to make room for more recent data.
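For comparison, here is a minimal sketch of the manual approach with the native Spark cache; the storage account, container, and path are hypothetical, and spark is the session a Synapse notebook provides for you:

```python
# Native Apache Spark caching: the cache must be populated and released by hand,
# and it is not refreshed if the underlying files in ADLS Gen2 change.
# (Illustrative sketch only -- the storage account, container, and path are hypothetical.)

sales = spark.read.parquet(
    "abfss://data@contosostorage.dfs.core.windows.net/sales/2022/"
)

sales.cache()                                     # explicitly mark the DataFrame for caching
sales.groupBy("region").count().show()            # first action materializes the cache
sales.filter(sales["region"] == "West").count()   # later queries reuse the cached copy

sales.unpersist()                                 # must be released manually to free space
```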

How do I add a cache to my new or existing Spark pools?

The intelligent cache setting is found on the "Additional settings" tab when creating a new Apache Spark pool, and under "Scale settings" for existing Spark pools in Azure Synapse. The cache size is set as a percentage of the total disk size available to the Apache Spark pool. The cache is disabled by default; to enable it, simply move the slider from 0 (disabled) to the desired percentage. A minimum of 20% of available disk space is reserved for data shuffles, so for shuffle-intensive workloads you can minimize the cache size or disable the cache entirely. We recommend starting with a cache size of 50% and adjusting as necessary.

To try this out, open the "Additional settings" tab when creating a new Apache Spark pool:

[Screenshot: the intelligent cache slider on the "Additional settings" tab of the new Apache Spark pool creation experience]

Or under Settings > Scale settings for existing Spark pools:

[Screenshot: the intelligent cache slider under Settings > Scale settings for an existing Apache Spark pool]

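Once the cache is enabled on the pool, no code changes are required in your notebooks; repeat reads of the same files benefit automatically. A minimal sketch of such a workload, again with a hypothetical storage path:

```python
# With the intelligent cache enabled on the Spark pool, repeat reads of the same
# Parquet or CSV files are served from the cache automatically -- no cache() or
# unpersist() calls -- and the cache refreshes itself when the underlying files change.
# (Illustrative sketch only -- the storage account, container, and path are hypothetical.)

sales = spark.read.parquet(
    "abfss://data@contosostorage.dfs.core.windows.net/sales/2022/"
)

# First query reads from remote ADLS Gen2 storage and populates the cache.
sales.groupBy("region").sum("amount").show()

# Subsequent queries over the same files are served from the cache.
sales.filter(sales["region"] == "West").count()
```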
For more details, explore our documentation here.

New to Azure Synapse? Check out these resources to learn more:
