This article reviews a common pattern for streaming data (that is, real-time message ingestion) into a Synapse SQL dedicated pool. It opens with the simple, standard way to implement this, along with its challenges and drawbacks. It then presents an alternative solution that enables optimal performance and greatly reduces maintenance tasks when using clustered columnstore indexes. It is aimed at developers, DBAs, architects, and anyone who works with streams of data captured in real time.
Scenario description
The scenario is as follows: messages produced by sensors (simulated by Azure Container Instances) are sent to an Event Hub, then processed by an Azure Stream Analytics (ASA) job that writes its output into a Synapse table named “rawdata”. At this stage there are two options:
- query the “rawdata” table directly.
- use “rawdata” as a staging table and build a pipeline that processes the data every ten minutes and stores it in a final fact table named “rawdata_fact”. The pipeline runs an UPSERT statement so that existing records are updated (in case they are resent) and new records are inserted (a sketch of the tables involved follows this list).
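For reference, here is a minimal sketch of what the staging table could look like. The column list is taken from the MERGE statement shown later in this article, while the data types and the distribution key are assumptions:

CREATE TABLE dbo.rawdata
(
    EventId              UNIQUEIDENTIFIER NOT NULL,
    [Type]               VARCHAR(20)      NOT NULL,  -- sensor reading type, e.g. temperature or CO2 (assumed)
    DeviceId             VARCHAR(100)     NOT NULL,
    DeviceSequenceNumber BIGINT           NOT NULL,
    CreatedAt            DATETIME2        NOT NULL,
    [Value]              DECIMAL(18, 5)   NOT NULL,
    ComplexData          NVARCHAR(4000)   NULL,
    EnqueuedAt           DATETIME2        NULL,
    ProcessedAt          DATETIME2        NULL,
    StoredAt             DATETIME2        NULL,
    PartitionId          INT              NULL
)
WITH
(
    -- Hash distribution spreads the inserts across all 60 distributions.
    -- The fact table "rawdata_fact" uses the same column list.
    DISTRIBUTION = HASH(DeviceId),
    CLUSTERED COLUMNSTORE INDEX
);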
General solution design architecture
The building blocks of the current solution can be found at the following link.
This sample deploys, within one resource group:
- Container instances simulating IoT devices sending temperature and CO2 readings.
- An Event Hub ingesting the messages in real time.
- A Stream Analytics job preprocessing the messages and redirecting them into the Synapse SQL dedicated pool.
How to handle the ingested data
If the ASA job loads data directly into the fact table, and assuming that table has a clustered columnstore index, the following happens:
- no delta store compression can take place, because rows are constantly being inserted into the table.
- simultaneous COPY INTO statements issued by the ASA job can trigger the creation of new delta stores, which reduces query performance and/or requires index maintenance tasks.
The illustration below shows this:
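When a table has accumulated many open or undersized rowgroups this way, the typical maintenance task is an index rebuild, which recompresses everything into optimally sized rowgroups. A minimal sketch against this article's fact table:

-- Rebuild all indexes on the fact table, forcing delta stores and
-- undersized rowgroups into optimally compressed rowgroups.
ALTER INDEX ALL ON dbo.rawdata_fact REBUILD;

Note that rebuilding a columnstore index is an offline operation in a dedicated pool, which is exactly the kind of maintenance the staging approach below avoids.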
If instead the ASA job loads the telemetry data into the staging table named “rawdata” and the fact table is then upserted, the following happens:
- there is one open delta store per distribution, where the insertions land, and no closed rowgroups; all the other rowgroups are compressed.
- the tuple mover on each compute node ensures that, once the delta stores fill up, they are converted into compressed rowgroups.
The illustration below shows this:
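You can verify the rowgroup states described above using the dedicated pool DMVs. The following sketch aggregates rowgroup counts per compute node and distribution across all columnstore tables in the database (joining through sys.pdw_nodes_tables would scope it to a single table):

-- Count rowgroups and rows by state (OPEN = delta store, CLOSED,
-- COMPRESSED) on each compute node and distribution.
SELECT pdw_node_id,
       distribution_id,
       state_desc,
       COUNT(*)        AS rowgroups,
       SUM(total_rows) AS total_rows
FROM sys.dm_pdw_nodes_db_column_store_row_group_physical_stats
GROUP BY pdw_node_id, distribution_id, state_desc
ORDER BY pdw_node_id, distribution_id, state_desc;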
The upsert processing consists of:
- A simple pipeline with a stored procedure activity.
{
    "name": "processrawdata",
    "type": "SqlPoolStoredProcedure",
    "dependsOn": [],
    "policy": {
        "timeout": "7.00:00:00",
        "retry": 0,
        "retryIntervalInSeconds": 30,
        "secureOutput": false,
        "secureInput": false
    },
    "userProperties": [],
    "sqlPool": {
        "referenceName": "sqlpool2",
        "type": "SqlPoolReference"
    },
    "typeProperties": {
        "storedProcedureName": "[dbo].[processrawdata]"
    }
}
- A tumbling window trigger that runs the pipeline every 10 minutes.
{
    "name": "tumblingwindow10minutes",
    "properties": {
        "annotations": [],
        "runtimeState": "Stopped",
        "pipeline": {
            "pipelineReference": {
                "referenceName": "processrawdata",
                "type": "PipelineReference"
            }
        },
        "type": "TumblingWindowTrigger",
        "typeProperties": {
            "frequency": "Minute",
            "interval": 10,
            "startTime": "2022-01-14T12:58:00Z",
            "delay": "00:00:00",
            "maxConcurrency": 50,
            "retryPolicy": {
                "intervalInSeconds": 30
            },
            "dependsOn": []
        }
    }
}
- A stored procedure scheduled to update the fact table from the staging table:
ALTER PROC [dbo].[processrawdata] AS
BEGIN
BEGIN TRANSACTION
-- Lock the staging table before its rows are inserted into or used to update the fact table, to avoid inconsistencies.
-- The WHERE clause matches no rows; the UPDATE is there only to take the lock.
UPDATE rawdata SET PartitionId = 0 WHERE 1 = 0
-- Now the fact table update can take place, and the staging data can be deleted once processed.
MERGE dbo.rawdata_fact AS macible
USING dbo.rawdata AS masource
ON (macible.eventId = masource.eventId
and macible.[Type]=masource.[Type]
and macible.DeviceId=masource.DeviceId
and macible.DeviceSequenceNumber=masource.DeviceSequenceNumber
and macible.CreatedAt=masource.CreatedAt)
WHEN MATCHED
THEN UPDATE SET
macible.EventId=masource.EventId,
macible.[Type]=masource.[Type],
macible.DeviceId=masource.DeviceId,
macible.DeviceSequenceNumber=masource.DeviceSequenceNumber,
macible.CreatedAt=masource.CreatedAt,
macible.[Value]=masource.[Value],
macible.ComplexData=masource.ComplexData,
macible.EnqueuedAt=masource.EnqueuedAt,
macible.ProcessedAt=masource.ProcessedAt,
macible.StoredAt=masource.StoredAt
WHEN NOT MATCHED BY TARGET THEN
INSERT ([EventId]
,[Type]
,[DeviceId]
,[DeviceSequenceNumber]
,[CreatedAt]
,[Value]
,[ComplexData]
,[EnqueuedAt]
,[ProcessedAt]
,[StoredAt]
,[PartitionId])
VALUES
(masource.[EventId]
,masource.[Type]
,masource.[DeviceId]
,masource.[DeviceSequenceNumber]
,masource.[CreatedAt]
,masource.[Value]
,masource.[ComplexData]
,masource.[EnqueuedAt]
,masource.[ProcessedAt]
,masource.[StoredAt]
,masource.[PartitionId]);
DELETE FROM rawdata;
COMMIT TRANSACTION;
END
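To confirm that the UPDATE statement at the top of the procedure really serializes access to the staging table, you can watch lock waits while the procedure runs. A sketch using the dedicated pool lock-wait DMV, assuming you have the VIEW DATABASE STATE permission:

-- Sessions currently waiting on locks, with the object they wait for.
SELECT session_id,
       type AS lock_type,
       object_name,
       request_time,
       state
FROM sys.dm_pdw_lock_waits
ORDER BY request_time;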
We can then create a simple view on top of both tables, so that we see the latest data as well as the processed data:
CREATE VIEW [dbo].[v_rawdata]
AS SELECT * FROM rawdata
UNION ALL
SELECT * FROM rawdata_fact;
And monitor the average temperature and CO2 values in near real time:
SELECT [type], AVG([value]) AS average
FROM v_rawdata
GROUP BY [type];
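To restrict the average to the most recent readings only, the same view can be filtered on one of its timestamp columns. A sketch assuming EnqueuedAt holds the Event Hub enqueue time in UTC:

-- Average per sensor type over the last 10 minutes of data.
SELECT [type],
       AVG([value]) AS average
FROM dbo.v_rawdata
WHERE EnqueuedAt >= DATEADD(MINUTE, -10, GETUTCDATE())
GROUP BY [type];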
Takeaways
We can stream data directly from Stream Analytics into a Synapse SQL dedicated pool, and we can avoid clustered columnstore index maintenance by using an intermediate staging table and a scheduled pipeline.