Digital Transformation will leverage massive data volumes to improve both the quality and cost of healthcare…doesn’t that sound easy? Large volumes of data and complex logic have long made healthcare analytics a challenge. Power BI has added new capabilities to address these requirements, designed to work with powerful databases such as those in Azure, and they can help healthcare analytics teams achieve more. This article is the first in a series exploring how Power BI paired with Azure data tools creates a flexible, scalable, and achievable healthcare analytics architecture:
#1 - Unleash Massive Healthcare Data Volumes to Analytics using Power BI Aggregations (This Article)
Healthcare reporting and analytics often require massive data volumes, with tables that are both exceptionally wide and deep. Architectural concepts such as Large Type 2 Dimension tables, Many-to-Many Relationships, and Secondary Key Relationships are rare exceptions in most Business Intelligence designs yet are common in Healthcare. Numerous reference tables are often required, and they can carry complex structures of their own due to ever-changing categorization systems such as DRGs, ICDs, CPTs, and others. Add calculation logic such as qualifying visits, LOS calculations, complications, and cohort inclusion/exclusion criteria, and the challenge becomes daunting. Successful solution design requires deep healthcare industry expertise along with powerful tools.
Have you been held back by data volume limitations in Power BI? Have your Import Models hit the 1 GB (Pro) and 10 GB (Premium) dataset limits, while your Direct Query Models demand too many resources from the source databases? Power BI’s new Aggregations feature, combined with Composite Models, can now help you unlock massive volumes of data for analytics.
One of the biggest challenges in Business Intelligence has always been organizing large volumes of data so that they remain usable within the limits of the available technology. An ideal target architecture would allow data to be stored in one place and queried by users who interact with a purely logical metadata layer. Relational Databases, ROLAP Models, and most recently Direct Query Tabular Models all achieve this goal, but query complexity and user concurrency often create performance bottlenecks due to technical limitations of the source databases. Here’s a simple diagram of a Direct Query Model for a tool such as Power BI:
Cached data models have been an effective workaround since the early days of OLAP/MOLAP technologies. Import Mode Tabular Models in SQL Server Analysis Services, Excel Power Pivot, and Power BI likewise leverage a cache of data to enable high query concurrency with complex logic. Import (Cached) Models deliver great performance, but caching extremely large volumes of data is not always practical or scalable:
Power BI now offers Composite Models, which combine the best of both traditional options by allowing Import and Direct Query tables in the same Model. Tables with extremely large volumes of data or near-real-time requirements can be in Direct Query mode, while tables that are suited to a cache can be in Import or Dual mode:
The three basic types of Power BI Models can be visualized as follows:
When querying an extremely large table of data, Direct Query mode can send multiple queries to the source database with every click of a filter on a report. With complex queries and high user concurrency, the demands placed on the source database can be extreme. Power BI Aggregations offer a new way to seamlessly blend an Imported Summary Table and a Direct Query Detail Table within a Composite Model, in an architecture that stays transparent both to DAX calculation authors and to the end users viewing reports. With Aggregations, the Import Mode Summary Table is queried whenever logically possible, and the full Direct Query version of the table is queried only when details not present in the Summary Table are required. Consider the following scenario. Notice that with Aggregations there is no need to use Direct Query against the source for Query #1:
What about scenarios where details are needed that do not exist in the Aggregation? For Query #2 below, a query is sent via Direct Query mode since Procedures do not exist in the Aggregation:
If we were to use a logical model to explain what’s happening with Aggregations, it would look like this:
Are Aggregations just a fancy new version of a Summary table? Yes and no. With traditional Summary tables, users would either need to select separate calculations for Summary and Detail reporting, or programmatically switch between the two tables using tricky MDX or DAX code. Power BI Aggregations are hidden from end users, so a simple DAX expression written against the Patient Visit Table (such as Total Charges = SUM('Patient Visit Table'[Total Charges])) will automatically check whether the Aggregation table can answer the query first, and will only send a Direct Query when it needs detail unavailable in the Aggregation, such as a filter on Procedures. The examples provided in this article would most likely be part of a more detailed architectural design, such as the one below:
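To make that behavior concrete, here is a minimal DAX sketch of the pattern. The 'Patient Visit Table' and [Total Charges] column come from the example above; the 'Procedure' table, its [CPT Code] column, and the measure names are hypothetical placeholders rather than objects from an actual model:

```dax
-- Measure authored against the Direct Query detail table. When a matching
-- Aggregation exists, the engine answers it from the Import-mode cache
-- whenever the current filter context allows.
Total Charges = SUM ( 'Patient Visit Table'[Total Charges] )

-- The same logic filtered by a column that the Aggregation does not cover
-- (a hypothetical Procedure dimension) silently falls back to Direct Query
-- against the source database.
Knee Replacement Charges =
CALCULATE (
    [Total Charges],
    'Procedure'[CPT Code] = "27447"  -- CPT 27447: total knee arthroplasty
)
```

Report authors write the second measure exactly as they would the first; the routing between the in-memory cache and the source is handled entirely by the engine.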
How does this impact the size of the Model published to Power BI? By reducing the volume of data in the cache, the size of the Model is greatly reduced. Here are the results from a test I ran on real healthcare data from the CMS Medicare Part D database:
| Model | Columns in Storage | Row Count | Size |
| --- | --- | --- | --- |
| Relational DB in Azure SQL DB | Physician, Specialty, City, State, Drug | 90,000,000 | 35.5 GB |
| Power BI Import Model | All | 90,000,000 | 1.8 GB |
| Power BI Composite Model w/ Aggs | Specialty, State, Drug | 750,000 | 35 MB |
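For context, the 750,000-row aggregation above is simply the 90-million-row detail table pre-summarized at the Specialty / State / Drug grain. In practice the aggregation table would be built as a summary table or view in Azure SQL DB and imported, but the hypothetical DAX query below (table and column names are assumptions, not taken from the actual CMS model) illustrates the shape of that grain:

```dax
-- Hypothetical query showing the grain of the Import-mode aggregation:
-- one row per Specialty / State / Drug, pre-aggregated from ~90M detail rows.
EVALUATE
SUMMARIZECOLUMNS (
    'Prescriber'[Specialty],
    'Prescriber'[State],
    'Drug'[Drug Name],
    "Total Claims",    SUM ( 'Part D Detail'[Claim Count] ),
    "Total Drug Cost", SUM ( 'Part D Detail'[Total Drug Cost] )
)
```

Any visual that groups or filters only by those three columns can be answered entirely from the 35 MB cache.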
Here's a video that discusses the Model mentioned above along with a demo:
In the example above, queries were lightning fast unless filter criteria for specific physicians or cities were selected. Only when that level of detail was needed would a query get sent back to the Azure SQL DB.
Let’s take this a step further! Aggregations can be Imported Summary tables, but they can also be in Dual mode or Direct Query mode. Direct Query Aggregations can effectively be used as indexes to optimize the queries that are sent back to a source. Multiple Aggregations can also be added to the same source table, and Precedence can be assigned so that the most efficient ones are evaluated first, with additional Aggregations tried before a true Direct Query is sent to the source:
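As a rough sketch of how such a setup might behave, assume two aggregation tables over the same Direct Query detail table, with Precedence values assigned in the Manage aggregations dialog; every name, grain, and value below is hypothetical:

```dax
-- Hypothetical aggregation setup over 'Part D Detail' (Direct Query):
--   Agg_Specialty_State_Drug   Import mode,       Precedence = 10  (smallest, tried first)
--   Agg_Physician_Drug         Direct Query mode, Precedence = 0   (acts like an index on the source)

-- Base measure, authored once against the detail table.
Total Drug Cost = SUM ( 'Part D Detail'[Total Drug Cost] )

-- State is covered by the Import aggregation, so this query never leaves the cache.
Cost by State = CALCULATE ( [Total Drug Cost], 'Prescriber'[State] = "TN" )

-- Physician is missing from the Import aggregation, so the engine tries the
-- Direct Query aggregation next and only falls back to the full detail table
-- if no aggregation can satisfy the query.
Cost for One Physician =
CALCULATE ( [Total Drug Cost], 'Prescriber'[Physician Name] = "Smith, John" )  -- hypothetical value
```

The effect is that the most compact aggregation is always consulted first, and a scan of the full detail table becomes the last resort rather than the default.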
In summary, Power BI Aggregations expand the possibilities for building models on larger data volumes while still allowing for great query performance and high user concurrency. Any questions or comments? Please leave a comment or reach out to me on Twitter or LinkedIn.