To Star or not to Star, More about data modeling in Power BI using ADX
Published Dec 05 2022 04:20 AM 3,335 Views
Microsoft

To Star or not to Star

More about data modeling in Power BI using ADX

 

Why more

 

I wrote in the past about best practices in data modeling in PBI using ADX in Direct Query.

I want to share my latest thoughts and recommendations.

I learned some new things about how Power BI behaves in certain cases, and it could make a big difference in performance.

Our goals are:

To star or not to star

 

Using Star Schemas

 

When I first looked at the issue of data modeling with ADX, I did not recommend creating a star schema with relations.

The reason was that the join queries created by default in Power BI are not optimized. (Understatement)

I’ve heard that some people were scared from tis recommendation because they were told again and again that star schema is the best model for Power BI

We took care of this problem as described at Return of the Power BI join with Kusto - Microsoft Community Hub

So now, using star schemas is perfectly fine as long as you specify all dimension tables as such

 

 

 

Single table

 

Using a single table with all data from dimensions already joined in is still a good option.

If you can enrich the main table with all necessary data during ingestion, this will be a very efficient approach.

Even well optimized joins are slower than no joins at all in query time.

It is especially recommended if you have very large (Millions) of items in a dimension.

You can add during ingestion columns form such very large dimensions while adding smaller dimensions to the model.

Some things you cannot do with a single table, for example if you have a line chart and you want to show items with no data in the chart, you can’t do it without having a dimension table.

If you don’t have sales on a specific day, this day will not show as zero in the chart.

 

What to avoid – semi star schema

What I mean by semi star schema, is a model in which there are dimensions and fact(s), but some non-value columns from the fact table are used in the visuals.

In the attached file, you can see two tables that use the same measure.

The measure is not very complex, but it is using an if statement.

As I explained here  I used DAX variables in the measure to avoid mentioning the same measure multiple times and triggering multiple summarize statements.

The only summarize query for both tables is: (I made the query simpler to read):

["Products"]

| project ["ColorName"],["ProductKey"]

| where ["ColorName"] == "Silver"

| join hint.strategy=broadcast kind=inner (["SalesFact"]

| project ["SalesAmount"],["DateKey"],["ProductKey"]

|  on $left.[“ProductKey"] == $right.[“ProductKey"]

| join hint.strategy=broadcast kind=inner  kind=inner (["Dates"]

|  on $left.["t0.DateKey"] == $right.["t2.Datekey"]

| summarize ["a0"]=sum(["t0.SalesAmount"]) by ["t2.Datekey"]

| limit 1000001

 

There is an additional query send for each table which is a dimension query to bring all values from the dimension , in thiscase it is all dates.

It differs between the tables

Where the datekey column is coming from the date table the query is:

 

["Dates"]

| summarize  by ["Datekey"]

| limit 1000001

 

In the case that the datekey is coming from the fact the query is :

 

["SalesFact"]

| summarize  by ["DateKey"]

| limit 1000001

 

In both cases the query returns all dates , but in the second case it does it from the fact which is typically much bigger.

Also notice that both queries do not use the filter on color so the entire fact table is scanned to return the list of dates.

Any filters not coming from columns on the same table will be ignored.

This difference can be very significant with large fact tables.

In the attached file you can see the difference using performance analyzer but it is not very big because the fact table has only a few million rows.

The extra dimension query can be optimized away sometimes with simple measures but it will be send in many cases.

Even in a case that a column has two values, and it seems unnecessary to create a dimension for it, don’t use the column from the fact in any position except values / measures.

Every I included here is equally valid for other sources in Direct Query mode.

I tested it against SQL Server and the results are the same.

 

 

 

 

 

 

Co-Authors
Version history
Last update:
‎Dec 05 2022 04:20 AM
Updated by: