Part of the complexity of this case was that there were many unrelated timeouts across multiple applications, and Power BI data refreshes were running all day long, so we had a needle-in-a-haystack dilemma.
Some timestamps in the Power BI logs are presented in the local time zone, while others are displayed in UTC. Be certain of the time zone of the particular timestamp you are researching so that you have correct facts to continue the investigation.
In this support case, our customer didn’t manually code the queries. They relied on Power BI Desktop to automatically build the queries that they needed for their dataset, so we didn’t have the query text available to match against a query hash and track down the executions via Query Store.
Fortunately, Power BI Desktop has a built-in tracing mechanism.
After you have configured tracing, you can perform a manual refresh with Power BI Desktop and then proceed to inspect the log files.
By default, these files are located in the %LOCALAPPDATA%\Microsoft\Power BI Desktop\Traces folder.
The log files contain the queries that Power BI sent to the database, as well as the command timeout Power BI set on each query execution. In my testing, I observed that the default is 10 minutes; however, you can override it in Power BI Desktop using the advanced options.
Using the PowerShell script below, I extracted all the queries from the log files into a single text file.
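A minimal sketch of such an extraction script follows. It assumes the SQL statements appear inline in the trace entries; the match pattern and output path are assumptions you may need to adjust for your trace format:

```powershell
# Hedged sketch: gather lines that look like SQL from the Power BI Desktop trace files.
# The 'select' match string and the output file name are assumptions; adjust as needed.
$traceDir = Join-Path $env:LOCALAPPDATA 'Microsoft\Power BI Desktop\Traces'
Get-ChildItem -Path $traceDir -Recurse -File |
    Select-String -Pattern 'select' -SimpleMatch |
    ForEach-Object { $_.Line } |
    Set-Content -Path "$env:TEMP\powerbi_queries.txt"
```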
The queries are straightforward: mostly they retrieve rows from each table that was selected for the dataset, as well as metadata attributes of the schema (tables, columns, indexes, etc.).
After getting the query text, we generated an estimated query plan for each one and inspected the Properties window to get the query hash generated by the SQL engine.
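You can also request the estimated plan in T-SQL; the QueryHash attribute appears on the StmtSimple element of the returned plan XML. The table name below is a placeholder for one of the extracted Power BI queries:

```sql
-- Return the estimated plan as XML instead of executing the statement;
-- the QueryHash attribute is found on the StmtSimple element of the plan.
SET SHOWPLAN_XML ON;
GO
SELECT * FROM dbo.SomeTable;  -- placeholder for the extracted Power BI query
GO
SET SHOWPLAN_XML OFF;
GO
```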
Using the query hash, we reviewed sys.query_store_runtime_stats, checking for timeouts. Interestingly, the queries timing out were only extracting metadata information, not the actual data from the user-defined tables. Reviewing sys.query_store_wait_stats, we saw that the main wait was on a lock.
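A sketch of those Query Store lookups, assuming the hash value is the one copied from the plan properties (the 0x… value below is a placeholder):

```sql
-- Find the Query Store entries for a given query hash and check for aborted
-- executions (execution_type 3 = aborted by the client, e.g. a command timeout).
DECLARE @hash binary(8) = 0x0000000000000000;  -- placeholder: paste the real query hash

SELECT q.query_id, p.plan_id, rs.execution_type_desc, rs.count_executions
FROM sys.query_store_query AS q
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
WHERE q.query_hash = @hash;

-- Top wait categories for the same plans; in our case the dominant wait was a lock.
SELECT ws.plan_id, ws.wait_category_desc,
       SUM(ws.total_query_wait_time_ms) AS total_wait_ms
FROM sys.query_store_wait_stats AS ws
JOIN sys.query_store_plan AS p ON p.plan_id = ws.plan_id
JOIN sys.query_store_query AS q ON q.query_id = p.query_id
WHERE q.query_hash = @hash
GROUP BY ws.plan_id, ws.wait_category_desc
ORDER BY total_wait_ms DESC;
```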
Because this problem happened intermittently and we did not know when it would surface again, we turned to sp_whoisactive, a fantastic free and open-source tool written by Adam Machanic.
This stored procedure shows what is running on the database right now. Its many options can be toggled via parameters, including a very convenient one for analyzing the blocking chain and another for capturing the output to a table for post-mortem analysis.
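A sketch of how those two parameters can be combined; the destination table name is an assumption:

```sql
-- One-time setup: let sp_whoisactive generate the DDL for the destination table.
DECLARE @schema nvarchar(max);
EXEC dbo.sp_WhoIsActive @find_block_leaders = 1, @return_schema = 1, @schema = @schema OUTPUT;
SET @schema = REPLACE(@schema, '<table_name>', 'dbo.WhoIsActive_Capture');
EXEC (@schema);

-- Each capture: flag the lead blockers and persist the snapshot.
EXEC dbo.sp_WhoIsActive
     @find_block_leaders = 1,
     @sort_order = '[blocked_session_count] DESC',
     @destination_table = 'dbo.WhoIsActive_Capture';
```

@find_block_leaders adds a blocked_session_count column, so sorting on it floats the head of the blocking chain to the top of the result set.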
Using an on-premises SQL Server that the customer had in their infrastructure, we created a linked server to Azure SQL Database, and with the on-premises SQL Agent we created a job running every 10 minutes to capture the output of sp_whoisactive.
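The agent job step can be sketched roughly as below, assuming a linked server named [AzureSqlDb] with RPC Out enabled, sp_whoisactive installed on the Azure SQL database, and a local capture table whose columns match sp_whoisactive's output (all names here are placeholders):

```sql
-- Job step (runs every 10 minutes on the on-premises SQL Agent):
-- execute sp_whoisactive remotely on the Azure SQL database and store the
-- result set in a local table for post-mortem analysis.
-- @format_output = 0 keeps numeric columns numeric so they insert cleanly.
INSERT INTO dbo.WhoIsActive_Capture
EXEC ('EXEC dbo.sp_WhoIsActive @find_block_leaders = 1, @format_output = 0')
     AT [AzureSqlDb];
```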
After a few days, the issue surfaced again, and we checked the table where we stored the output of sp_whoisactive.
We observed a pattern similar to the one below, which I reproduced in my lab.
There was a lead blocker session executing a long-running query issued by one of our customer's applications. It was blocking a TRUNCATE statement issued by the same application, which in turn was blocking the Power BI query (program name = "Mashup engine"). Once the execution timeout expired, Power BI terminated the connection and surfaced the error to the end user.
With sp_whoisactive we can see that these sessions hold incompatible locks, which is why the query has to wait for the lock to be released.
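sp_whoisactive can surface the locks themselves as well, which makes the incompatibility visible directly, for example a TRUNCATE's schema-modification (Sch-M) lock blocking the schema-stability (Sch-S) lock a metadata query needs:

```sql
-- Adds a locks column containing, per session, an XML document listing the
-- locks it holds and the locks it is waiting on.
EXEC dbo.sp_WhoIsActive @get_locks = 1;
```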
At this point, the customer turned their attention to optimizing the lead blocker query to reduce the impact of the blocking.