Azure Synapse Analytics Blog

Historical monitoring dashboards for Azure Synapse dedicated SQL pools

Feb 02, 2023

Author: NickSalch is a Sr. Program Manager in the Azure Synapse Customer Success Engineering (CSE) team.

 

Introduction

There are many options available for monitoring your dedicated SQL pool and this blog is part of a series that will walk you through implementing a monitoring solution for Azure Synapse Analytics. This article will walk you through the steps to enable historical monitoring using Azure Monitor Workbook templates on top of Azure Metrics and Azure Log Analytics. Once the workbook is deployed, you will have a rich graphical, interactive set of reports that help you identify and investigate issues with your workload.  


Prerequisites:

  1. A new or existing dedicated SQL pool: Quickstart: Create and query a dedicated SQL pool (formerly SQL DW) in Azure Synapse Analytics using the Azure portal
  2. A new or existing Log Analytics Workspace: Create Log Analytics workspaces - Azure Monitor

 

Enable Diagnostic Settings on your dedicated SQL pool

  1. Navigate to your dedicated SQL Pool in the Azure Portal
  2. Along the left pane, select the Diagnostic settings link
  3. Create a new diagnostic setting by clicking + Add diagnostic setting


  4. Give the diagnostic setting a name, such as Send to Log Analytics
  5. Select the categories Exec Requests and Request Steps. The other categories are not necessary for the workbooks we are deploying.
  6. Select Send to Log Analytics workspace, then select the subscription and workspace you would like to send the data to


  7. Click the Save button.

 

NOTE: The first time this is enabled, it can take a few hours before data starts flowing to the Log Analytics workspace. The workbook will not allow you to select the dedicated SQL pool until data exists in the Log Analytics workspace, so you can use that as a check for whether the data has arrived yet.


OPTIONAL: Enable SQL Audit log to Log Analytics

If you enable the SQL Audit log to go to the same Log Analytics workspace, a new Audit tab will appear in the workbook to help you investigate the Audit log.

 

  1. Navigate to your dedicated SQL Pool in the Azure Portal
  2. Along the left pane, select the Auditing link
  3. Flip the switch to Enable Azure SQL Auditing
    1. You can alternatively select View workspace settings and enable auditing at the workspace level. The configuration is the same, but it will be enabled for all dedicated SQL pools in the workspace.
  4. Select the Log Analytics checkbox
  5. Fill in the subscription and Log Analytics workspace name for your configured workspace. It is recommended to send this to the same workspace so everything can be viewed in one workbook later, but you could deploy a separate workbook just to see the Audit tab.


Copy the workbook template from the Azure Synapse Toolbox

  1. Navigate to the Azure Synapse Toolbox on GitHub
  2. Select Monitor_Workbooks
  3. Select the proper workbook for your dedicated SQL Pool
    1. DedicatedSqlPool_workspace_vx.x.workbook: Use this template for a dedicated SQL pool that was created as part of a Synapse workspace
    2. DedicatedSqlPool_Non-Workspace_vx.x.workbook: Use this template for a dedicated SQL Pool that was created standalone, outside of a Synapse workspace.
  4. After opening the template, click the Copy Raw button to ensure it is copied as plain text, or click Raw and copy the raw code manually from the new page.


Deploy the Azure Monitor Workbook

  1. Navigate to Monitor in the Azure Portal
  2. Along the left pane, select the Workbooks link
  3. Click the + New button to create an empty workbook 


  4. Along the top bar, select the </> button to open the advanced editor
  5. Replace the text in the window with the text copied from the GitHub template, then click Apply


Configure the workbook

  1. If the workbook is still in edit mode, click Done Editing along the top bar.
  2. Set the variables along the top of the workbook. When you make a selection for one variable it will use that choice to populate the choices in the next variable.
    1. TimeRange: The time range for data you would like to explore
    2. LogAnalyticsSubscription: Subscription that holds the log analytics workspace where diagnostics are configured to send data to
    3. LogAnalyticsWorkspace: Name of the Log Analytics Workspace where diagnostics are configured to send data to
    4. DatabaseResourceName: Workspace or server name + database name for the instance you want to monitor. If the database you want to monitor is not in the list, then no records were found in the Log Analytics workspace for that database.
  3. Save the workbook and provide a name; it will save with the current variable selections.
  4. If desired, pin the workbook to an Azure dashboard. You can also pin individual charts from the workbook to an Azure dashboard. 

 

Updating the workbook

If you want to update your workbook to the latest version, follow the same deployment steps again, except instead of creating a new workbook, click Edit on the current workbook, open the advanced editor, and paste in the new template.

 

Using the workbook

This workbook is frequently updated with new features, so it’s recommended you periodically look for the newest version on GitHub, especially if you encounter an issue with the workbook. The tabs and charts in this workbook can change over time as feedback is incorporated.

 

Overview

 

The Overview page is meant to give you a big picture of your workload for the selected time range. The first four charts are perhaps the quickest way to determine whether your workload is running within its normal range from day to day. The Overview page is intended to surface issues to be investigated on other tabs, not necessarily to deep dive into why something happened.

 

Things you may identify from these charts:

  • DWU Usage higher than normal
  • More queued queries than normal
  • TempDB spikes
  • Workload Group utilization hitting 100% causing queueing

 

If you discover any of these issues, you can use the other tabs to investigate the root cause.
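
If the charts show more queueing than normal and you want a quick point-in-time check to complement the historical view, a live DMV query along these lines can list requests that are currently queued. This is an illustrative sketch, not part of the workbook; the Suspended status and group_name column come from the sys.dm_pdw_exec_requests documentation.

    -- Requests currently waiting to run, with the workload group they were classified into
    SELECT request_id,
           [status],
           submit_time,
           group_name,    -- workload group of the request
           command
    FROM   sys.dm_pdw_exec_requests
    WHERE  [status] = 'Suspended'
    ORDER BY submit_time;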

 

Workload Management

The Workload Management page is meant to help investigate the effectiveness of your workload management configuration. You can see what percentage of your overall concurrency resources are used by each workload group and how many queries run under each group.

 

  • Workload Group Allocation by System Percent: Amount of system resources used by each workload group during the selected timeframe
  • Query Count by Workload Group: Number of queries executed in each workload group during the selected timeframe
  • Queued Queries by Workload Group: Identifies which workload groups had queued queries. You can use workload group importance to influence which groups queue first when there is contention
  • Individual workload group investigation: Select a particular workload group to see its usage versus the configured maximum. This is most useful when a workload group has a cap set below 100%, because you can use this chart to determine whether that cap is being hit and therefore whether the configuration is efficient. To see the configured caps themselves, you can query the catalog view shown below.
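
As referenced above, a minimal sketch for listing your workload groups and their configured limits directly from the dedicated SQL pool, using the sys.workload_management_workload_groups catalog view:

    -- Workload groups with their minimum and cap resource percentages
    SELECT name,
           importance,
           min_percentage_resource,
           cap_percentage_resource,
           request_min_resource_grant_percent
    FROM   sys.workload_management_workload_groups
    ORDER BY name;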

 

TempDB

TempDB utilization is driven by two main factors:

 

  1. TempDB is where the results of data movement are stored during query execution – i.e. the result of a shuffle or broadcast move. If these intermediate data movements are large, it could indicate a query issue.
  2. If a query is assigned to a workload group that does not grant enough memory, the extra memory required spills to TempDB. In very bad scenarios, such as very poorly written queries or missing statistics, this spilling can be significant.
  • TempDB Max and Avg Utilization: This is the overall TempDB utilization for this dedicated pool. To investigate a spike, highlight a range on the chart from a time before the spike to a time after it, leaving enough space so that records for the beginning and end of the queries are captured. This will filter the rest of the charts and tables to show activities that were happening during the spike.

  • 20 Largest Query Steps by Most Rows Moved (graphical): This is a graphical representation of the number of rows moved between distributions during the query execution. The highest bars here likely contributed to the TempDB spike. Keep in mind that a broadcastMoveOperation uses a lot more space than a shuffleMoveOperation of the same number of rows due to the broadcast creating multiple copies of a table (replicate). BroadcastMoves at the top of this list are extra-interesting for investigating TempDB spikes.
  • 20 Largest Query Steps by Most Rows Moved (tabular): This is a tabular representation of the above chart. Look at the operation type and row count to determine which queries were likely using the most TempDB during that time. Keep in mind that this table lists the STEPS of a query plan; by selecting a step in this list, the next table will populate with the query that ran that step.
  • Selected Query: After selecting a step above, you will see the original query text for that query, as well as which workload group it was assigned to and how long it ran.
  • Query Plan: Provides all of the steps of the query plan for the selected query. This is useful to determine whether there was a single large step or multiple, or just to get an idea of the steps the query is running. For instance, if I am joining two tables of 100 K rows each, I do not expect to see a shuffle generating 1 billion rows, so that is likely a query to investigate. A live-DMV sketch of the same idea follows this list.
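
The workbook builds these charts from the Request Steps diagnostic logs in Log Analytics. As a rough live counterpart (the DMVs only retain recent requests, unlike the historical workbook data), a query like this against sys.dm_pdw_request_steps surfaces the data movement steps that moved the most rows:

    -- Largest recent data movement steps by rows moved
    SELECT TOP 20
           request_id,
           step_index,
           operation_type,     -- e.g. ShuffleMoveOperation, BroadcastMoveOperation
           row_count,
           total_elapsed_time
    FROM   sys.dm_pdw_request_steps
    WHERE  operation_type IN ('ShuffleMoveOperation', 'BroadcastMoveOperation')
    ORDER BY row_count DESC;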

 

Replicated Tables

The Replicated Tables page focuses on the history of the internal jobs that keep your replicated tables up to date. This can be very useful in determining whether a table is actually a good candidate for a replicated table.

 

As a quick review, replicated tables are initially created by the system as round robin. The first time the table is queried, a background process called BuildReplicatedTableCache is triggered that caches a full copy of the table to the first distribution assigned to each compute node. Future queries benefit from the cached replicated version until a change is made to the table, at which point the cached copies are invalidated and the process starts over.
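
If you want to check the current cache state of your replicated tables directly in the database, a small sketch using the sys.pdw_replicated_table_cache_state catalog view joined to sys.tables:

    -- Which replicated tables currently have a ready (cached) copy
    SELECT t.[name]  AS table_name,
           c.[state] AS cache_state   -- 'Ready' or 'NotReady'
    FROM   sys.pdw_replicated_table_cache_state AS c
    JOIN   sys.tables AS t
           ON c.object_id = t.object_id
    ORDER BY t.[name];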

 

  • Top 15 Most Rebuilds by Table: This table shows how often each table is rebuilt. The more often the cached version of the table is invalidated and rebuilt, the less likely it is that the table benefits from being replicated. If a table is recached 20 times a day but is queried 200 times a day, then that number may be acceptable. However, if a table is recached 20 times a day and queried only 20 times a day, then it may not be benefiting at all from being replicated and could actually be adding unnecessary work to the service.
  • <various charts go into more detail about the rebuilds above>
  • Top 15 Elapsed Time per Rebuild: This table focuses on how long it takes to rebuild the cached version of a replicated table. In general, these operations should be fast – in the seconds-to-minutes range. If the table is well above the 2 GB guidance for the maximum size of a replicated table, it may be taking a long time to rebuild simply because it is so large, in which case it may not be a good candidate for a replicated table. Look for tables in this list that take 10, 20, 30+ minutes; select the table in the list and you will see the query plan for the cache build.
  • Query Plan for Selected Query: All query plans for BuildReplicatedTableCache look the same, but as part of this plan you can see how many rows are in the table. This number does not indicate the size of the table directly, but if you see the table has 200 million rows, it is unlikely to be below the 2 GB guidance for replicated tables. Run DBCC PDW_SHOWSPACEUSED to check the size of the table (see the example after this list); if it is well over 2 GB, then a different distribution method should be considered.
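
For example, to check the size of a suspect table you can run the following (dbo.DimCustomer is just a placeholder name):

    -- Space used per distribution for a single table; compare total data space
    -- against the ~2 GB guidance for replicated tables
    DBCC PDW_SHOWSPACEUSED("dbo.DimCustomer");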

 

Query Investigation

 

Query Investigation is very useful for investigating a particular run of a query or comparing historical runs of a query to each other. If you know the Request ID for the query (found in sys.dm_pdw_exec_requests), you can search for that, or you can search for a snippet of the query text. The query text search covers the first 4,000 characters of the query text.
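
If you are starting from the query text rather than the Request ID, a lookup like this against sys.dm_pdw_exec_requests returns recent Request IDs to search for; the LIKE pattern below is only an illustration, and the command column holds just the first 4,000 characters of the statement:

    -- Find recent Request IDs whose text matches a snippet
    SELECT request_id,
           [status],
           submit_time,
           total_elapsed_time,
           command
    FROM   sys.dm_pdw_exec_requests
    WHERE  command LIKE '%FactInternetSales%'   -- hypothetical search snippet
    ORDER BY submit_time DESC;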

 

  • Matching Queries: All queries that match the search criteria used. Multiple executions are aggregated into one entry with the number of executions listed.
  • DSQL Command: After selecting a query from the list above, the first 4,000 characters of the query text will be listed.
  • All Executions of Selected Query Text: If there were multiple executions of this query text, each execution within the selected time range will be listed separately.
  • Query Plan for Selected Query: After selecting a query from all of the executions the query plan for that run will be displayed.
  • Triggered Auto-Stats Statements: For the selected query this will return data if there were autostats jobs that were triggered by that query.

 

Query Comparison

Query Comparison allows you to enter two Request IDs and see the plans side by side. This is useful for determining the difference between the plans of two runs of the same query, as well as seeing where execution time was spent from one run to another.

 

Autostats

The Autostats tab was created to help in the scenario where you see an autostats job executing but do not know which query caused it to run. Enter the Request ID of the autostats job, or the name of the table the statistics are created on, and it will show you the query that triggered that autostats job.
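
As a complementary check on the database side, you can list which statistics on a given table were auto-created and when they were last updated; a small sketch using sys.stats and STATS_DATE (dbo.FactSales is a placeholder table name):

    -- Auto-created statistics on a table and their last update time
    SELECT s.[name]                             AS stats_name,
           STATS_DATE(s.object_id, s.stats_id)  AS last_updated
    FROM   sys.stats AS s
    WHERE  s.object_id = OBJECT_ID('dbo.FactSales')   -- placeholder table name
      AND  s.auto_created = 1;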

 

Query Audit

Query Audit allows you to filter to a particular user or application and see all of the queries they executed against the database within the selected timeframe.

 

Query Search

Query Search simplifies investigation into who executed a particular query. For instance, if a DROP TABLE run by a user dropped data that should not have been dropped, you can search for the table name, or just DROP TABLE, to find the query that ran the drop as well as the user that executed it.
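
When the statement is recent enough to still be in the DMVs, a similar answer is available without the audit log by joining sys.dm_pdw_exec_requests to sys.dm_pdw_exec_sessions for the login name. This is only a sketch; the DMVs keep a limited history, so the Log Analytics-based Query Search tab remains the right tool for older activity.

    -- Who ran statements containing 'DROP TABLE' (recent history only)
    SELECT r.request_id,
           r.submit_time,
           s.login_name,
           s.app_name,
           r.command
    FROM   sys.dm_pdw_exec_requests AS r
    JOIN   sys.dm_pdw_exec_sessions AS s
           ON r.session_id = s.session_id
    WHERE  r.command LIKE '%DROP TABLE%'
    ORDER BY r.submit_time DESC;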

 

Other monitoring blog posts:

 

Our team publishes blogs regularly, and you can find all of them at https://aka.ms/synapsecseblog. For a deeper understanding of Synapse implementation best practices, please refer to our Success by Design (SBD) site at https://aka.ms/Synapse-Success-By-Design.
