We are excited to announce that Multi-Column Distribution (MCD) for Azure Synapse Dedicated SQL pools is now Generally Available in the latest DW release! MCD is highly desirable for easing migrations, promotes faster query performance and reduces data skew.
To get the latest DW release, pause and resume your instance or wait for your maintenance window to start using MCD.
What is a hash distributed table?
Dedicated SQL pools in Azure Synapse Analytics distribute table rows across 60 distributions. Data from these distributions is abstracted to be viewed as a single table. Hash-distribution of one of the algorithms to distribute data improving query performance on large fact tables.
To learn more about data distribution read Distributed tables design guidance - Azure Synapse Analytics.
What is Multi-Column Distributed Tables?
You can choose to distribute data on multiple columns to balance the data distribution in your tables and reduce data movement during query execution. Choose MCD columns based on the query structures that are common in your workloads and are causing either base table or intermediate query execution data skew. For example, GROUP BY and/or JOIN on multiple columns can cause data skew when distributed on single column.
Multi-Column distribution will allow you to choose up to eight columns for distribution. This not only reduces the data skew over time but can also improve query performance in some cases.
How can you create Multi-Column Distributed Objects?
Starting with database compatibility level 50, MCD is supported. To change the database's compatibility level to 50 with this command. For more information on setting the database compatibility level, see ALTER DATABSE SCOPED CONFIGURATION.
ALTER DATABASE SCOPED CONFIGURATION SET DW_COMPATIBILITY_LEVEL = 50;
To load data into an MCD table, use CTAS statement and the data source needs to be Synapse SQL tables.
Multi-Column Distribution is supported with the following commands:
For example:
CREATE TABLE [dbo].[FactInternetSales]
( [ProductKey] int NOT NULL
, [OrderDateKey] int NOT NULL
, [CustomerKey] int NOT NULL
, [PromotionKey] int NOT NULL
, [SalesOrderNumber] nvarchar(20) NOT NULL
, [OrderQuantity] smallint NOT NULL
, [UnitPrice] money NOT NULL
, [SalesAmount] money NOT NULL
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([ProductKey], [OrderDateKey], [CustomerKey] , [PromotionKey])
);
How can you check which columns are set as Distribution Columns in an Object?
To check if a column is a distribution column, use sys.pdw_column_distribution_properties.
Distribution ordinal (1-based),
- Ordinal = 0: not a distribution
- Ordinal >= 1: distribution columns
You can use the following script to determine distribution columns for an object:
SELECT dist_prop.object_id, dist_prop.column_id, col.name, dist_prop.distribution_ordinal
FROM sys.pdw_column_distribution_properties as dist_prop
JOIN sys.columns as col
on dist_prop.object_id = col.object_id and dist_prop.column_id = col.column_id
Where dist_prop.object_id = <object_id>
ORDER BY dist_prop.column_id
The image below is what the result will look like. If the value in the distribution_ordinal column is >= 1, the column is a distribution column.
Known limitations
Currently we support two paths to load data into multi-column distribution tables:
Loading option |
Supported |
CTAS from regular Azure Synapse SQL table |
Yes |
COPY statement using Openrowset
|
Yes |
COPY statement using polybase |
No |
CTAS from External Table |
No |
Insert from External Table |
No |
Thanks for reading and hope you enjoy using Multi-Column Distribution in Azure Synapse Dedicated SQL pools!
Reach out to me via Twitter @maariyali or LinkedIn at https://www.linkedin.com/in/alimariya
Mariya Ali
Product Manager
Azure Synapse Analytics