Multi-Column Distribution for Dedicated SQL pools is now GA!
Published Mar 22 2023 08:00 AM 16.6K Views
Microsoft

We are excited to announce that Multi-Column Distribution (MCD) for Azure Synapse Dedicated SQL pools is now Generally Available in the latest DW release! MCD is highly desirable for easing migrations, promotes faster query performance and reduces data skew. 

 

To get the latest DW release, pause and resume your instance or wait for your maintenance window to start using MCD.

 

CSE Office Hours - MCD.gif

 

What is a hash distributed table?

Dedicated SQL pools in Azure Synapse Analytics distribute table rows across 60 distributions. Data from these distributions is abstracted to be viewed as a single table. Hash-distribution of one of the algorithms to distribute data improving query performance on large fact tables.

 

sswoeng_0-1679442383551.png

 

To learn more about data distribution read Distributed tables design guidance - Azure Synapse Analytics.

 

What is Multi-Column Distributed Tables?

CSE Office Hours - MCD2.gif

You can choose to distribute data on multiple columns to balance the data distribution in your tables and reduce data movement during query execution. Choose MCD columns based on the query structures that are common in your workloads and are causing either base table or intermediate query execution data skew. For example, GROUP BY and/or JOIN on multiple columns can cause data skew when distributed on single column.

 

Multi-Column distribution will allow you to choose up to eight columns for distribution. This not only reduces the data skew over time but can also improve query performance in some cases.

 

How can you create Multi-Column Distributed Objects?

Starting with database compatibility level 50, MCD is supported.  To change the database's compatibility level to 50 with this command. For more information on setting the database compatibility level, see ALTER DATABSE SCOPED CONFIGURATION.

 

ALTER DATABASE SCOPED CONFIGURATION SET DW_COMPATIBILITY_LEVEL = 50;

 

To load data into an MCD table, use CTAS statement and the data source needs to be Synapse SQL tables.

Multi-Column Distribution is supported with the following commands:

  1. CREATE MATERIALIZED VIEW
  2. CREATE TABLE
  3. CREATE TABLE AS SELECT

 

For example:

 

 

CREATE TABLE [dbo].[FactInternetSales]
(   [ProductKey]            int          NOT NULL
,   [OrderDateKey]          int          NOT NULL
,   [CustomerKey]           int          NOT NULL
,   [PromotionKey]          int          NOT NULL
,   [SalesOrderNumber]      nvarchar(20) NOT NULL
,   [OrderQuantity]         smallint     NOT NULL
,   [UnitPrice]             money        NOT NULL
,   [SalesAmount]           money        NOT NULL
)
WITH
(   CLUSTERED COLUMNSTORE INDEX
,  DISTRIBUTION = HASH([ProductKey],   [OrderDateKey],   [CustomerKey] ,   [PromotionKey])
);

 

 

 

 

How can you check which columns are set as Distribution Columns in an Object?

To check if a column is a distribution column, use sys.pdw_column_distribution_properties.

 

Distribution ordinal (1-based),  

  • Ordinal = 0:  not a distribution 
  • Ordinal >= 1: distribution columns 

 

You can use the following script to determine distribution columns for an object:

 

 

SELECT dist_prop.object_id, dist_prop.column_id, col.name, dist_prop.distribution_ordinal 
FROM sys.pdw_column_distribution_properties as dist_prop 
JOIN sys.columns as col 
on dist_prop.object_id = col.object_id and dist_prop.column_id = col.column_id 
Where dist_prop.object_id = <object_id>
ORDER BY dist_prop.column_id

 

 

 

The image below is what the result will look like. If the value in the distribution_ordinal column is >= 1, the column is a distribution column.

sswoeng_1-1679443044370.png

Known limitations

  1. DacFx support for MCD tables will be added soon.
  2. CTAS on MCD HEAP target tables is not supported. Please use INSERT SELECT as a workaround to load data into MCD HEAP tables.
  3. Currently we support two paths to load data into multi-column distribution tables:

Loading option 

Supported 

CTAS from regular Azure Synapse SQL table 

Yes 

COPY statement using Openrowset

Yes

COPY statement using polybase 

No 

CTAS from External Table 

No 

Insert from External Table 

No 

 

Thanks for reading and hope you enjoy using Multi-Column Distribution in Azure Synapse Dedicated SQL pools!

 

Reach out to me via Twitter @maariyali or LinkedIn at https://www.linkedin.com/in/alimariya

 

Mariya Ali
Product Manager
Azure Synapse Analytics

 

14 Comments
Co-Authors
Version history
Last update:
‎Mar 30 2023 02:49 PM
Updated by: