Azure Synapse Analytics Blog

5 MIN READ

Common Data Warehouse Functionality Redefined By CTAS

Microsoft

Jul 08, 2019

If you are new to Azure SQL DW, you may not be aware of some syntax that is unique to Microsoft MPP platforms. One specific syntactic feature is: “Create Table As Select” (CTAS).This might seem like it is the same as our old friend “Select Into” but the new syntax coupled with DW architecture offers a new world of possibilities when managing large quantities of data.

First, lets talk about the syntax. You can see right away from t-sql documentation that there is a little more to this functionality than the simple “Select Into”. You can easily add indexes, partitions, and change your table distribution in a single statement.

The real power of CTAS is realized in context of the data warehouse architecture and in loading/processing strategies. Azure SQL Data Warehouse offers neither explicit primary nor foreign key constraints. For users coming from a traditional RDB platform this is seen a limitation of the platform; however, this is a strength for the DW environment. Application databases are typically in 3NF where foreign and primary constraints are needed for integrity. In a DW environment, the integrity of the data is inherited from the source systems and denormalization demands that constraints are implied (rather than explicitly applied) to provide further performance gains on the platform.

Traditional waterfall approaches to database development where DDL is deployed and only DML is performed in production. New demands for performant big data environments call for new architectures. We can’t just deploy database structures that never change; instead, the data must be deployed and the structures/models that define the data must be fluid and change as business requirements change. In Azure SQL Data Warehouse, CTAS is the most powerful and efficient method to change the database structure to meet the needs of the business. In addition to changing database structures for business needs, data loading methodologies demand higher performance than traditional merge commands (which are not available in Azure DW).

Before digging into some examples of CTAS functionality, I need to mention the RENAME functionality. RENAME is the second half of the solution to fast data loading in Azure SQL Data Warehouse. RENAME eliminates the need for large amounts of code, ensures appropriate locking of objects and limits the impact that object locks can have on end users. CTAS plus RENAME become a very powerful combination in loading, managing, and administering your Azure SQL DW environment.

CTAS is highly optimized for Azure SQL Data Warehouse and a is full parallelized allowing for maximum throughput between compute nodes internal to the DW environment. When CTAS completes, the table level statistics are updated and accurate and clustered columnstore indexes have optimal compression. With that said, there are two important attributes to be aware of:

Columnstore index compression is based upon the resource class of the login performing the CTAS. Generally, it is a best practice to run CTAS statements under a medium or large resource class to provide enough memory for records to fit into the buffer prior to being compressed. This is especially important for wide tables.
CTAS does not copy statistics (other than the table level statistic). Any user or auto created statistics may need to be recreated manually as secondary steps of the CTAS prior to a RENAME operation.

Below are some examples of everyday tasks that can be done using a combination of CTAS and RENAME:

Delete Data

(This example deletes data with an expiration date older than 01/01/2000.)

CREATE TABLE XYZ

WITH (DISTRIBUTION=ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)

SELECT * FROM ABC WHERE [ExpirationDate] >= ‘01/01/2000’;

RENAME OBJECT ABC TO ABC_OLD;

RENAME OBJECT XYZ TO ABC;

DROP TABLE ABC_OLD;

Update Data

(This example updates a field call ExpirationFlag where that record’s expiration date is older than 01/01/2000)

CREATE TABLE XYZ

WITH (DISTRIBUTION=ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)

SELECT

[col1],

[col2],

1 as ExpiredFlag

FROM ABC

WHERE ExpirationDate < ‘01/01/2000’

UNION ALL

SELECT

[col1],

[col2],

[ExpiredFlag]

FROM ABC

WHERE ExpirationDate >= ‘01/01/2000’;

RENAME OBJECT ABC TO ABC_OLD;

RENAME OBJECT XYZ TO ABC;

DROP TABLE ABC_OLD;

Insert Data

(This example inserts records from a stage table into a production table.)

CREATE TABLE XYZ

WITH (DISTRIBUTION=ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)

SELECT * FROM ABC

UNION ALL

SELECT * FROM ABC_Stage;

RENAME OBJECT ABC TO ABC_OLD;

RENAME OBJECT XYZ TO ABC;

DROP TABLE ABC_OLD;

Change Partitions/distribution/index on a table

(This example takes the existing table: ABC, and partitions it based upon the new partition definition.)

CREATE TABLE XYZ

WITH (DISTRIBUTION=HASH(col1),

CLUSTERED INDEX (col1),

PARTITION (COL2 RANGE LEFT FOR VALUES (10,20,30,40)))

SELECT * FROM ABC

RENAME OBJECT ABC TO ABC_OLD;

RENAME OBJECT XYZ TO ABC;

DROP TABLE ABC_OLD;

Concatenate fields

(This is a great example of how to create new fields in your table based upon existing fields or data. Here, we are creating a new field out of several fields and using the NewField as the distribution key.)

CREATE TABLE XYZ

WITH (DISTRIBUTION=HASH(NewField), CLUSTERED COLUMNSTORE INDEX)

SELECT col1+col2+col3 as NewField, *

FROM ABC;

RENAME OBJECT ABC TO ABC_OLD;

RENAME OBJECT XYZ TO ABC;

DROP TABLE ABC_OLD;

Change Datatype or nullability on a field

(What happens when you want to change the datatype or the nullability of a field in your table? The new table takes the field datatypes/nullability/sizes and transfers them to the new table. Below is a work around that will help you alter a table definition quickly when the field attributes change.)

1. Script out existing table and create a new blank table with the correct datatype or nullability.

Scripted from source:

CREATE TABLE ABC

(

col1 integer not null

,col2 varchar(50) not null

,col3 varchar(50) not null

,ExpirationFlag bit not null

,ExpirationDate datetime null)

WITH (DISTRIBUTION=ROUND_ROBIN, clustered columnstore index);

Modified (in bold):

CREATE TABLE ABC_Blank

(

col1 BIGINT not null

,col2 varchar(50) null

,col3 varchar(50) null

,ExpirationFlag bit not null

,ExpirationDate datetime null)

WITH (DISTRIBUTION=ROUND_ROBIN, clustered columnstore index);

2. Use a union statement to join blank table and existing table. The table created using a CTAS statement inherits the datatypes and nullability from the first table in the union statement:

CREATE TABLE XYZ

WITH (DISTRIBUTION=ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)

SELECT * FROM ABC_BLANK

UNION ALL

SELECT * FROM ABC;

RENAME OBJECT ABC TO ABC_OLD;

RENAME OBJECT XYZ TO ABC;

DROP TABLE ABC_OLD;

Rebuild a clustered columnstore index with minimal user impact:

CREATE TABLE ABC_REBUILT

WITH (DISTRIBUTION=ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)

SELECT * FROM ABC;