Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 3
Published Nov 22 2023

This article is intended for audiences who are considering options to move their data into Azure and prefer T-SQL to query the data.

 

Overview

This is a follow-up blog to Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 1 & Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 2. By the time you get to this article, you will have narrowed down your choices to either an OLTP use case or a Data Warehouse use case.

If the choice is leaning towards Azure Synapse SQL, you need to understand the following key decision factors:

 

A) Files & File Groups

One of the major differences between Synapse SQL and Azure SQL is the way they work with storage. For Azure SQL, storage is in terms of traditional SQL files (data files and log files).

[Figure: Azure SQL storage as traditional data and log files]

 

Whereas in Azure Synapse SQL, storage is in terms of the volume of data. It does not take into account traditional SQL files (data, log) and can work with any file format that ADLS Gen2 supports today. Data is stored in tables defined on files in the data lake and is seamlessly consumed by either Spark or Hive, removing the traditional technology barriers between using SQL and Spark together. Synapse SQL provides a distributed query system for T-SQL to work with the actual files.

[Figure: Azure Synapse SQL storage over files in the data lake]

 

B) T-SQL Differences

Given how each of these Azure Synapse SQL offerings works with files in a pseudo-RDBMS/SQL fashion, there are expected T-SQL feature differences. This is due to the change in the underlying engine. Synapse SQL uses a node-based architecture. The Control node is the brain of this architecture: it's the front end that interacts with all applications and connections. The Azure Synapse SQL Control node utilizes a distributed query engine to optimize queries for parallel processing, then passes these operations to Compute nodes to do their work in parallel. It's always good to visit this page for the updated list of platform differences.

 

Concept | Dedicated SQL Pool Support | Serverless SQL Pool Support
Connections | Yes | Yes
Resource classes and concurrency | Yes (preferred over using the default) | No
Transactions | Yes | No
User-defined schemas | Yes | Yes
Table distribution | Yes | No
Table indexes | Yes | No
Table partitions | Yes | No
Statistics | Yes | Yes
CTAS | Yes | No
External tables | Yes | Yes
CETAS | Yes | Yes
Stored procedures | Yes | Yes
Labels | Yes | No
Views | Yes | Yes
Temporary tables | Yes | Yes
Dynamic SQL | Yes | Yes
Looping | Yes | Yes
GROUP BY options | Yes | No
Variable assignment | Yes | Yes

 

This list is often overlooked or missed during decision making. The total cost of an engine shift (from on-premises or anywhere else to Azure) comprises two aspects: direct cost benefits and the cost of refactoring efforts. The general rule of thumb here is: if the cost of refactoring is greater than the cost benefits you gain from choosing a particular service (with no actual current use case for the additional features offered by the new engine), don't change the engine.

 

C) Purchasing Models

A Synapse SQL pool represents a collection of analytic resources that are being provisioned. Compute is separate from storage, which enables you to scale compute independently of the data in your system. Synapse SQL uses Azure Storage to keep your user data safe. Since your data is stored and managed by Azure Storage, there's a separate charge for your storage consumption.

The Compute nodes provide the computational power.

  • When you submit a T-SQL query to dedicated SQL pool, the Control node transforms it into queries that run against each distribution in parallel.
    • A distribution is the basic unit of storage and processing for parallel queries that run on distributed data in dedicated SQL pool. 
    • When dedicated SQL pool runs a query, the work is divided into 60 smaller queries that run in parallel. Each of the 60 smaller queries runs on one of the data distributions. Each Compute node manages one or more of the 60 distributions.
    • A dedicated SQL pool with maximum compute resources has one distribution per Compute node. A dedicated SQL pool with minimum compute resources has all the distributions on one Compute node (a table-distribution sketch follows this list).
    • Data Movement Service (DMS) is the data transport technology in dedicated SQL pool that coordinates data movement between the Compute nodes. 
  • When you submit a T-SQL query to serverless SQL pool, the Control node transforms it into smaller queries called tasks.
    • Each of these smaller queries is called a task and represents a distributed execution unit.
    • These tasks run in parallel on the Compute nodes against sets of files.
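
To make the distribution concept concrete, here is a minimal sketch of creating a hash-distributed table in a dedicated SQL pool (the table and column names are hypothetical):

    -- Hypothetical example: a hash-distributed fact table in a dedicated SQL pool.
    -- Rows are spread across the 60 distributions by hashing CustomerKey, so joins
    -- and aggregations on CustomerKey avoid data movement between Compute nodes.
    CREATE TABLE dbo.FactSales
    (
        SaleId      BIGINT         NOT NULL,
        CustomerKey INT            NOT NULL,
        SaleAmount  DECIMAL(18, 2) NOT NULL,
        SaleDate    DATE           NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),  -- alternatives: ROUND_ROBIN, REPLICATE
        CLUSTERED COLUMNSTORE INDEX        -- the default storage format for these tables
    );

A distribution column with many distinct values that is frequently used in joins tends to spread work evenly across the distributions.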

[Figure: Synapse SQL architecture with a Control node distributing work to Compute nodes]

Serverless

Serverless SQL pool is serverless, hence there's no infrastructure to set up or clusters to maintain. A default endpoint for this service is provided within every Azure Synapse workspace, so you can start querying data as soon as the workspace is created. It's essentially a logical data warehouse that provides a relational abstraction on top of raw or disparate data.

You don't have to relocate and transform data before consuming it, which allows you to have an up-to-date view of your data as it is stored in the workspace. Thanks to built-in query execution fault-tolerance, the system provides high reliability and success rates even for long-running queries involving large data sets.

 

Data Loading

Users have the following options to access storage:

  • OPENROWSET function that enables ad-hoc queries over the files in Azure Storage (see the sketch after this list).
  • External table that is a predefined data structure built on top of a set of external files.
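
For example, a minimal OPENROWSET sketch for an ad-hoc serverless query over Parquet files (the storage account, container, and path are hypothetical):

    -- Hypothetical example: ad-hoc query over Parquet files in ADLS Gen2
    -- from a serverless SQL pool; no table definition is required up front.
    SELECT TOP 100 *
    FROM OPENROWSET(
        BULK 'https://contosolake.dfs.core.windows.net/data/sales/year=2023/*.parquet',
        FORMAT = 'PARQUET'
    ) AS rows;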

Workloads

  • Using serverless SQL you can query data in the Azure Data Lake (Parquet, Delta Lake, delimited text formats), Azure Cosmos DB, or Dataverse.
  • It's preferred when files are in the range of 100 MB to 10 GB.
  • Since it doesn't have local storage or ingestion capabilities, caching results on the client side by using Power BI import mode or Azure Analysis Services is preferred.
  • One serverless SQL pool can concurrently handle 1,000 active sessions that are executing lightweight queries.
  • That number will drop if the queries are more complex or scan a larger amount of data, so in that case consider decreasing concurrency and executing queries over a longer period of time if possible.
  • It's suitable for large-scale data and computational functions for ad hoc queries that don't require saving the results or revisiting them frequently. However, you do have the option of using CREATE EXTERNAL TABLE AS SELECT (CETAS) to store the query results to storage (a sketch follows this list). As of this writing, the only supported output formats are Parquet and CSV, and the data in the external table cannot be modified.
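
A minimal CETAS sketch, assuming an external data source and file format have already been created in the database (all names and the output path are hypothetical):

    -- Hypothetical example: persist serverless query results to storage with CETAS.
    -- Assumes MyDataSource (an ADLS Gen2 container) and ParquetFormat
    -- (FORMAT_TYPE = PARQUET) already exist in the database.
    CREATE EXTERNAL TABLE dbo.SalesSummary
    WITH (
        LOCATION = 'output/sales-summary/',
        DATA_SOURCE = MyDataSource,
        FILE_FORMAT = ParquetFormat
    )
    AS
    SELECT SaleYear, SUM(SaleAmount) AS TotalSales
    FROM dbo.SalesRaw
    GROUP BY SaleYear;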

Pricing methodology

It's a "pay-per-query" model; in other words, you only pay for the amount of data that is processed. Serverless SQL pool will not incur charges until you run queries.
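
As an illustrative back-of-the-envelope calculation (the rate here is an assumption; always check the current Azure pricing page): at roughly USD 5 per TB of data processed, a query that scans 200 GB of files would cost about USD 1, regardless of how long it runs.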

 

Dedicated 

Dedicated SQL pool (formerly SQL DW) represents a collection of analytic resources that are provisioned when using Synapse SQL. Analytic resources are defined as a combination of CPU, memory, and IO. These three resources are bundled into units of compute scale called Data Warehouse Units (DWUs), which determine the size of a dedicated SQL pool. Dedicated SQL pool stores data in relational tables with columnar storage, as compared to traditional T-SQL tables, which store data in row storage.

[Figure: SQL columnstore load process]

  • Row store is data that's logically organized as a table with rows and columns, and physically stored in a row-wise data format. All of the attributes of a particular row are stored together in the same physical block of data. Row store refers to a table where the underlying data storage format is a heap, a clustered index, or a memory-optimized table.
  • Column store is data that's logically organized as a table with rows and columns, and physically stored in a column-wise data format. A clustered columnstore index is the physical storage for the entire table. You can create nonclustered B-tree or rowstore indexes as secondary indexes on a clustered columnstore index.
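
A minimal generic T-SQL sketch contrasting the two formats (table, column, and index names are hypothetical; note that dedicated SQL pool tables default to clustered columnstore):

    -- Hypothetical example: a rowstore table, where the clustered B-tree
    -- index stores all attributes of a row together.
    CREATE TABLE dbo.OrdersRowStore
    (
        OrderId   BIGINT        NOT NULL PRIMARY KEY,  -- clustered index: row-wise storage
        OrderDate DATE          NOT NULL,
        Amount    DECIMAL(18,2) NOT NULL
    );

    -- Hypothetical example: a columnstore table, where a clustered
    -- columnstore index is the physical storage for the entire table.
    CREATE TABLE dbo.OrdersColumnStore
    (
        OrderId   BIGINT        NOT NULL,
        OrderDate DATE          NOT NULL,
        Amount    DECIMAL(18,2) NOT NULL,
        INDEX cci CLUSTERED COLUMNSTORE                -- column-wise storage
    );

    -- Per the description above, a secondary nonclustered (B-tree) index
    -- can speed up lookups of specific values on the columnstore table.
    CREATE NONCLUSTERED INDEX ix_Orders_OrderId ON dbo.OrdersColumnStore (OrderId);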

Data Loading

Dedicated SQL pool supports loading and exporting data through many loading methods. These include non-PolyBase options (BCP and the SqlBulkCopy API) and PolyBase options. PolyBase is the fastest and most scalable data warehouse loading method to date, hence it's recommended as your default loading mechanism.

  • PolyBase loads data from UTF-8 and UTF-16 encoded delimited text files. For the fastest load, use compressed delimited text files. PolyBase also loads from the Hadoop file formats RC File, ORC, and Parquet.
  • PolyBase can also load data from Gzip and Snappy compressed files.
  • PolyBase currently does not support extended ASCII, fixed-width format, and nested formats such as WinZip, JSON, and XML. 
  • PolyBase can't load rows that have more than 1,000,000 bytes of data. Hence, you should split large compressed files into smaller compressed files.

To load data with PolyBase, you can use any of these loading options:

  • PolyBase with T-SQL works well when your data is in Azure Blob storage or Azure Data Lake Store. It gives you the most control over the loading process, but also requires you to define external data objects. The other methods define these objects behind the scenes as you map source tables to destination tables. To orchestrate T-SQL loads, you can use Azure Data Factory, SSIS, or Azure functions.
  • PolyBase with SSIS works well when your source data is in SQL Server. SSIS defines the source to destination table mappings, and also orchestrates the load. If you already have SSIS packages, you can modify the packages to work with the new data warehouse destination.
  • PolyBase with Azure Data Factory (ADF) is another orchestration tool. It defines a pipeline and schedules jobs.
  • PolyBase with Azure Databricks transfers data from an Azure Synapse Analytics table to a Databricks dataframe and/or writes data from a Databricks dataframe to an Azure Synapse Analytics table using PolyBase.

An external table is similar to a database view. Defining external tables involves specifying the data source, the format of the files, and the table definition, via CREATE EXTERNAL DATA SOURCE, CREATE EXTERNAL FILE FORMAT, and CREATE EXTERNAL TABLE (a combined sketch follows).
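
A minimal end-to-end sketch of these external objects followed by a CTAS load in a dedicated SQL pool (the storage location, paths, and names are hypothetical, and any required credentials are omitted):

    -- Hypothetical example: define external objects over files in the lake.
    CREATE EXTERNAL DATA SOURCE LakeSource
    WITH (TYPE = HADOOP,
          LOCATION = 'abfss://data@contosolake.dfs.core.windows.net');

    CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2));

    CREATE EXTERNAL TABLE dbo.SalesRawExternal
    (
        SaleId     BIGINT,
        SaleAmount DECIMAL(18,2)
    )
    WITH (LOCATION = '/sales/', DATA_SOURCE = LakeSource, FILE_FORMAT = CsvFormat);

    -- CTAS then loads the external data into a distributed internal table in parallel.
    CREATE TABLE dbo.SalesStaging
    WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
    AS SELECT * FROM dbo.SalesRawExternal;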

BCP loads directly to dedicated SQL pool without going through Azure Blob storage, and is intended only for small loads.

  • If your data is not compatible with PolyBase, you can use BCP or the SqlBulkCopy API.
  • You should consider increasing the batch size for better throughput.
  • A batch size between 100K and 1M rows is the recommended baseline for determining optimal batch size capacity.


Workloads

  • Column store indexes are the preferred data storage format for data marts, data warehousing, and analytics workloads.
  • It's efficient for queries that tend to perform operations on large ranges of values rather than looking up specific values.
  • Once data is stored, you can run analytics at massive scale. Compared to traditional database systems, analysis queries finish in seconds instead of minutes, or hours instead of days.
  • Since it has local storage and ingestion capabilities, you can leverage DirectQuery mode in Power BI and let the pool handle major computations for aggregations.

Pricing methodology

It's a "pay-by-DWU" model; in other words, you're charged based on the number of DWU blocks and hours running. You can control costs for a dedicated SQL pool by pausing the resource when it is not in use, or by scaling it down, as shown below.
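
For instance, a minimal sketch of scaling a pool down during off-hours via T-SQL, which reduces the DWU meter (run from the master database; the pool name and target service objective are hypothetical):

    -- Hypothetical example: scale a dedicated SQL pool to a smaller DWU size.
    ALTER DATABASE MyDedicatedPool
    MODIFY (SERVICE_OBJECTIVE = 'DW100c');

Pausing the pool, which stops compute billing entirely, is done through the Azure portal, PowerShell, or the REST API rather than T-SQL.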

 

Basic Purchasing Guidelines for Both Models

When you create or use Azure Synapse Analytics resources, you might get charged for the following meters:

  • Data Exploration & Data Warehousing
    • Dedicated SQL Pool - You're charged for it based on the number of DWU blocks and hours running.
    • Storage - You're charged for it based on the number of TBs stored.
    • Serverless SQL pool - You're charged by the TB of data processed.
  • Apache Spark Pool - You're charged per vCore hour and prorated by the minute.
  • Data Integration
    • Orchestration Activity Runs - You're charged based on the number of activity runs.
    • Data Movement - For copy activities run on the Azure Integration Runtime, you're charged based on the number of DIUs used and the execution duration.
    • Data Flow vCore Hours - For data flow execution and debugging, you're charged based on compute type, number of vCores, and execution duration.

After you delete Azure Synapse resources, the attached Data Lake Storage Gen2 accounts might continue to exist, and they continue to accrue costs until you delete them. It's always better to visit the Azure Pricing Calculator and the Azure Synapse Plan and Manage Costs documentation before arriving at a decision.

 

D) Licensing

Azure Synapse brings together the best of 

  • SQL technologies used in enterprise data warehousing, 
  • Spark technologies used for big data, 
  • Data Explorer for log and time series analytics, 
  • Pipelines for data integration and ETL/ELT, and
  • deep integration with other Azure services such as Power BI, CosmosDB, and AzureML

Synapse SQL uses a node-based architecture where you pay for compute, and there is a separate charge for your storage consumption. There is no licensing cost if you use one or all of the services listed above.

 

In the next article, Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 4, we will cover the following aspects:

E) Backup and Restore

F) High Availability

Followed by Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 5, which discusses the last key consideration for this series: G) Performance Tuning.

 

References:

  1. T-SQL Features in Synapse SQL
  2. Key Design Decisions
  3. Best practices for dedicated SQL pools in Azure Synapse Analytics 
  4. Best practices for serverless SQL pool in Azure Synapse Analytics

Don't forget to share a like if this helps!

Credit: Thanks Ben Harding for review and guidance

FastTrack for Azure:  Move to Azure efficiently with customized guidance from Azure engineering. 

 
