This article is intended for audience who are considering options to move their data into Azure and prefer T-SQL to query the data.
Overview
This is a follow-up blog to Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 1 & Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 2. By the time you get to this article you would have narrowed down your choices either to an OLTP use case or a Data Warehouse use case.
If the choice is leaning towards Azure Synapse SQL you need to understand the following key decision factors
A) Files & File Groups
One of the major differences between Synapse SQL and Azure SQL is the way they work with storage. For Azure SQL, Storage is in terms of traditional SQL files (Data files and Log files)
Where as in Azure Synapse SQL, storage is in terms of volume of data. It does not take into account traditional SQL files (Data, Log). It can take any file formats that ADLS Gen2 supports today. The data is stored in tables defined on files in the data lake are seamlessly consumed by either Spark or Hive. It removes the traditional technology barriers between using SQL and Spark together. It provides a distributed query system for T-SQL to work with actual files.
B) T-SQL Differences
Given how each of these Azure Synapse SQL offerings work with files in Psuedo RDBMS/SQL fashion there are expected T-SQL feature differences. This is due to the change in the underlying engine. Synapse SQL uses a node-based architecture. The Control node is the brain of this architecture. It's the front end that interacts with all applications and connections. The Azure Synapse SQL Control node utilizes a distributed query engine to optimize queries for parallel processing,. The Control node then passes these operations to Compute nodes to do their work in parallel. It's always good to visit this page for the updated list of platform differences.
Concept | Dedicated SQL Pool Support | Serverless SQL Pool Support |
Connections | Yes | Yes |
Resource classes and concurrency | Yes (Preferred than using default) | No |
Transactions | Yes | No |
User-defined schemas | Yes | Yes |
Table distribution | Yes | No |
Table indexes | Yes | No |
Table partitions | Yes | No |
Statistics | Yes | Yes |
CTAS | Yes | No |
External tables | Yes | Yes |
CETAS | Yes | Yes |
Stored procedures | Yes | Yes |
Labels | Yes | No |
Views | Yes | Yes |
Temporary tables | Yes | Yes |
Dynamic SQL | Yes | Yes |
Looping | Yes | Yes |
Group by options | Yes | No |
Variable assignment | Yes | Yes |
This list is often overlooked or missed during decision making. Total cost of engine shift (On-Premises or anywhere else to Azure) comprises of 2 aspects : direct cost benefits & cost for refactoring efforts. The general thumb rule here is if the cost of refactoring is greater than cost benefits you gain from choosing a particular service (with no actual current use case of additional features offered by the new engine), don't change the engine.
C) Purchasing Models
A Synapse SQL pool represents a collection of analytic resources that are being provisioned. Compute is separate from storage, which enables you to scale compute independently of the data in your system. Synapse SQL uses Azure Storage to keep your user data safe. Since your data is stored and managed by Azure Storage, there's a separate charge for your storage consumption.
The Compute nodes provide the computational power.
Serverless
Serverless SQL pool is serverless, hence there's no infrastructure to setup or clusters to maintain. A default endpoint for this service is provided within every Azure Synapse workspace, so you can start querying data as soon as the workspace is created. Its essentially a logical data warehouse that provides a relational abstraction on top of raw or disparate data.
Explicitly you don't have relocate and transform data before consuming allowing you to have an up-to-date view of your data as it is stored in workspace. Thanks to built-in query execution fault-tolerance, the system provides high reliability and success rates even for long-running queries involving large data sets.
Data Loading
Users have the following options to access storage:
Workloads
Today the only supported output types are Parquet and CSV as of when this article is published.
The data in the external table cannot be modified.
Pricing methodology
Its a “Pay-per-query” model or in other words you only pay for the amount of data that is processed. Serverless SQL pool will not incur charges until you run queries.
Dedicated
Dedicated SQL pool (formerly SQL DW) represents a collection of analytic resources that are provisioned when using Synapse SQL. Analytic resources are defined as a combination of CPU, memory, and IO. These three resources are bundled into units of compute scale called Data Warehouse Units (DWUs). The size of a dedicated SQL pool (formerly SQL DW) is determined by Data Warehousing Units (DWU). Dedicated SQL pool (formerly SQL DW) stores data in relational tables with columnar storage as compared to traditional T-SQL tables which store data in row storage.
- Row Store is when data that's logically organized as a table with rows and columns, and physically stored in a row-wise data format.
All of the attributes of a particular row are stored together in the same physical block of data.
Row store refers to a table where the underlying data storage format is a heap, a clustered index, or a memory-optimized table.
- Column Store is when data is data that's logically organized as a table with rows and columns, and physically stored in a column-wise data format.
A clustered column store index is the physical storage for the entire table.
You can create nonclustered B-tree or rowstore indexes as secondary indexes on a clustered columnstore index.
Data Loading
Dedicated SQL pool supports loading and exporting data through many loading methods. These include non-PolyBase options (BCP and SQLBulkCopy API), and PolyBase options. PolyBase is the fastest and most scalable Data Warehouse loading method to date, hence its recommend as your default loading mechanism.
To load data with PolyBase, you can use any of these loading options:
An external table is similar to a database view. Defining external tables involves specifying the data source, the format of the text files, and the table definitions. What follows are the T-SQL syntax topics that you'll need:
BCP loads directly to dedicated SQL pool without going through Azure Blob storage, and is intended only for small loads.
Workloads
Pricing methodology
Its a “Pay-by -DWU” model or in other words you're charged for it based on the number of DWU blocks and hours running. You can control costs for a dedicated SQL pool by pausing the resource when it is not is use.
Basic Purchasing Guidelines for both the Models
When you create or use Azure Synapse Analytics resources, you might get charged for the following meters:
After you delete Azure Synapse resources, the attached Data Lake Storage Gen2 might continue to exist. They continue to accrue costs until you delete them. Its always better to visit Azure Pricing Calculator and Azure Synapse Plan and Manage Cost documentation before arriving at a decision
D) Licensing
Azure Synapse brings together the best of
Synapse SQL uses a node-based architecture where you pay for computer and there is a separate charge for your storage consumption. There is no licensing cost if you use one or all of the above services listed.
In the next article Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 4 , we will cover the following aspects
E) Backup and Restore
F) High Availability
Followed by Azure SQL v/s Azure Synapse SQL (Dedicated Pool, Serverless Pool) - Part 5 that discusses the last key factor consideration for this series for G) Performance Tuning
Good Reads:
References:
Don't forget to share a if this helps
Credit: Thanks Ben Harding for review and guidance
FastTrack for Azure: Move to Azure efficiently with customized guidance from Azure engineering.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.