Modern Data Warehouse Stories with Synapse
Published Jul 06 2023

Fabric is extremely promising, but Synapse and Dedicated SQL Pools should be around for a while. I have worked on dozens of Synapse implementations. Some were great, others not so much, and I would like to share some of my learnings.

Modern Data Warehouse Stories with Synapse SQL Pools:

  1. A great story with an airline.
  2. A borderline story with a medical research company.
  3. A mistake with an insurance company.
  4. A bonus Data Lake story with a surprisingly happy ending.

 

Objective:

  • Look at the characteristics of each story.
  • Glance at Synapse internals to help understand the results.
  • Review lessons learned to improve success.
  • Develop a mindset to reason:

To Delta or not to Delta, that is the question:
Whether ’tis nobler in SQL to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a blob of troubles.

 

The Joy

Characteristics:

  • SQL Workload, analytical, not leveraging column stores
  • Airline (reservation, ticket, luggage)
    • Strategic Denormalization: add reservation_id to all large tables
    • Common hash distribution
    • Use reservation_id on joins -> minimal data movement!
    • Ordered Column Stores -> fewer overlaps -> faster joins/merges (a sketch follows after this list)
  • Few dozen tables, most small, a few large
  • Workload was mostly append-only, rare deletes, some updates and lots of inserts
    • Reduced column_store compression issues
    • Reduced overhead with the delta_store and delete_bitmap
    • Improved use of adaptive cache
  • Use of 4 compute nodes (2000 DWUs)
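
A minimal sketch of that table design in a dedicated SQL pool (table and column names below are hypothetical; the real model had a few dozen tables):

          -- Large fact hash-distributed on the shared reservation_id, with an ordered
          -- clustered columnstore so segments overlap less and joins/merges scan fewer rows
          CREATE TABLE dbo.TicketFact
          (
              reservation_id BIGINT        NOT NULL,
              ticket_id      BIGINT        NOT NULL,
              flight_date    DATE          NOT NULL,
              fare_amount    DECIMAL(10,2) NOT NULL
          )
          WITH
          (
              DISTRIBUTION = HASH(reservation_id),
              CLUSTERED COLUMNSTORE INDEX ORDER (reservation_id)
          );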

Results:

  • Jobs ran up to 250x faster, 100x faster on average
  • Very happy customer pushing Synapse to different use cases
  • Low latency with multiple loads per hour

MarceloGSilva_0-1688660819399.png

 

The Hustle

Characteristics:

  • SQL Workload
  • Medical Research
  • Hundreds of small tables and some medium
  • Normalized schema + complex queries (nested views, functions, correlations, key-value pairs) -> lots of data movement!
  • Inefficient schema
    • Nvarchar keys, thousands of nvarchar(4000)/nvarchar(max) columns
    • Lots of redundant indexes (starting with the same columns)
    • Same column on different tables with different types
  • Single compute node (<=500 DWUs)
    • All the pains of data distribution
    • No gain from multi-node parallelism

Results:

  • Imports and reports were slower than on SQL Server
  • Complaints
  • Feeling like we were not using the right tool for the job
  • A well-tuned SQL Server would probably have been a better fit

MarceloGSilva_1-1688660819432.png

 

The Struggle

Characteristics:

  • Large SQL Cluster
  • Insurance domain
  • 100+ DBs, 100+ TBs
  • Mixed OLAP + OLTP workloads
  • Thousands of incompatibilities: cross-server/database calls, constraints, triggers, filtered/included/unique indexes, XML/geography/hierarchy columns.

Plan: Lift and shift ALL into Synapse

Results:

  • Yearlong procurement
  • Multiple partner evaluations and legal work until the contract was signed
  • Months into development before FTA got involved
  • REALIZATION: THE PLAN IS NOT GONNA WORK

 

MarceloGSilva_2-1688660819481.png

 

Dedicated Pool Internals

MarceloGSilva_0-1688665117969.png

MarceloGSilva_1-1688665135542.png

MarceloGSilva_2-1688665144702.png

MarceloGSilva_4-1688665203699.png

 

MarceloGSilva_3-1688665156749.png

MarceloGSilva_5-1688665303278.png

 

Tips to make another dedicated SQL pool success story:

1) Decide when to use a clustered index vs column store.

If the access pattern is in the form of:

          SELECT * FROM tbl WHERE col = val

          or

          SELECT * FROM tbl WHERE col BETWEEN A AND B

then use a clustered index.

If the table has 6 million rows or fewer, also make it a clustered index; small tables are not good candidates for columnstore compression.

If the table is large (>60 million rows) and the access pattern is aggregates over a subset of columns, then make it a column store.

For tables ranging between 6 and 60 million rows make the decision based on the access pattern.
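
As a rough sketch of the two choices in a dedicated SQL pool (table and column names are hypothetical):

          -- Lookup-style table: point and range lookups favor a clustered index
          CREATE TABLE dbo.ReservationStatus
          (
              reservation_id BIGINT      NOT NULL,
              status         VARCHAR(20) NOT NULL,
              updated_date   DATE        NOT NULL
          )
          WITH
          (
              DISTRIBUTION = HASH(reservation_id),
              CLUSTERED INDEX (reservation_id)
          );

          -- Large fact queried with aggregates over a few columns: columnstore
          CREATE TABLE dbo.BaggageEvent
          (
              reservation_id BIGINT       NOT NULL,
              bag_tag        BIGINT       NOT NULL,
              event_date     DATE         NOT NULL,
              weight_kg      DECIMAL(8,2) NOT NULL
          )
          WITH
          (
              DISTRIBUTION = HASH(reservation_id),
              CLUSTERED COLUMNSTORE INDEX
          );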


2) Decide the distribution strategy for each table.

  • Replicate small tables that don't get updated very often (once an hour is OK, once a minute is not), and make sure to run "select top 1 from tbl" after ETL to kick off replication.
  • Classify large tables into functional groups based on the columns they have in common.
    Look for columns that are repeated across these tables with the same type, are mostly not null, and are not updated. Hash distribute on the common columns and use them in joins (a sketch follows below).
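
A minimal sketch of both points (names are hypothetical; the hash-distributed fact tables come from the earlier sketches):

          -- Small dimension that rarely changes: replicate it so joins avoid data movement
          CREATE TABLE dbo.Airport
          (
              airport_code VARCHAR(3)   NOT NULL,
              airport_name VARCHAR(100) NOT NULL
          )
          WITH
          (
              DISTRIBUTION = REPLICATE,
              CLUSTERED INDEX (airport_code)
          );

          -- After each load, touch the table so rebuilding of the replicated copies kicks off
          SELECT TOP 1 airport_code FROM dbo.Airport;

          -- Tables that share reservation_id and are hash distributed on it join without a shuffle
          SELECT t.flight_date, COUNT(*) AS bags
          FROM dbo.TicketFact   AS t
          JOIN dbo.BaggageEvent AS b
              ON b.reservation_id = t.reservation_id
          GROUP BY t.flight_date;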

3) Work on data type improvements. Convert nvarchar to varchar when you don't need special characters such as Chinese and Korean. For join and hash columns, give preference to fixed-length types like INT, BIGINT, and NUMERIC. Remember that string operations have collation overhead and that any (max) type prevents using a column store.
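
As an illustration (hypothetical table, assuming the text never needs Unicode and fits within 8000 characters):

          -- Fixed-length key for joins/hashing; bounded varchar keeps columnstore eligibility
          CREATE TABLE dbo.Customer
          (
              customer_key  BIGINT        NOT NULL,  -- cheap to hash and join on
              customer_name VARCHAR(200)  NOT NULL,  -- varchar is enough without CJK characters
              notes         VARCHAR(8000) NULL       -- bounded length instead of nvarchar(max)
          )
          WITH
          (
              DISTRIBUTION = HASH(customer_key),
              CLUSTERED COLUMNSTORE INDEX
          );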

 

4) Avoid small clusters for production environments; the true power of dedicated pools starts with 2 nodes at 1000 DWUs. If your budget only allows for 1 node, consider SQL Hyperscale (100 TB) or Managed Instance (16 TB).

The more you scale, the more concurrent operations you can run and the more resources large queries will have. Scaling up does not help compression/fragmentation problems caused by small inserts, and it does not help when queries have a lot of data movement.

Use resource classes correctly: avoid largerc/xlargerc for jobs that run frequently, and avoid smallrc for jobs that load large amounts of data. When you don't specify one, smallrc is used by default.
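
Resource classes are assigned by adding the user to the matching database role, for example (user name here is hypothetical):

          -- Give the nightly load user more memory per query (at the cost of concurrency slots)
          EXEC sp_addrolemember 'largerc', 'nightly_load_user';

          -- Remove the assignment to fall back to the default smallrc
          EXEC sp_droprolemember 'largerc', 'nightly_load_user';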

 

 

Bonus Data Lake Twist

Characteristics:

  • Energy company
  • Multiple sources (SAP, IoT, Oracle)

POC 1:

  • Classic Data Lake
  • ADLS -> Serverless SQL Pool -> Power BI
  • Built 2 ELT stacks with Synapse Pipelines and Spark
  • Slight preference for Spark with daily imports (easier to write code than create data flows).
  • Increased import frequency to 15 minutes
    • Explosion in compute and storage cost
    • Jobs were taking almost 15 minutes for a few tables; concerns about adding more

POC 2:

  • Use SQL MI as the data lake storage (an architecture pattern should not depend on a specific technology)
  • 99% of the data is already structured and not big (a few TBs)
  • Leverage TSQL knowledge and existing code
  • Column Store Indexes for large tables
  • Non-Clustered Indexes to support various access patterns
  • Temporal tables for time travel, rowversion for change capture (Raw/Silver/Gold); a sketch follows after this list
  • Improved data quality by enforcing constraints (no duplicate or orphan rows)
    MarceloGSilva_7-1688665574429.png
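
A minimal sketch of this pattern on SQL MI (schema and column names are hypothetical): an enforced primary key prevents duplicates, and system versioning provides time travel.

          CREATE TABLE silver.MeterReading
          (
              MeterId     BIGINT        NOT NULL,
              ReadingTime DATETIME2     NOT NULL,
              ReadingKwh  DECIMAL(18,4) NOT NULL,
              ValidFrom   DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
              ValidTo     DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
              PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo),
              CONSTRAINT PK_MeterReading PRIMARY KEY (MeterId, ReadingTime)  -- no duplicate rows
          )
          WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = silver.MeterReadingHistory));

          -- Time travel: the table as it looked at a point in time
          SELECT *
          FROM silver.MeterReading
          FOR SYSTEM_TIME AS OF '2023-07-01T00:00:00';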

Result:

  • Faster development and code execution
  • Lower cost and lower execution time
  • Lower Power BI latency with DirectQuery, as well as capacity savings
  • Enhanced Power BI capabilities with RLS and masking
  • Improved Purview experience
  • Azure Synapse Link for SQL on some tables to support large queries (no ETL, gold tables only)

MarceloGSilva_3-1688660819583.png

 

 

Delta Warning!

Increasing ELT frequency in a data lake presents challenges, especially with a lot of tables. Spark was not designed to handle many small operations concurrently. Even with high-concurrency clusters, the maximum capacity is 10 jobs in parallel. Jobs invoke tasks, which hold cores exclusively for the duration of the job. If you are familiar with CXPACKET waits in SQL, the equivalent is a lot more prevalent in Spark, leaving cores idle while they wait for other tasks to complete. In contrast, SQL can handle dozens of concurrent jobs with a few cores, provided the operations don't require much memory and don't block each other.
  
The storage of Delta tables may increase drastically with ETL frequency. Parquet, as a column store, is inherently immutable. When a merge happens over Delta, Spark will typically first scan the files, keep a list of which ones contain matching rows, and then rewrite those files. If a single bit changes, the entire Parquet file needs to be rewritten, with potentially millions of bytes and minutes of CPU cycles spent on compression. The JSON transaction log flags the new file as the most recent, leaving the old one for time travel. Older files are removed with VACUUM, which by default retains files for 7 days. In SQL, modifications are done in place and in 8 KB pages; despite the overhead of fragmentation and the transaction log, storage and CPU requirements are significantly lower for small operations.

 

The Spark streaming framework was created for append-only workloads, like temperature sensors or web logs. Streaming will deliver data into bronze efficiently, but many real-life workloads need to process merges/updates in silver/gold, and those face challenges with Delta when dealing with a large number of small operations. In contrast, SQL offers rich change tracking capabilities (triggers, CDC, rowversion, replication) to facilitate development and improve the efficiency of incremental updates.
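
A minimal sketch of rowversion-based change capture in T-SQL (table and watermark names are hypothetical), pulling only the rows that changed since the last run instead of merging a full snapshot:

          -- Read the watermark stored by the previous run and capture the current one
          DECLARE @LastVersion    BINARY(8) =
              (SELECT LastRowVersion FROM etl.Watermark WHERE TableName = 'raw.Orders');
          DECLARE @CurrentVersion BINARY(8) = @@DBTS;  -- highest rowversion used in this database

          -- Only rows inserted or modified since the last run (RowVer is a rowversion column)
          SELECT o.OrderID, o.CustomerID, o.Amount
          FROM raw.Orders AS o
          WHERE o.RowVer > @LastVersion
            AND o.RowVer <= @CurrentVersion;

          -- After a successful load, advance the watermark
          UPDATE etl.Watermark
          SET LastRowVersion = @CurrentVersion
          WHERE TableName = 'raw.Orders';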

 

With Delta we have two good opportunities to make operations efficient. The first and most important is partition pruning, which narrows down which folders will be processed, but it requires a filter on the partition key. Partitioning is typically done on one column or a small set of columns, and filtering on them is not always possible. For instance, if you partition an Orders table by SaleDate and your ETL does a merge on OrderID, you will likely scan the entire table. The second opportunity relies on Z-ordering, which co-locates related data in the same Parquet files so unneeded files can be skipped. With SQL we can leverage non-clustered indexes to support a multitude of access patterns on the same table. These can be very useful for small incremental batches, especially when moving data across the raw/silver/gold layers.
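
A minimal T-SQL sketch of that contrast (hypothetical staging and silver tables): the non-clustered index on the merge key lets a small batch seek directly to matching rows, even though the table is organized by SaleDate.

          CREATE NONCLUSTERED INDEX IX_Orders_OrderID
              ON silver.Orders (OrderID)
              INCLUDE (Amount, Status);

          MERGE silver.Orders AS tgt
          USING staging.Orders AS src
              ON tgt.OrderID = src.OrderID          -- supported by the index above
          WHEN MATCHED THEN
              UPDATE SET tgt.Amount = src.Amount, tgt.Status = src.Status
          WHEN NOT MATCHED THEN
              INSERT (OrderID, SaleDate, Amount, Status)
              VALUES (src.OrderID, src.SaleDate, src.Amount, src.Status);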

Distributed database engines are prone to data movement operations, which happen when nodes need data from other nodes. These operations can increase execution time drastically and are common in ETL, for example surrogate key conversions, joins, and aggregates. In Synapse Dedicated Pools you can help minimize data movement by planning the hash distribution. In Spark there is no such feature, so data movement is more likely. In single-node SQL there is no such problem, because all the data is attached to one node.
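
In a dedicated SQL pool you can check a query for data movement before running it with EXPLAIN (the tables below are hypothetical; if they are not distribution-aligned on the join column, the returned plan XML will contain a move step such as SHUFFLE_MOVE or BROADCAST_MOVE):

          EXPLAIN
          SELECT o.customer_id, COUNT(*) AS orders
          FROM dbo.Orders    AS o
          JOIN dbo.Customers AS c
              ON c.customer_id = o.customer_id
          GROUP BY o.customer_id;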

 

Remember the differences between row and column stores, how they work under the hood, what optimizations they can leverage, and which workloads they are better suited for. When we talk about high-velocity big data needs, we mean a lot of rows in a few tables, and what makes Spark efficient for those needs also makes it inefficient at handling many small batches. Delta uses Parquet, which is a column store. In SQL you can decide when to use row vs column stores.

 

Data lake is an architecture pattern, and as such it should be independent of technology. ADLS has many advantages, especially for dealing with unstructured data and large volumes. Spark was created for big data problems (high Velocity, Volume, Variety). To handle the high Vs well, its designers made tradeoffs that are great for big data, but not for small data. Spark is not the answer to every need, and neither is SQL. It is important to develop awareness of when to use each. In some cases SQL may be a better fit, and once the data is ready for reporting you can copy it to Dedicated SQL Pools (Azure Synapse Link for SQL) or ADLS to handle large queries or ML.
