Fabric is extremely promising, but Synapse and Dedicated SQL Pools should be around for a while. I have worked on dozens of Synapse implementations. Some were great, others not so much, and I would like to share some of my learnings.
Modern Data Warehouse Stories with Synapse SQL Pools:
Objective:
To Delta or not to Delta, that is the question:
Whether ’tis nobler in SQL to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a blob of troubles.
The Joy
Characteristics:
Results:
The Hustle
Characteristics:
Results:
The Struggle
Characteristics:
Plan: Lift and shift ALL into Synapse
Results:
Dedicated Pool Internals
Tips to make your dedicated SQL pool another success story:
1) Decide when to use a clustered index vs column store.
If the access pattern is in the form of:
Select * from tbl where col = val
Or
Select * from tbl where col between A and B
Then use a clustered index.
If the table has 6 million rows or less, also make it a clustered index; small tables do not benefit from columnstore compression because their rows are spread thinly across the 60 distributions and the rowgroups stay too small.
If the table is large (>60 million rows) and the access pattern is aggregates over a subset of columns, then make it a column store.
For tables ranging between 6 and 60 million rows, make the decision based on the access pattern (the DDL sketch after these tips shows the syntax for both options).
2) Decide the distribution strategy for each table: hash, round robin, or replicate (the DDL sketch below includes both a hash-distributed and a replicated table).
3) Work on data type improvements. Convert NVARCHAR to VARCHAR when you don’t need special characters such as Chinese and Korean. For join and hash columns give preference to fixed-length types like INT, BIGINT, NUMERIC. Remember that string operations have collation overhead and that any (max) type prevents using a column store.
4) Avoid small clusters for production environments; the true power of dedicated pools starts with 2 nodes at 1000 DWUs. If your budget only allows for 1 node, consider SQL Hyperscale (100 TB) or Managed Instance (16 TB).
The more you scale, the more concurrent operations you can run and the more resources large queries have available. Scaling up does not fix the compression/fragmentation problems caused by small inserts, and it does not help when queries involve a lot of data movement.
Use resource classes correctly: avoid largerc/xlargerc for jobs that run frequently, and avoid smallrc for jobs that load large amounts of data. When you don’t specify a resource class, smallrc is used by default (a snippet for assigning resource classes follows the DDL sketch below).
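To tie tips 1-3 together, here is a minimal DDL sketch for a dedicated SQL pool. The tables dbo.FactSales and dbo.DimCustomer and their columns are invented for illustration; the index, distribution, and data type choices follow the guidance above.

-- Large fact table (>60 million rows) with an aggregate-heavy access pattern:
-- clustered columnstore for compression, hash distributed on the main join key.
CREATE TABLE dbo.FactSales
(
    SaleID     BIGINT        NOT NULL,
    CustomerID INT           NOT NULL,      -- fixed-length join/hash column
    SaleDate   DATE          NOT NULL,
    Amount     NUMERIC(18,2) NOT NULL,
    Notes      VARCHAR(500)  NULL           -- VARCHAR and no (max), so the column store remains possible
)
WITH
(
    DISTRIBUTION = HASH(CustomerID),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small lookup table (well under 6 million rows) queried by key:
-- clustered index, replicated to every compute node to avoid data movement on joins.
CREATE TABLE dbo.DimCustomer
(
    CustomerID   INT          NOT NULL,
    CustomerName VARCHAR(200) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (CustomerID)
);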
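And for tip 4, resource classes are assigned by adding a database user to the corresponding role. LoaderUser is a hypothetical name; the roles themselves (smallrc, largerc, xlargerc, and so on) are the built-in dedicated pool resource classes.

-- Give the infrequent, heavy nightly load more memory per query.
EXEC sp_addrolemember 'largerc', 'LoaderUser';

-- Frequent lightweight jobs stay on the default smallrc,
-- so they do not reserve concurrency slots they don't need.

-- Move a user back to the default when the big load is retired:
EXEC sp_droprolemember 'largerc', 'LoaderUser';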
Bonus Data Lake Twist
Characteristics:
POC 1:
POC 2:
Result:
Delta Warning!
Increasing ELT frequency in a data lake presents challenges, especially with a lot of tables. Spark was not designed to handle many small operations concurrently. Even with high-concurrency clusters the max capacity is 10 jobs in parallel. Jobs spawn tasks, which hold cores exclusively while they run. If you are familiar with CXPACKET waits in SQL, in Spark they are a lot more prevalent, leading to cores sitting idle while waiting for other tasks to complete. In contrast, SQL can handle dozens of concurrent jobs with a few cores, provided the operations don’t require much memory and don’t block each other.
The storage of delta tables may increase drastically with ETL frequency. Parquet, as a column store, is inherently immutable. When a merge happens over delta, Spark will typically first scan all files, keep a list of which ones have matching rows, then recreate those files. If a single bit changes, the entire parquet file needs to be rewritten, with potentially millions of bytes and minutes of CPU cycles spent on compression. The JSON log flags the new file as the most recent, leaving the old one for time travel. Older files are removed with VACUUM, which by default retains files for 7 days. In SQL, modifications are done in place and in 8 KB increments; despite the overhead of fragmentation and the transaction log, storage and CPU requirements are significantly lower for small operations.
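As a concrete sketch of that rewrite behavior, this is roughly what a small upsert and the later cleanup look like in Delta’s SQL syntax. The table silver.orders and the staging view updates are hypothetical; even if only a handful of rows match, every parquet file containing a matched row is rewritten in full.

-- Upsert a small batch; matched files are rewritten entirely, old ones stay for time travel.
MERGE INTO silver.orders AS t
USING updates AS s
    ON t.OrderID = s.OrderID
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Remove superseded files older than the retention window (the 7-day default shown explicitly here).
VACUUM silver.orders RETAIN 168 HOURS;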
The Spark Structured Streaming framework was created for append-only workloads, like temperature sensors or web logs. Streaming will deliver data into bronze efficiently, but many real-life workloads need to process merges/updates in silver/gold, and those face challenges with delta when dealing with a large number of small operations. In contrast, SQL offers rich change tracking capabilities (triggers, CDC, rowversion, replication) that facilitate development and improve the efficiency of incremental updates.
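For contrast, a minimal incremental-extract sketch using rowversion in SQL Server or Azure SQL (dbo.Orders and the watermark handling are assumptions for the example):

-- The engine bumps the rowversion value automatically on every insert/update.
ALTER TABLE dbo.Orders ADD RowVer ROWVERSION;

-- Pull only the rows changed since the last saved watermark.
DECLARE @LastWatermark BINARY(8) = 0x0000000000000000;  -- normally read from a control table
SELECT OrderID, CustomerID, Amount, RowVer
FROM dbo.Orders
WHERE RowVer > @LastWatermark;

-- Persist MAX(RowVer) from this batch as the watermark for the next run.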
With delta we have two good opportunities to make operations efficient. The first and most important is partition pruning, which narrows which folders will be processed, but requires a filter on the partition key. Partitioning is usually done on one column (or a few low-cardinality columns), and filtering on it is not always possible. For instance, if you partition an Orders table by SaleDate and your ETL does a merge on OrderID, you will likely scan the entire table. The second opportunity relies on Z-ordering, which co-locates related values within the parquet files. With SQL we can leverage nonclustered indexes to support a multitude of access patterns on the same table. These can be very useful for small incremental batches, especially when moving data across raw/silver/gold layers.
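Two hedged sketches of those options. The Delta part assumes the incoming batch only touches recent SaleDate partitions, so adding that constraint to the merge condition lets Spark prune folders, and it uses the OPTIMIZE ... ZORDER BY syntax available on Databricks and recent Delta releases. The T-SQL part shows the equivalent trick of adding a secondary access path with a nonclustered index; all object names are made up.

-- Delta: constrain the partition column so only recent folders are scanned.
MERGE INTO silver.orders AS t
USING updates AS s
    ON  t.OrderID = s.OrderID
    AND t.SaleDate >= date_sub(current_date(), 7)   -- assumption: the batch only carries the last week
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Co-locate rows by OrderID inside each file so fewer files are read per key.
OPTIMIZE silver.orders ZORDER BY (OrderID);

-- SQL: support the OrderID access pattern without changing the table's physical layout.
CREATE NONCLUSTERED INDEX IX_Orders_OrderID ON dbo.Orders (OrderID);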
Distributed database engines are prone to data movement operations, which happen when one node needs data held by another. These operations can increase execution time drastically and are common in ETL, for example surrogate key conversions, joins, and aggregates. In Synapse Dedicated Pools you can minimize data movement by planning hash distribution. In Spark there is no such feature, so data movement is more likely. In single-node SQL there is no such problem, because all data sits on one node.
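In a dedicated pool you can check a design for data movement before loading terabytes by prefixing the query with EXPLAIN, which returns the distributed plan as XML; shuffle and broadcast move steps in that plan are the ones to design away, usually by hash distributing both sides on the join key or replicating the small side. The query below reuses the hypothetical tables from the earlier sketch.

-- Returns the distributed plan; look for shuffle/broadcast move operations in the XML.
EXPLAIN
SELECT c.CustomerName, SUM(f.Amount) AS TotalAmount
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS c
    ON f.CustomerID = c.CustomerID
GROUP BY c.CustomerName;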
Remember the differences between row and column stores: how they work under the hood, what optimizations they can leverage, and which workloads they are better suited for. When we talk about high-velocity big data needs, we mean a lot of rows in a few tables, and what makes Spark efficient for those needs also makes it inefficient at handling many small batches. Delta uses parquet, which is a column store. In SQL you can decide when to use row vs column stores.
Data lake is an architecture pattern, and as such should be independent of technology. ADLS has many advantages, especially when dealing with unstructured data and large volumes. Spark was created for big data problems (high Velocity, Volume, Variety). To handle the high Vs well, the designers made tradeoffs that are great for big data but not for small data. Spark is not the answer to every need, and neither is SQL; it is important to develop awareness of when to use each. In some cases SQL may be a better fit, and once the data is ready for reporting you can copy it to Dedicated SQL Pools (SQL Link to Synapse) or ADLS to handle large queries or ML.