Motivation:
Choosing the right data platform for your needs is one of the main decisions in database architecture. In my lengthy career as a data architect and data engineer, I have seen many cases where the wrong choice was made.
Most articles try to promote some platform; it is much harder to find articles that warn you away from excellent technologies that are simply not a good fit for your needs, your budget, or both.
This article might be followed by other articles on other data platforms.
The goal is to help you understand when Synapse's distributed tables (in dedicated SQL pools) are not a desirable choice, by giving you quick guidance rules first, followed by deeper explanations and pointers.
Takeaway (truly short):
- Synapse distributed tables should be used when you have more than 60 million rows per archive period (about one full compressed rowgroup of 1,048,576 rows for each of the 60 distributions),
AND
- You are loading data in batches of roughly 6 million rows or more (at least 102,400 rows per distribution, i.e., 60 × 102,400 = 6,144,000 rows per batch).
Takeaway (archive and performance):
- If you need to archive data – each partition should have at least 60 million rows.
For example, if you archive data monthly, you want a monthly partition, which means 60 million rows per month. If you have only 5 million rows per month, you might think of using a yearly partition to reach that threshold, but this would be wrong: a yearly partition can only be switched out a whole year at a time, which breaks the monthly archiving scheme.
- Load performance and query performance – while you are loading new data, your new rows might end up in the delta rowgroup (see below).
With Synapse this will happen until a batch reaches 6,144,000 rows (102,400 rows for each of the 60 distributions).
This will cause performance issues, since the (60 or more) delta rowgroups are uncompressed row-store structures rather than compressed columnstore. You can minimize the impact by rebuilding the index, as sketched below.
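As a minimal sketch (assuming a dedicated SQL pool; dbo.FactSales is a hypothetical table name), you can check how many rows are still sitting in uncompressed delta rowgroups and then rebuild the index to compress them:

```sql
-- How many rows are still in OPEN (uncompressed) delta rowgroups,
-- across all distributions?
SELECT state_desc,
       COUNT(*)        AS rowgroups,
       SUM(total_rows) AS total_rows
FROM sys.dm_pdw_nodes_db_column_store_row_group_physical_stats
GROUP BY state_desc;

-- Rebuild to force trailing delta rowgroups into compressed
-- columnstore segments (hypothetical table name).
ALTER INDEX ALL ON dbo.FactSales REBUILD;
```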
Alternatives:
- If your table is not too big – you can use a replicated table. Replicated tables work best when the table size is less than 2 GB compressed. 2 GB is not a hard limit; if the data is static and does not change, you can replicate larger tables. (A minimal replicated-table definition is sketched in the replicated tables section below.)
- If your table is big but your loads are not – you can use a replicated staging table for the recent (e.g., daily) load and periodically merge the staging table into the big distributed table.
You should expose a VIEW that UNIONs the replicated and the distributed tables (if no duplicates can occur, you can use UNION ALL, which saves an expensive sort); see the sketch after this list.
- A Synapse SQL pool is not always your best solution: you might use Azure SQL Database, Azure SQL Managed Instance, or Azure SQL Database Hyperscale instead. Those solutions can be standalone or integrated into a Synapse workspace environment.
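A minimal sketch of the staging-table-plus-view pattern, assuming a dedicated SQL pool (all object and column names are hypothetical):

```sql
-- Small replicated staging table for the recent load.
CREATE TABLE dbo.FactSales_Staging
(
    SaleId     BIGINT        NOT NULL,
    SaleDate   DATE          NOT NULL,
    CustomerId INT           NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, HEAP);

-- One logical table for consumers: staged rows plus history.
-- UNION ALL is safe only if a row can never exist in both tables.
CREATE VIEW dbo.vFactSales AS
SELECT SaleId, SaleDate, CustomerId, Amount FROM dbo.FactSales_Staging
UNION ALL
SELECT SaleId, SaleDate, CustomerId, Amount FROM dbo.FactSales;

-- Periodically fold the staged rows into the big distributed table:
INSERT INTO dbo.FactSales
SELECT SaleId, SaleDate, CustomerId, Amount FROM dbo.FactSales_Staging;
TRUNCATE TABLE dbo.FactSales_Staging;
```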
Additional reminder/limitations – columnar in general:
These takeaways are specific to Synapse distributed tables, which are based on columnstore indexes. Columnstore indexes are not suitable for workloads with a lot of updates, since an update is executed as a delete plus an insert, and the affected rowgroups eventually have to be rebuilt.
Further explanations:
If you are reading this section, I assume you want to understand more about the technical justification for the recommendations above. I have collected the relevant recommendations from the various documents, with links that will help you learn more.
Synapse Distributed Table Structure:
- Each distributed table (as opposed to a replicated table) has by definition 60 distributions and, by default, a clustered columnstore index. This is before partitioning!
- Any partitioning added to a table is in addition to the distributions created behind the scenes. If a sales fact table contains 36 monthly partitions, then, given that a dedicated SQL pool has 60 distributions, the sales fact table should contain 60 million rows per month, or 2.1 billion rows when all months are populated. If a table contains fewer than the recommended minimum number of rows per partition, consider using fewer partitions in order to increase the number of rows per partition (see here). A DDL sketch follows this list.
- Each distribution has one or more delta rowgroups:
- A delta rowgroup is a clustered B-tree index that is used only with columnstore indexes. It improves columnstore compression and performance by storing rows until the number of rows reaches a threshold (1,048,576 rows), at which point they are moved into the columnstore (see here).
- During a large bulk load, most of the rows go directly to the columnstore without passing through the deltastore. Some rows at the end of the bulk load might be too few to meet the minimum size of a rowgroup, which is 102,400 rows. As a result, the final rows go to the deltastore instead of the columnstore. For small bulk loads with less than 102,400 rows, all of the rows go directly to the deltastore (see here).
- The above is true for both hash-distributed and round-robin distributed tables.
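To make the structure concrete, here is a minimal DDL sketch of a hash-distributed, monthly-partitioned fact table (all names and partition boundaries are hypothetical):

```sql
-- Hypothetical fact table: hash-distributed, clustered columnstore,
-- partitioned by month. Each monthly partition is further split across
-- the 60 distributions, so each month needs ~60M rows (about 1M rows,
-- one full compressed rowgroup, per distribution) to compress well.
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT        NOT NULL,
    SaleDate   DATE          NOT NULL,
    CustomerId INT           NOT NULL,
    Amount     DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(SaleId),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES
        ('2023-01-01', '2023-02-01', '2023-03-01'))  -- one boundary per month
);
```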
Replicated tables:
Replicated table functionality is well explained here. I will highlight important limitations that you should weigh before choosing replicated tables; avoid them when:
- The size of the table is above 2 GB or so.
- The table has frequent insert, update, and delete operations.
- The SQL pool is scaled frequently.
- The table has many columns, but data operations typically access only a small number of columns.
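If none of those limitations apply (a small, mostly static, read-heavy table), a minimal replicated-table definition might look like this (hypothetical dimension table; the clustered index instead of a columnstore is my choice here, since small tables gain little from columnstore compression):

```sql
-- Hypothetical small dimension table, well under the ~2 GB guideline
-- and rarely updated: a reasonable fit for DISTRIBUTION = REPLICATE.
CREATE TABLE dbo.DimCustomer
(
    CustomerId INT           NOT NULL,
    Name       NVARCHAR(200) NOT NULL,
    Country    NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (CustomerId));
```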
Further reading:
Now, some links to help you master Synapse’s data modelling:
- Partitioning tables in dedicated SQL pool - Azure Synapse Analytics | Microsoft Docs
- Design guidance for replicated tables - Azure Synapse Analytics | Microsoft Docs
- Create and update statistics on tables - Azure Synapse Analytics | Microsoft Docs
- Best practices for dedicated SQL pools - Azure Synapse Analytics | Microsoft Docs
- Temporary tables - Azure Synapse Analytics | Microsoft Docs
- Memory optimizations for columnstore compression - Azure Synapse Analytics | Microsoft Docs (https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-memory-optimizations-for-columnstore-compression)
- Indexing tables (optimizing clustered columnstore indexes) - Azure Synapse Analytics | Microsoft Docs (https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-index#optimizing-clustered-columnstore-indexes)