Synapse feature limits - review
Published Sep 14 2020

The other day I got a case about Synapse feature limits. The customer was not sure about the information found in the documentation.

So the idea here is a quick review of that documentation.

 

Spark Limitations:

 

When you create a Spark pool, you can define how many resources the pool will have, as shown here: https://docs.microsoft.com/en-us/azure/synapse-analytics/quickstart-create-apache-spark-pool-portal

You may also want to define a cap for the session that you are executing, and here is an example of that: https://techcommunity.microsoft.com/t5/azure-synapse-analytics/livy-is-dead-and-some-logs-to-help/ba...

In other words, you may have one notebook taking over all your resources, or you may have five notebooks running at the same time. Looked at this way, there is no fixed limit as you would see in ADW: Spark is a pool of resources.
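
To make that concrete, here is a minimal sketch of how a session-level cap is usually requested from a Synapse PySpark notebook. The %%configure magic is shown as a comment because its body is JSON rather than Python, and the numbers are placeholders, not recommendations.

```python
# The %%configure magic normally goes in its own cell *before* the session
# starts; its JSON body requests the session-level cap, for example:
#
#   %%configure -f
#   { "numExecutors": 2, "executorCores": 4, "executorMemory": "8g" }
#
from pyspark.sql import SparkSession

# In a Synapse notebook the `spark` session already exists; getOrCreate()
# simply returns it (and also lets this snippet run on a local Spark install).
spark = SparkSession.builder.getOrCreate()

# Inspect what the session actually received.
for key in ("spark.executor.instances",
            "spark.executor.memory",
            "spark.dynamicAllocation.maxExecutors"):
    print(key, "=", spark.conf.get(key, "not set"))
```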

 

To get a better idea of the concepts, this one helps: https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-concepts

 

"It is the definition of a Spark pool that, when instantiated, is used to create a Spark instance that processes data. When a Spark pool is created, it exists only as metadata; no resources are consumed, running, or charged for. A Spark pool has a series of properties that control the characteristics of a Spark instance; these characteristics include but are not limited to name, size, scaling behavior, time to live."

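Just to ground the properties the quote mentions (name, size, scaling behavior, time to live), here is a rough sketch as a plain Python dict. No API call is made and all values are made up; the property names only mirror roughly what the portal and ARM template ask for when you create a Spark pool.

```python
# Purely illustrative: nothing is created or billed by building this dict.
spark_pool_metadata = {
    "name": "sparkpool01",            # hypothetical pool name
    "nodeSize": "Medium",             # size of each node
    "autoScale": {                    # scaling behavior
        "enabled": True,
        "minNodeCount": 3,
        "maxNodeCount": 10,
    },
    "autoPause": {                    # the "time to live" behavior
        "enabled": True,
        "delayInMinutes": 15,
    },
}
```
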
Thanks to my colleague Charl Roux for the discussion about Spark.

 

ADF Pipeline:

 

This one is clear in the documentation:

https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities

 

concurrency: The maximum number of concurrent runs the pipeline can have. By default, there is no maximum. If the concurrency limit is reached, additional pipeline runs are queued until earlier ones complete.
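
As a rough sketch (written as a Python dict for readability, with a made-up pipeline name), the cap lives as a top-level "concurrency" property in the pipeline definition, next to the activities:

```python
# Hedged sketch of a pipeline definition; in practice this is the JSON you
# author in the Synapse/ADF UI or an ARM template, not Python.
pipeline_definition = {
    "name": "CopySalesData",      # hypothetical pipeline name
    "properties": {
        "concurrency": 2,         # at most 2 concurrent runs; extra runs queue
        "activities": [
            # ... copy/notebook/stored-procedure activities go here ...
        ],
    },
}
```
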

About SQL on-demand (SQL OD):

 

(https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview)

"SQL on-demand is serverless, hence there is no infrastructure to setup or clusters to maintain. A default endpoint for this service is provided within every Azure Synapse workspace, so you can start querying data as soon as the workspace is created. There is no charge for resources reserved, you are only being charged for the data scanned by queries you run, hence this model is a true pay-per-use model."

 

So basically, here we have another scenario where there is no documented limit because there is no fixed limit. You could have 300 small queries running, or you could have one query running alone and using all the resources while the others wait. SQL on-demand has a Control node that uses a Distributed Query Processing (DQP) engine to optimize and orchestrate the distributed execution of a user query by splitting it into smaller queries that are executed on Compute nodes. In SQL on-demand, each Compute node is assigned a task and a set of files to execute the task on. A task is the distributed query execution unit, which is actually part of the query the user submitted. Automatic scaling is in effect to make sure enough Compute nodes are utilized to execute user queries.
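
For example, here is a minimal sketch of hitting the on-demand endpoint from Python with pyodbc. The workspace, login, storage account and path are placeholders, and it assumes the ODBC Driver 17 for SQL Server is installed locally; the point is that the query is billed for the data it scans, not for any reserved capacity.

```python
import pyodbc

# Placeholders: replace the workspace, user, storage account and path with
# your own. The "-ondemand" endpoint is the SQL on-demand (serverless) one.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"   # hypothetical workspace
    "Database=master;"                                    # or a user database
    "Authentication=ActiveDirectoryInteractive;"
    "UID=user@contoso.com;"
)

# OPENROWSET reads the files straight from storage; you pay per data scanned.
sql = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorage.dfs.core.windows.net/mycontainer/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS [result];
"""

for row in conn.cursor().execute(sql):
    print(row)
```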

 

As for ADW, the limits are pretty clear in the documentation and are tied to the service levels:

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/memory-concurrency-limit...

 

That is it!

Liliam C Leme

UK Engineer
