Azure Synapse Analytics March Update 2022
Welcome to the March 2022 Azure Synapse update! This month, we have SQL, Apache Spark for Synapse, Security, Data integration, and Notebook updates for you. Watch our monthly update video!
You can see the rest of our videos on the Azure Synapse Analytics YouTube channel. For all the latest updates and discussions, follow us on Azure Synapse Analytics (@Azure_Synapse) / Twitter.
Put your questions and feedback in the comments below!
We hear from you that it's difficult to debug errors in notebooks because exception messages are shown in the cell output but anything you write to stdout is not shown. This can be especially frustrating because writing to stdout is a popular technique for debugging. In the examples below, you can see anything written to stdout is not visible in the cell output.
Now in Synapse notebooks, both standard output and exception messages are shown when a code statement fails for Python and Scala languages. In the examples below, you can see that the word "test" written to stdout is visible in the cell output.
Developers often rely on printing standard output messages while performing a long operation. Previously, this output was only available to you once the code cell execution was complete. You could not see the output during the execution.
Now, you can see anything you write as the cell executes instead of waiting until it ends.
Notebooks are often run via the notebook activity in a pipeline. Sometimes, you need to tweak the session configuration when it is running in a pipeline. Previously, the configuration for those notebooks' Spark sessions would rely on default settings. Now, you can use pipeline parameters to configure the session with the notebook %%configure magic.
In the screenshot below, you can see there are 2 parameters defined for this notebook activity: driverCoresFromNotebookActivity and rows. The values of these parameters will be available to the notebook.
To use the values in the notebook, use the %%configure magic and use the parameter names from the activity (driverCoresFromNotebookActivity and rows). To handle the case when the parameters are not specified in the activity, you can provide default values.
For more detailed usage, read Parameterized session configuration from pipeline
Developers, as they are working with notebooks, often want to create a new notebook but then want to reuse the active session from their old notebook. Previously, the session could not be reused or transferred to the notebook, so you had to spend time spinning up a new session for your notebook. In this update, we have made it easy to reuse an active session conveniently without having to start a new one and to see and manage your active sessions in the “Active sessions” list.
To view your sessions, click on the 3 dots in the notebok and click "Manage sessions."
When you click on "Manage sessions", it will bring up the "Active sessions" list. In the sessions list, you can see all the sessions in the current workspace started by you from notebook and view the session information and the corresponding notebook that is currently attached to the session. You can operate Detach with notebook, Stop the session, and View in monitoring from here. You can also easily connect your selected notebook to an active session in the list started from another notebook. The session will be detached from the previous notebook (if it’s not idle) and attach to the current one.
Python developers often use the logging module, which is part of the Python Standard Library. The logging module provides very powerful diagnostic capabilities. It supports log levels, filters, and formatters.
Previously, Synapse did not capture any of the output from the logging module. But with this release, Synapse now captures anything written through the Python logging module, in addition to the driver logs.
The screenshot below shows how logs from the logging module are captured from the cell output. These logs are stored as part of the driver logs.
Column Level Encryption for Azure Synapse SQL is now Generally Available. Protect sensitive information in your Synapse SQL data warehouse using column-level encryption (CLE). CLE helps you implement fine-grained protection of sensitive data within a table. With CLE, you can use different protection keys for each column with each key having its own access permissions. The data in CLE-enforced columns are encrypted on disk and remain encrypted in memory until the DECRYPTBYKEY function is used to decrypt it.
This feature is for use on new and existing Azure SQL logical servers with Azure Synapse SQL pools, as well as the dedicated pools in Azure Synapse workspaces.
Note that SSDT tooling support for CLE for the dedicated SQL pools will be available in the 17.2 Preview 2 build of Visual Studio 2022.
To learn more about Column Level Encryption and see code samples, read Encrypt a Column of Data.
You can transform data in the lake using the serverless SQL pool CETAS (Create External Table as Select) statement. With recent performance improvement, you get better performance when transforming smaller data scales as well as better performance of subsequent SELECTs from the created external table.
Serverless SQL pools execute queries in a distributed manner, spreading workload on multiple nodes for faster execution. Each node executes a T-SQL fragment of a user-submitted query, processing a subset of data. These queries are regular T-SQL queries that might have a serial or parallel execution plan based on cost.
Until recently, the amount of data to be processed by a single node and query complexity could result in low cost and node executing the query using a serial execution plan. Serial execution plan on a node results in the node outputting a single file:
With recent improvements, a parallel execution plan is used, resulting in faster CETAS execution and outputting multiple files:
An extreme case was when the SELECT part of the CETAS statement was targeting only one file or a small amount of data in multiple files, meaning only one node was leveraged to execute the query using a serial execution plan, outputting a single file. Subsequent SELECTs from the created external table could not benefit from parallelism because only one file can be processed by one node:
Now, CETAS will create multiple files, enabling subsequent SELECTs to process the generated files in parallel for faster query execution:
Learn more by reading CETAS
The Synapse Spark Common Data Model (CDM) format reader/writer enables a Spark program to read and write CDM entities in a CDM folder via Spark dataframes.
The Synapse Spark CDM Connector is now Generally Available. At a high level, the following capabilities are supported:
For information on defining CDM documents using CDM 1.0, see documentation on the Common Data Model.
Learn more by reading Apache Spark CDM Connector.
The Azure Synapse Dedicated SQL Pool Connector for Apache Spark efficiently transfers large volume data sets between the Apache Spark runtime and the Dedicated SQL pool. It has been re-architected for improved performance. The new architecture eliminates redundant data movement and leverages COPY-INTO instead of PolyBase.
For authentication, you can choose to connect with Synapse Dedicated SQL Pool server endpoints via SQL basic authentication (username, password, and storage account key) or opt into the Azure Active Directory/AAD-based authentication method (when a storage account key is not required). Unlike CETAS (Create External Table as Select), COPY-INTO has specific and explicit authentication credential dependencies. With AAD, you don't have to provide one. It is inferred from the user's identity's scope.
Performance tests indicated at least ~5x improvement over the previous version. No action is required from the user to leverage these enhancements.
Learn more by reading Azure Synapse Dedicated SQL Pool Connector for Apache Spark.
The Synapse Spark Dedicated SQL Pool (DW) Connector now supports all four Spark Dataframe SaveMode choices: Append, Overwrite, ErrorIfExists, Ignore. Append and Overwrite are critical for managing data ingestion at scale. For more information on Spark SaveMode, read the Apache Spark Documentation.
When saving a dataframe (my_df) to a source:
To learn more, read Azure Synapse Dedicated SQL Pool Connector for Apache Spark.
Synapse now offers its own caching mechanism for spark pools. Intelligent cache works seamlessly behind the scenes to cache data to speed up the execution of Spark. Unlike Spark’s native caching feature, Intelligent Cache automatically stores each read within the allocated cache storage space, detecting underlying file changes and refreshing the files to provide the most recent data. This feature lowers the total cost of ownership by improving performance up to 65% on subsequent reads of the files that are stored in the available cache for Parquet files and 50% for CSV files.
When querying a file or table from your data lake, the Apache Spark engine in Synapse will make a call to the remote ADLS Gen2 storage to read the underlying files. With every query request to read the same data, the Spark engine must make a call to the remote ADLS Gen2 storage. This redundant process adds latency to your total processing time. Spark provides a caching feature that you must manually set the cache and release the cache to minimize the latency and improve overall performance. However, this can cause results to have stale data if the underlying data changes.
Intelligent Cache simplifies this process by automatically caching each read within the allocated cache storage space on each Spark node. Each request for a file will check to see if the file exists in the cache and compare the tag from the remote storage to determine if the file is stale. If the file doesn't exist or if the file is stale, then Spark will read the file and store it in the cache. When the cache becomes full, the file with the oldest last access time will be evicted from the cache to allow for more recent files.
Enable/Disable the cache for your Apache Spark pool
The cache size can be adjusted based on the percentage of total disk size available for each Apache Spark pool. By default, the cache is set to disabled. Enabling it is as easy as moving the slider bar from 0 (disabled) to the desired percentage for your cache size. A minimum of 20% of the available disk space is reserved for data shuffles. For shuffle-intensive workloads, you can minimize the cache size or disable the cache. It is recommended that you start with a 50% cache size and adjust as necessary. If your workload requires a lot of disk space on the local SSD for shuffle or RDD caching, consider reducing the cache size to lower the chance of failure due to insufficient storage. The actual size of the available storage and the cache size on each node will depend on the node family and node size.
Enabling cache for new Spark pools
When creating a new Spark pool, browse under the additional settings tab to find the Intelligent cache slider. Move this to your preferred size to enable the feature.
Enabling/Disabling cache for existing Spark pools
For existing Spark pools, browse to the Scale settings of your Apache Spark pool of choice. To enable, move the slider to a value more than 0. To disable, move the slider to 0.
Changing cache size for existing Spark pools
To change the Intelligent Cache size of a pool, you must force a restart if the pool has active sessions. If the Spark pool has an active session, you will see an option to “Force new settings”. Click on the check box and select “Apply” to automatically restart the session.
Once the session is restarted, you can adjust the size using the slider.
Enabling and disabling the cache within the session
Easily enable and disable the Intelligent cache within a session by running the following code in your notebook:
To learn more, read Intelligent Cache in Azure Synapse Analytics
You can now use Azure Active Directory authentication to centrally manage access to all Azure Synapse resources, including SQL pools. While SQL pools in Azure Synapse will still support the creation of local users, you can turn off this capability during or after Azure Synapse analytics workspace creation. You can also benefit from this feature if you do not want to manage local accounts.
To disable local authentication in Synapse during workspace creation, select 'Use only Azure Active Directory (Azure AD) authentication' as the authentication method. A SQL Administrator login will still be created, but it will be disabled. Local authentication can be enabled later by an Azure Owner or Contributor of the Synapse workspace.
You can also disable local authentication after a workspace is created through the Azure Portal. Local authentication cannot be disabled until an Azure Active Directory admin is created for the Azure Synapse workspace.
Learn more by reading Active Directory Authentication
You can now raise or lower the workspace managed SQL server dedicated SQL minimal TLS version using an API for both new and existing Synapse workspaces. If you were previously using a lower client version in the new workspace, which enforces min1.2, you can now lower the TLS version. If you have an existing workspace, you can now raise the minTLS version to meet your security needs.
While serverless SQL endpoints and development endpoints only accept TLS 1.2 and above, workspace Managed SQL Server Dedicated SQL allows connections using all TLS versions. Starting in December 2021, a requirement for TLS 1.2 has been implemented for new Synapse Workspaces only. Login attempts to the newly created Synapse workspace from connections using a TLS version lower than 1.2 will fail.
Learn more by reading minTLS REST API
Data engineers can now use the features in Synapse Data Flows that allow for reusable and composable ETL logic called Flowlets, as well as the change capture capabilities of Cosmos DB, Blob Store, Azure SQL DB, ADLS Gen1, ADLS Gen2, and CDM sources in data flows, as generally available features ready for production workloads. With the change capture features, Synapse pipelines and data flows will handle the management of a checkpoint for you. All you have to do is check the “enable change feed” checkbox and, every time that the pipeline executes, only the inserted or updated rows will be read by the pipeline.
New this month is a native sftp connector in Synapse data flows. Now, you can read and write data while transforming data from sftp using the visual low-code data flows interface in Synapse. Use this in your ETL pipelines when your source data is stored in a sftp site where you must join, aggregate, or otherwise transform the data before writing it back.
Learn more by reading Source Transformation.
A few new and exciting features were added to Data Preview in data flows will make the job of data engineers building ETL jobs in Synapse much easier!
Learn more by reading Data Preview and Debug Improvements in Mapping Data Flows.
We’ve introduced a brand-new activity type to Synapse pipelines! The first language we’ve enabled in the activity is SQL to support database types like Synapse SQL. The Script Activity enables data engineers to build very powerful data integration pipelines that can read from and write to Synapse databases, as well as other database types. With support for both DDL & DML commands, you can use the script inline in your pipelines to perform actions like logging or table lookups. You can also modify table schemas and update rows using the SQL Script activity.
Learn more by reading Transform data by using the Script activity in Azure Data Factory or Synapse Analytics
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.