Azure Synapse Analytics Blog

7 MIN READ

Azure Synapse Analytics September Update 2022

Microsoft

Sep 28, 2022

Azure Synapse Analytics September Update 2022

Welcome to the September 2022 update for Azure Synapse Analytics! This month, you will find information about MERGE for dedicated SQL pools, Auto-statistics for CSV datasets in serverless SQL and the new Livy errors codes for Apache Spark in Azure Synapse Analytics. Additional general updates and new features in Synapse Data Explorer and Data Integration are also mentioned in this edition.

Don’t forget to check out our companion video on the Azure Synapse Analytics YouTube channel!

SQL
- Auto-statistics for OPENROWSET in CSV datasets
- MERGE T-SQL is now Generally Available
Apache Spark for Synapse
- New informative Livy errors codes
Synapse Data Explorer
Data Integration

SQL

Auto-statistics for OPENROWSET in CSV datasets

Serverless SQL pool relies on statistics to generate an optimal query execution plan and resource allocation. Previously, you had to manually create statistics for your CSV datasets when querying with OPENROWSET to get optimal performance due to the lack of CSV auto-statistics feature. With this improvement, serverless SQL pool will automatically create statistics for CSV datasets when needed to ensure an optimal query execution plan is generated. This feature is currently enabled for OPENROWSET only.

How much performance improvement CSV statistics will bring depends on query complexity and the amount of data processed. The following chart compares durations of TPCH queries on top of 10MB and 1TB CSV datasets with and without statistics:

In both cases, improvement is ~25% with no additional user effort. With this improvement, statistics will be created automatically for your CSV datasets queried with OPENROWSET. That way you can enjoy faster queries over CSV files!

To learn more, please read Statistics in serverless SQL pool.

MERGE T-SQL is now Generally Available

We are happy to announce that the MERGE T-SQL command for Azure Synapse Dedicated SQL pools is now Generally Available! MERGE has been a highly requested addition to the Synapse T-SQL library that encapsulates INSERTs/UPDATEs/DELETEs into a single statement, drastically easing migrations and making table synchronization a breeze.

To ensure you are using the official GA-supported version of MERGE, check that the ‘@@version’ of your Synapse Dedicated SQL pool is on ‘10.0.17829.0’ or beyond. To learn more about the MERGE statement in Synapse SQL Dedicated pools, check out MERGE (Transact-SQL) and the Tech Community Blogpost.

Apache Spark for Synapse

New informative Livy errors codes

Many factors can play into the failure of a Spark application in Azure Synapse Analytics. For example, it can be due to a system error or a user related error. Previously, all errors in failing Spark jobs surfaced with a generic error code displaying LIVY_JOB_STATE_DEAD. This error code gave no further insight into why the job had failed and required significant effort to identify the root cause by digging into the driver, executor, Spark Event and Livy logs to find a solution.

To make troubleshooting errors easier, we have introduced a more precise list of error codes that describes the cause of failure and replaces the previous generic error codes. When a Spark job fails on Azure Synapse, the error handling feature now parses and checks the logs on the backend to identify the root cause and displays it on the monitoring pane along with the steps to take to resolve the issue.

Error codes are now divided into four different categories:

User: Indicating a user error
System: Indicating a system error
Ambiguous: Could be either user or system error
Unknown: No classification yet, most probably because the error type isn't included in the model

Note that this feature is currently disabled by default but will be turned on by default for all Spark application shortly.

For more details, please read Interpret error codes in Synapse Analytics

Synapse Data Explorer

Logstash connector HTTP/HTTPS proxy configuration

Logstash is an open-source server-side data processing pipeline that ingests data from a multitude of sources, transforms it, and then sends it to your favorite destination.

The Azure Data Explorer (ADX) Logstash plugin enables you to process events from Logstash into an Azure Data Explorer database for later analysis. The latest release, v1.0.5, now supports HTTP and HTTPS proxies specified through configuration.

Kafka Connect support of Protobuf format

Using Protocol buffers is increasingly becoming a data exchange choice in IoT based systems because of the excellent schema support, extensibility to various platforms and compression. The ADX Kafka sink connector leverages the Kafka Connect framework and provides an adapter to ingest data from Kafka in JSON, Avro, String and now Protobuf format in the latest update.

To learn more, please read Ingesting Protobuf data from Kafka to Azure Data Explorer

Embed Azure Data Explorer dashboards in third-party apps

Azure Data Explorer dashboards can now be embedded in 3rd party apps. This comes on top of allowing embedding of the Monaco editor in 3rd party apps.

Dashboard embedding allows you to easily share data with your customers in a way that allows them to interact and explore it. Using various feature flags, you can manage what controls will be part of the embedded dashboard experience. For example, you can decide to remove sharing, and add connection menu items.

To learn more about dashboard embedding in Data Explorer, read Embed dashboards.

Funnel visuals

Funnel is the latest visual we added to Azure Data Explorer dashboards following the feedback we received from customers. Funnel is usually used to track sequential flows to analyze attrition from stage to stage.

For more details, read Customize Dashboard Visuals - Funnel

.NET and Node.js support in Sample App Generator

The sample app generator wizard is a tool that allows you to create a working app to ingest and query your data in your preferred programming language. You can use this generated code as a baseline to write your own apps, altering the code as you go, or you can copy sections of the code into your own apps. The code includes comments with tips, best practices, links to reference material, and recommended TODO changes when adapting the code to your needs.

Starting today, generating sample apps in .NET and Node.js is supported along with the previously available options – Java & Python. Give it a try here.

To learn more, please read Use the sample app generator wizard to create code to ingest and query your data.

Data Integration

Gantt Chart view supported in Integration Runtime Monitoring

You now have the option to view your activity runs with a Gantt chart in Integration Runtime monitoring. A Gantt chart is a view that allows you to see the run history over a time range. This will give you additional monitoring views of your integration runtimes.

For more information, read Monitor integration runtime in Azure Data Factory.

Monitoring Improvements

We’ve released a new bundle of improvements to the monitoring experience based on community feedback.

Sorting: We now have more columns available for sorting. Previously, you could only sort by Run start. Now, you can sort by Pipeline Name, Run End, and Status. In addition to that, you can also sort by Trigger Name and Status when viewing triggers.

Pipeline Consumption Report: The pipeline consumption report now shows a loading icon when there are many activities in a pipeline to let you know when the consumption information has been fully calculated.

Rerun Pipeline Grouping: Previously, when pipelines were re-run, all re-runs were retrieved for the monitoring UI, even when the re-runs were not viewed. Now, pipeline groupings for re-runs will only fetch that data when the grouping is expanded. This will improve the performance of monitoring pages that have multiple pipeline re-run groupings.

Time zone settings: Monitoring time zone settings are now saved in your browser cache with your configured columns so that you don’t have to reconfigure your settings each time.

Pipeline runs: When monitoring pipeline runs, all records for pipelines with many activities will be retrieved. We’ve now added pages so that you can navigate through all the activity records.

To learn more about these updates, read Azure Data Factory monitoring improvements.

Maximum column optimization in mapping dataflow

For CSV data sources, we’ve added a new setting that allows you to set the maximum number of columns. The default is 20480 so if you have more than 20480 columns, you can customize this value and optimize it for your dataset.

To learn more about this setting, refer to this document: Delimited text format in Azure Data Factory.

Support NUMBER to integer conversion in Oracle source connector

We have added new support to convert Oracle NUMBER type to corresponding integer in source. You will see a new property convertDecimalToInteger, which will allow you to convert Oracle NUMBER type from zero or an unspecified scale to the corresponding integer.

To learn more about this update, read Copy data to and from Oracle by using Azure Data Factory or Azure Synapse Analytics.

Support for sending a body with HTTP request DELETE method in Web activity

We have added support for sending a body (optional) when using the DELETE method in Web activity. This is useful in the case where your web endpoint expects a body to be passed with the HTTP request DELETE method.

To learn more, refer to Web Activity in Azure Data Factory and Azure Synapse Analytics.

Updated Sep 28, 2022

Version 2.0

Monthly Update

ryanmajidi

Microsoft

Joined December 03, 2020

View Profile

Azure Synapse Analytics Blog

Follow this blog board to get notified when there's new activity

6 Comments

mikecarlo
Copper Contributor
Sep 29, 2022
Also I’m looking for better online notebook error handling when a notebook is executed in a synapse pipeline. Again databricks has a much better notebook debugging experience here.

I feel like this is also a show stopper that needs to be addressed for more adoption of synapse.

Any plans on enhancing the debugging experience?
mikecarlo
Copper Contributor
Sep 29, 2022
Not seeing any updates to the spark runtime engine. It’s very difficult to run multiple notebooks on a cluster because you have to reserve cores. Are there any plans to change this behavior to run multiple notebooks against a cluster similarly like how databricks manages notebooks running on a cluster?

Without this feature parody I can’t migrate my workloads over to spark synapse. Any thoughts on when this enhancement will appear?
dcastell1
Copper Contributor
Sep 29, 2022
Hello ryanmajidi

Thank you again for your monthly summary. I always read this posts with excitement, keep up the good work. Hoping to see improvements in Delta Lake support for the serverless SQL pool and also the Spark Notebooks in next updates.

Cheers!
johanali
Copper Contributor
Sep 29, 2022
Azure Synapse Analytics is an integration layer on top of the Azure Synapse SQL Pool. The additional layer is in public preview mode, where the SQL Pools (Azure Synapse) are mature and production ready.
josemanuelmolina
Copper Contributor
Sep 29, 2022
Hello ryanmajidi ,

Many thanks for sharing and happy to see the continous investment from Microsoft in Synapse.
Every time I get the monthly upate I'm looking for the Global variable support in the Synapse Workspace which is available in the ADF.
Wondering why this has not yet included in Synapse as it is quite useful feature which is been requested long time ago by the community.
https://feedback.azure.com/d365community/idea/eaa47674-0442-ec11-a819-000d3ae2b5ca

Cheers.
JamesvandenBerg
MVP
Sep 28, 2022
Thank you ryanmajidi for sharing these updates with the Community