Welcome to the September 2022 update for Azure Synapse Analytics! This month, you will find information about MERGE for dedicated SQL pools, auto-statistics for CSV datasets in serverless SQL, and the new Livy error codes for Apache Spark in Azure Synapse Analytics. Additional general updates and new features in Synapse Data Explorer and Data Integration are also covered in this edition.
Don’t forget to check out our companion video on the Azure Synapse Analytics YouTube channel!
Serverless SQL pool relies on statistics to generate an optimal query execution plan and resource allocation. Previously, when querying CSV datasets with OPENROWSET, you had to create statistics manually to get optimal performance because serverless SQL pool did not auto-create statistics for CSV data. With this improvement, serverless SQL pool automatically creates statistics for CSV datasets when they are needed to generate an optimal query execution plan. This feature is currently enabled for OPENROWSET only.
How much performance improvement CSV statistics bring depends on query complexity and the amount of data processed. Comparing the durations of TPC-H queries over 10 MB and 1 TB CSV datasets, with and without statistics, the improvement is roughly 25% in both cases, with no additional user effort. Statistics are now created automatically for CSV datasets queried with OPENROWSET, so you can enjoy faster queries over your CSV files!
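As an illustration, here is a minimal serverless SQL query over CSV data with OPENROWSET; the storage path and column schema below are hypothetical placeholders, not from the announcement. Statistics on the referenced columns are now created automatically when they would improve the plan:

```sql
-- Query a CSV dataset with OPENROWSET; serverless SQL pool now creates
-- statistics for the referenced columns automatically when beneficial.
-- The storage URL and WITH schema are illustrative placeholders.
SELECT r.region, COUNT(*) AS order_count
FROM OPENROWSET(
        BULK 'https://contoso.dfs.core.windows.net/data/orders/*.csv',
        FORMAT = 'CSV',
        PARSER_VERSION = '2.0',
        HEADER_ROW = TRUE
    ) WITH (
        region   VARCHAR(50),
        order_id INT
    ) AS r
GROUP BY r.region;
```

No query changes are required to benefit; the optimizer decides when statistics are worth creating.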
To learn more, please read Statistics in serverless SQL pool.
We are happy to announce that the MERGE T-SQL command for Azure Synapse Dedicated SQL pools is now Generally Available! MERGE has been a highly requested addition to the Synapse T-SQL library that encapsulates INSERTs/UPDATEs/DELETEs into a single statement, drastically easing migrations and making table synchronization a breeze.
To ensure you are using the official GA-supported version of MERGE, check that @@VERSION for your Synapse dedicated SQL pool reports '10.0.17829.0' or later. To learn more about the MERGE statement in Synapse dedicated SQL pools, check out MERGE (Transact-SQL) and the Tech Community blog post.
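As a sketch of the pattern, the following statement synchronizes a target table from a staging table in a single pass; the table and column names are illustrative, not from the announcement:

```sql
-- Keep dbo.DimCustomer in sync with dbo.StagingCustomer using one MERGE
-- statement instead of separate INSERT/UPDATE/DELETE passes.
-- All object names here are hypothetical examples.
MERGE dbo.DimCustomer AS tgt
USING dbo.StagingCustomer AS src
    ON tgt.CustomerId = src.CustomerId
WHEN MATCHED AND src.IsDeleted = 1 THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET tgt.Name = src.Name,
               tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerId, Name, City)
    VALUES (src.CustomerId, src.Name, src.City);
```

This is the typical table-synchronization shape the statement was designed for: matched rows are updated or deleted, and unmatched source rows are inserted.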
Many factors can contribute to the failure of a Spark application in Azure Synapse Analytics; the cause may be a system error or a user-related error. Previously, all failing Spark jobs surfaced a generic error code, LIVY_JOB_STATE_DEAD. This code gave no further insight into why the job had failed, and identifying the root cause required significant effort digging through the driver, executor, Spark event, and Livy logs.
To make troubleshooting easier, we have introduced a more precise list of error codes that describe the cause of failure, replacing the previous generic error code. When a Spark job fails on Azure Synapse, the error-handling feature now parses and checks the backend logs to identify the root cause, and displays it in the monitoring pane along with the steps to take to resolve the issue.
Error codes are now divided into four different categories:
Note that this feature is currently disabled by default but will be turned on by default for all Spark applications shortly.
For more details, please read Interpret error codes in Synapse Analytics.
Logstash is an open-source server-side data processing pipeline that ingests data from a multitude of sources, transforms it, and then sends it to your favorite destination.
The Azure Data Explorer (ADX) Logstash plugin enables you to process events from Logstash into an Azure Data Explorer database for later analysis. The latest release, v1.0.5, now supports HTTP and HTTPS proxies specified through configuration.
Protocol Buffers (Protobuf) is increasingly becoming a data-exchange choice in IoT-based systems because of its excellent schema support, extensibility to various platforms, and compression. The ADX Kafka sink connector leverages the Kafka Connect framework and provides an adapter to ingest data from Kafka in JSON, Avro, String, and, in the latest update, Protobuf format.
To learn more, please read Ingesting Protobuf data from Kafka to Azure Data Explorer.
Azure Data Explorer dashboards can now be embedded in third-party apps. This comes on top of the existing support for embedding the Monaco editor in third-party apps.
Dashboard embedding allows you to easily share data with your customers in a way that lets them interact with and explore it. Using various feature flags, you can manage which controls are part of the embedded dashboard experience. For example, you can remove the sharing and add-connection menu items.
To learn more about dashboard embedding in Data Explorer, read Embed dashboards.
The funnel chart is the latest visual added to Azure Data Explorer dashboards, based on feedback we received from customers. Funnel charts are typically used to track sequential flows and analyze attrition from stage to stage.
For more details, read Customize Dashboard Visuals - Funnel.
The sample app generator wizard is a tool that allows you to create a working app to ingest and query your data in your preferred programming language. You can use this generated code as a baseline to write your own apps, altering the code as you go, or you can copy sections of the code into your own apps. The code includes comments with tips, best practices, links to reference material, and recommended TODO changes when adapting the code to your needs.
Starting today, generating sample apps in .NET and Node.js is supported, in addition to the previously available Java and Python options. Give it a try here.
To learn more, please read Use the sample app generator wizard to create code to ingest and query your data.
You now have the option to view your activity runs with a Gantt chart in Integration Runtime monitoring. A Gantt chart is a view that allows you to see the run history over a time range. This will give you additional monitoring views of your integration runtimes.
For more information, read Monitor integration runtime in Azure Data Factory.
We’ve released a new bundle of improvements to the monitoring experience based on community feedback.
Sorting: We now have more columns available for sorting. Previously, you could only sort by Run start. Now, you can sort by Pipeline Name, Run End, and Status. In addition to that, you can also sort by Trigger Name and Status when viewing triggers.
Pipeline Consumption Report: The pipeline consumption report now shows a loading icon when there are many activities in a pipeline to let you know when the consumption information has been fully calculated.
Rerun Pipeline Grouping: Previously, when pipelines were re-run, all re-runs were retrieved for the monitoring UI, even when the re-runs were not viewed. Now, pipeline groupings for re-runs will only fetch that data when the grouping is expanded. This will improve the performance of monitoring pages that have multiple pipeline re-run groupings.
Time zone settings: Monitoring time zone settings are now saved in your browser cache with your configured columns so that you don’t have to reconfigure your settings each time.
Pipeline runs: When monitoring runs of pipelines with many activities, all activity records are retrieved. We've now added pagination so that you can navigate through all the activity records.
To learn more about these updates, read Azure Data Factory monitoring improvements.
For CSV data sources, we’ve added a new setting that allows you to set the maximum number of columns. The default is 20480, so if you have more than 20480 columns, you can customize this value to fit your dataset.
To learn more about this setting, refer to this document: Delimited text format in Azure Data Factory.
We have added support for converting the Oracle NUMBER type to a corresponding integer type on the source side. A new property, convertDecimalToInteger, allows you to convert Oracle NUMBER values with a zero or unspecified scale to the corresponding integer type.
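As a hedged sketch, the new property might appear in a copy activity's Oracle source definition like this; the query and surrounding JSON shape are illustrative placeholders, so check the linked documentation for the exact usage:

```json
{
    "source": {
        "type": "OracleSource",
        "oracleReaderQuery": "SELECT * FROM MYTABLE",
        "convertDecimalToInteger": true
    }
}
```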
To learn more about this update, read Copy data to and from Oracle by using Azure Data Factory or Azure Synapse Analytics.
We have added support for sending a body (optional) when using the DELETE method in Web activity. This is useful in the case where your web endpoint expects a body to be passed with the HTTP request DELETE method.
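For illustration, a Web activity definition using the DELETE method with an optional body might look like the following; the URL and payload are hypothetical placeholders:

```json
{
    "name": "DeleteResource",
    "type": "WebActivity",
    "typeProperties": {
        "url": "https://example.com/api/items/42",
        "method": "DELETE",
        "body": {
            "reason": "cleanup"
        }
    }
}
```

The body is simply passed through with the HTTP DELETE request for endpoints that expect one.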
To learn more, refer to Web Activity in Azure Data Factory and Azure Synapse Analytics.