Azure Synapse Analytics November Update 2022
Published Nov 30 2022 08:00 AM
Microsoft

Welcome to the November 2022 update for Azure Synapse Analytics! This month, you’ll find sections on increased Spark performance, the new Kusto Emulator, as well as additional updates in Apache Spark for Synapse, Synapse Data Explorer, and Machine Learning.

Apache Spark for Synapse

Increasing Spark performance

We are always working to improve Spark performance in Azure Synapse Analytics. We are making significant changes that will increase Spark performance by up to 77%.

Based on our testing with the 1TB TPC-H industry benchmark, you're likely to see up to 77% increased performance. While your workload may perform differently than the TPC-H benchmark, everyone is expected to see improved performance. These Spark performance improvements come from moving to the latest Azure v5 VMs, which have improved CPU performance, increased temporary SSD throughput, and higher remote storage IOPS.

We have over 40 regions worldwide and will be implementing this change region by region. Canada Central will be the first region we upgrade. We expect these changes to take many months to roll out worldwide. We will publish each region that we update, and customers will automatically receive the performance increase in each region at no cost.

No action is required on your part. After each region is upgraded, your jobs will complete in less time. You could choose to reduce the node size or the number of nodes if cost savings are more important to you than job completion time.
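To get a rough feel for this trade-off, the arithmetic can be sketched in Python. This is an illustration only: it assumes that "77% increased performance" means 1.77x throughput, that the full 77% applies to your workload (the figure is an upper bound from TPC-H), and that runtime scales linearly with node count. The function names and sample numbers are hypothetical.

```python
import math


def runtime_after_speedup(baseline_minutes: float, perf_increase_pct: float) -> float:
    """Estimated runtime if throughput rises by perf_increase_pct percent.

    Assumption: a P% performance increase means (1 + P/100)x throughput.
    """
    return baseline_minutes / (1.0 + perf_increase_pct / 100.0)


def nodes_for_same_runtime(baseline_nodes: int, perf_increase_pct: float) -> int:
    """Smallest node count that roughly keeps the old runtime.

    Assumption: runtime scales linearly with the number of nodes,
    which real Spark jobs only approximate.
    """
    return max(1, math.ceil(baseline_nodes / (1.0 + perf_increase_pct / 100.0)))


# Hypothetical job: 60 minutes on 10 nodes before the upgrade.
print(round(runtime_after_speedup(60, 77), 1))  # best-case minutes on the same 10 nodes
print(nodes_for_same_runtime(10, 77))           # best-case nodes for about 60 minutes
```

Treat the results as a best case and measure your own jobs before resizing a pool.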

To learn more about the increase in Spark performance, read optimizing Spark performance and Apache Spark pool configurations.

Synapse Data Explorer

ADX Emulator

The ADX Emulator is a Docker image that exposes an ADX query engine endpoint. You can use it to create databases and to ingest and query data. The emulator understands Kusto Query Language (KQL) the same way the Azure service does. You can therefore use it for local development and be confident that the code will run the same way in an Azure Data Explorer cluster. You can also deploy it in a CI/CD pipeline to run automated test suites that verify your code behaves as expected.

To learn more about the Emulator, read ADX Emulator and watch Kusto Emulator on YouTube.

Ingesting files from AWS S3

Amazon S3 is one of the most popular object storage services. AWS customers use Amazon S3 to store data for a range of use cases, such as data lakes, websites, mobile applications, backup and restore, archive, applications, IoT devices, log analytics, and big data analytics.

With the native S3 ingestion support in ADX, customers can bring data from S3 natively without relying on complex ETL pipelines. Customers can also create a continuous data ingestion pipeline to bring data from S3.

To learn more about ingesting files from AWS S3, read Azure Data Explorer supports native ingestion from Amazon S3.

Azure Stream Analytics ADX output [Generally Available]

Azure Data Explorer output for Azure Stream Analytics is now Generally Available. ASA-ADX output has been available in Preview since last year. Customers can build a powerful real-time analytics architecture by using ASA and ADX together. With this new integration, an Azure Stream Analytics job can natively ingest data into Azure Data Explorer and Synapse Data Explorer tables.

To learn more about Azure Stream Analytics ADX output, read about the output plugin set up and ASA-ADX common use cases.

OpenTelemetry exporter

OpenTelemetry (OTel) is a vendor-neutral open-source observability framework for instrumenting, generating, collecting, and exporting telemetry data such as traces, metrics, and logs.

The ADX OpenTelemetry exporter supports the ingestion of data from many receivers into Azure Data Explorer, allowing customers to instrument, generate, collect, and store data using a vendor-neutral open-source framework.

To learn more about the OpenTelemetry exporter, read Ingest data from OpenTelemetry to Azure Data Explorer.

Streaming support in Telegraf connector

Telegraf is an open source, lightweight, minimal memory footprint agent for collecting, processing, and writing telemetry data including logs, metrics, and IoT data. The Azure Data Explorer output plugin serves as the connector from Telegraf and supports ingestion of data from many types of input plugins into Azure Data Explorer.

We have added support for "managed" streaming ingestion in Telegraf. This mode defaults to streaming ingestion, which provides latency of up to a second when the target table is streaming-enabled, with a fallback to batched or queued ingestion.
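The "try streaming first, then fall back" idea can be sketched in Python as follows. This is a conceptual illustration, not the Telegraf plugin's code: the `stream` and `queue` callables are hypothetical stand-ins for the two ingestion paths, and treating any exception as the fallback trigger is an assumption made for this sketch.

```python
from typing import Callable, List, Sequence


def ingest_managed(
    batch: Sequence[str],
    stream: Callable[[Sequence[str]], None],
    queue: Callable[[Sequence[str]], None],
) -> str:
    """Send a batch via the streaming path, falling back to the queued path.

    Returns the name of the path that accepted the batch.
    """
    try:
        stream(batch)  # low-latency path, preferred
        return "streaming"
    except Exception:
        # Assumption: any streaming failure triggers the fallback.
        queue(batch)  # batched/queued path
        return "queued"


# Hypothetical stand-ins: streaming is rejected, queued ingestion accepts.
queued_rows: List[str] = []


def failing_stream(rows: Sequence[str]) -> None:
    raise RuntimeError("target table is not streaming-enabled")


print(ingest_managed(["cpu,host=a usage=0.5"], failing_stream, queued_rows.extend))
```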

To learn more about the Telegraf connector, read Ingest data from Telegraf into Azure Data Explorer.

Protobuf support in Kafka sink

Protocol buffers (Protobuf) are a language-neutral, platform-neutral, extensible mechanism for serializing and deserializing structured data for use in communications protocols and data storage. The Azure Data Explorer Kafka sink, a gold-certified Confluent connector, helps ingest data from Kafka into Azure Data Explorer. We have added Protobuf support to the connector to help customers bring Protobuf data into ADX.

To learn more about Protobuf support, read Ingesting Protobuf data from Kafka to Azure Data Explorer.

Leader follower discoverability

We have enhanced the discoverability of leader and follower databases in your ADX clusters. You can visit the database blade in the Azure portal to easily identify all the follower databases following a leader, and the leader for a given follower. The details pane also shows which specific tables, external tables, and materialized views have been included or excluded.

To learn more about leader follower discoverability, read Use follower databases.

Aliasing follower databases

The follower database feature allows you to attach a database located in a different cluster to your Azure Data Explorer cluster. Before the aliasing capability, a database named DB created on the follower cluster took precedence over a database with the same name created on the leader cluster, so databases with the same name could not coexist. Now you can override the database name while establishing a follower relationship. This allows you to follow multiple databases with the same name from multiple leader clusters, or simply to make a database available to users under a more user-friendly name.

You can either use the databaseNameOverride property to provide a new follower database name, or use databaseNamePrefix when following an entire cluster to add a prefix to the original names of all databases from the leader cluster.
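As a minimal sketch of how the two properties might be used, the Python below builds a properties dictionary for an attached database configuration and shows the names that would result on the follower. Only databaseNameOverride and databaseNamePrefix come from the text above. The other field names (clusterResourceId, databaseName, defaultPrincipalsModificationKind), the use of "*" to mean an entire cluster, and the validation rules are assumptions recalled from the REST reference or chosen for this sketch, so check them against the Attached Database Configurations documentation. The function names and resource ID are hypothetical.

```python
from typing import Dict, List, Optional


def follower_properties(
    leader_cluster_resource_id: str,
    database_name: str = "*",  # assumption: "*" follows the entire cluster
    name_override: Optional[str] = None,
    name_prefix: Optional[str] = None,
) -> Dict[str, str]:
    """Build the 'properties' part of an attached database configuration."""
    # Sketch-level guards, not documented service rules.
    if name_override is not None and name_prefix is not None:
        raise ValueError("use either an override or a prefix in this sketch")
    if name_override is not None and database_name == "*":
        raise ValueError("one override name cannot apply to every database")
    props = {
        "clusterResourceId": leader_cluster_resource_id,
        "databaseName": database_name,
        "defaultPrincipalsModificationKind": "Union",
    }
    if name_override is not None:
        props["databaseNameOverride"] = name_override
    if name_prefix is not None:
        props["databaseNamePrefix"] = name_prefix
    return props


def follower_names(leader_databases: List[str], props: Dict[str, str]) -> List[str]:
    """Names as they would appear on the follower, under this sketch's reading."""
    if props["databaseName"] == "*":
        selected = leader_databases
    else:
        selected = [d for d in leader_databases if d == props["databaseName"]]
    override = props.get("databaseNameOverride")
    prefix = props.get("databaseNamePrefix", "")
    return [override if override is not None else prefix + d for d in selected]


leader_id = "/subscriptions/<subscription id>/.../clusters/<leader cluster>"  # placeholder
one_db = follower_properties(leader_id, "DB", name_override="SalesDB")
whole_cluster = follower_properties(leader_id, name_prefix="contoso_")
print(follower_names(["DB", "Logs"], one_db))         # ['SalesDB']
print(follower_names(["DB", "Logs"], whole_cluster))  # ['contoso_DB', 'contoso_Logs']
```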

To learn more about aliasing follower databases, read Attached Database Configurations - Create Or Update.

For usage code samples, see Use follower databases.

Parse-kv operator

The parse-kv operator is a new operator that extracts structured information from a string expression and represents it in key/value form.

The following extraction modes are supported:

  • Specified delimiter: Extraction based on specified delimiters that dictate how keys/values and pairs are separated from each other.
  • Non-specified delimiter: Extraction with no need to specify delimiters. Any non-alphanumeric character is considered a delimiter.
  • Regex: Extraction based on an RE2 regular expression.
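To make the specified-delimiter mode concrete, here is a small Python emulation of the idea. It is not the operator's implementation and does not cover the other two modes. The function name, the parameter names, the choice to return None for keys that are missing, and the first-occurrence-wins rule are all assumptions made for this sketch.

```python
from typing import Dict, Iterable, Optional


def parse_kv_specified(
    text: str,
    keys: Iterable[str],
    pair_delimiter: str = ",",
    kv_delimiter: str = "=",
) -> Dict[str, Optional[str]]:
    """Extract the requested keys from text using explicit delimiters.

    pair_delimiter separates key/value pairs from each other;
    kv_delimiter separates a key from its value.
    """
    result: Dict[str, Optional[str]] = {key: None for key in keys}
    for pair in text.split(pair_delimiter):
        key, sep, value = pair.partition(kv_delimiter)
        key = key.strip()
        # Keep only requested keys; the first occurrence wins in this sketch.
        if sep and key in result and result[key] is None:
            result[key] = value.strip()
    return result


line = "ThreadId:458745723, Machine:Node001, Level: Info"
print(parse_kv_specified(line, ["ThreadId", "Machine"], ",", ":"))
# {'ThreadId': '458745723', 'Machine': 'Node001'}
```

Values stay as strings here; converting them to typed columns is left out of the sketch.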

To learn more about the new operator, read parse-kv operator.

Scan operator

This powerful operator enables efficient and scalable process mining, sequence analytics, and user analytics in ADX. The user can define a linear sequence of events and ‘scan’ will quickly extract all sequences of those events. Common scenarios for using ‘scan’ include preventive maintenance for IoT devices, customer funnel analysis, recursive calculation, security scenarios looking for known attack steps, and more.
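The notion of matching a linear sequence of events can be illustrated with a short Python sketch of a funnel check. It conveys the idea only and does not reproduce the semantics or performance of the scan operator. The function name, the event tuples, and the rule that unrelated events are ignored between steps are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple


def completed_sequences(
    events: Iterable[Tuple[str, str]], steps: List[str]
) -> List[str]:
    """Return entities that hit every step in order.

    events: (entity, event_name) tuples, already sorted by time.
    steps:  the linear sequence of event names to look for.
    """
    progress: Dict[str, int] = {}  # entity -> index of the next expected step
    done: List[str] = []
    for entity, name in events:
        idx = progress.get(entity, 0)
        # Advance only when the event is the next expected step.
        if idx < len(steps) and name == steps[idx]:
            idx += 1
            progress[entity] = idx
            if idx == len(steps):
                done.append(entity)
    return done


events = [
    ("u1", "view"), ("u2", "view"), ("u1", "cart"),
    ("u2", "purchase"), ("u1", "purchase"), ("u2", "cart"),
]
print(completed_sequences(events, ["view", "cart", "purchase"]))  # ['u1']
```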

To learn more about the new operator, read scan operator.

Machine Learning

R Support [Public Preview]

Azure Synapse Analytics now provides built-in R support for Apache Spark; this capability is currently in public preview. It enables data scientists to use the industry-standard R language to process data, develop ML models, and analyze their data. Data scientists and analysts can now leverage R in Azure Synapse Analytics through the following capabilities:

  • Azure Synapse Analytics R runtime: Azure Synapse Analytics supports an R runtime that features many popular open-source R packages.
  • Access Apache Spark through R: Azure Synapse Notebooks also include support for SparkR and SparklyR, which allows users to interact with Spark using familiar Spark or R interfaces.
  • Import custom R packages: Users can standardize the R packages on an Azure Synapse Apache Spark pool by uploading the package as a workspace package.
  • Install session-scoped packages: When doing interactive data analysis or machine learning, you might try newer packages, or you might need packages that are currently unavailable on your Apache Spark pool. Instead of updating the pool configuration, users can now use the familiar R syntax to add, manage, and update session dependencies.

With the new R support, you can install an R library from CRAN and CRAN snapshots. For example, highcharter is a popular package for R visualizations. You can install this package on all nodes within your Apache Spark pool using the following command:

install.packages("highcharter", repos = "https://cran.microsoft.com/snapshot/2021-07-16/")

You can also create a SparkR dataframe through the Spark Data Source API with the following code:

# Read a CSV file from ADLS Gen2
df <- read.df('abfss://<container name>@<storage account name>.dfs.core.windows.net/<file name>.csv', 'csv', header="true")
head(df)

To learn more about how you can start leveraging R in Synapse, read Use R for Apache Spark with Azure Synapse Analytics.

Last update: Nov 30 2022 08:58 AM