Welcome to the November 2021 update for Azure Synapse Analytics! As we reach the end of 2021, we're excited to show you all the new things that are now available for you.
Table of Contents
Synapse Data Explorer
- Synapse Data Explorer now available in preview
Working with Databases and Data Lakes
- Introducing Lake databases (formerly known as Spark databases)
- Lake database designer now available in preview
- Database Templates and Database Designer
SQL
- Delta Lake support for serverless SQL is generally available
- Query multiple file paths using OPENROWSET in serverless SQL
- Serverless SQL queries can now return up to 200GB of results
- Handling invalid rows with OPENROWSET in serverless SQL
Apache Spark for Synapse
- Accelerate Spark workloads with NVIDIA GPU acceleration
- Mount remote storage to a Synapse Spark pool
- Natively read & write data in ADLS with Pandas
- Dynamic allocation of executors for Spark
Machine Learning
- The Synapse ML library
- Getting started with state-of-the-art pre-built intelligent models
- Building responsible AI systems with the Synapse ML library
- PREDICT is now GA for Synapse Dedicated SQL pools
- Simple & scalable scoring with PREDICT and MLFlow for Apache Spark for Synapse
- Retail AI solutions
Security
- User-Assigned managed identities now supported in Synapse Pipelines in preview
- Browse ADLS Gen2 folders in an Azure Synapse Analytics workspace in preview
Data Integration
- Pipeline Fail activity
- Mapping Data Flow gets new native connectors
Synapse Link
- Synapse Link for Dataverse
- Custom partitions for Synapse Link for Azure Cosmos DB in preview
- Synapse Link for SQL Server
Synapse Data Explorer
Synapse Data Explorer now available in preview
At Microsoft Ignite 2021, we announced the public preview of Azure Synapse data explorer. Azure Synapse data explorer complements the Synapse SQL and Apache Spark analytic engines already available in Synapse Analytics. Synapse data explorer is optimized for efficient log analytics using powerful indexing technology. It automatically indexes free-text and semi-structured data commonly found in telemetry, such as logs, time series, and events, making it easy to perform telemetry analytics in Synapse workspaces.
Learn more by reading:
- Synapse data explorer Ignite demo
- Introducing Azure Synapse data explorer for log and telemetry analytics
- Getting started with Azure Synapse data explorer
- What is Azure Synapse Data Explorer?
Working with Databases and Data Lakes
Introducing Lake databases (formerly known as Spark databases)
Previously, Synapse workspaces had a kind of database called a Spark Database. Spark databases had two key characteristics:
- Tables in Spark databases kept their underlying data in Azure Storage accounts (i.e. data lakes)
- Tables in Spark databases could be queried by both Spark pools and by serverless SQL pools.
To help make it clear that these databases are supported by both Spark and SQL and to clarify their relationship to data lakes, we have renamed Spark databases to Lake databases. Lake databases work just like Spark databases did before. They just have a new name.
In this update, you'll see the new name used throughout the service; for example, lake databases now appear under the new name in Synapse Studio.
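Lake databases work the same way as before: tables are backed by files in the data lake and can be queried from both Spark and serverless SQL pools. As a minimal sketch (the database and table names are placeholders, and df is assumed to be an existing Spark dataframe), creating a lake database table from Spark looks like this:
# Create a lake database and save a dataframe as a table in it. The table's data
# is stored in the workspace's data lake, and the table is also visible to
# serverless SQL pools through the shared metadata model.
spark.sql("CREATE DATABASE IF NOT EXISTS salesdb")

df.write.mode("overwrite").saveAsTable("salesdb.orders")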
Lake database designer now available in preview
Until now, you've had to write code to design databases, tables, and other objects. With this update, any new lake database you create supports a no-code design experience called the database designer, which is built into Synapse Studio.
Learn more about lake databases.
Database Templates and Database Designer
It can be hard to know where to begin if you want to start using a lake database. It’s unclear what tables or schemas you need. In this month’s update we are making it simpler to kickstart the creation of databases for industry verticals by providing database templates.
Database templates leverage decades of industry-specific experience to provide an extensible schema that helps you design your lake database. The schemas can be extended or adapted to your own business terminology and taxonomies, while Azure Synapse manages the storage of and connections to the schema. Database templates let you easily create the foundation of your enterprise data lake.
Database templates can be explored and customized through the no-code database designer inside Synapse Studio. The database designer gives you the power to create and define schemas, using the database templates as a starting point or starting from scratch.
Currently, Azure Synapse includes database templates for:
- Retail
- Consumer Goods
- Banking
- Energy
- Freight & Logistics
- Agriculture
- Energy & Commodity Trading
- Oil & Gas
- Utilities
We’ll add more industry-specific database templates over time.
To get started with the database templates, go to the Knowledge Center gallery in Synapse Studio and browse database templates.
Learn more by reading the database templates documentation.
SQL
Delta Lake support for serverless SQL is generally available
The Delta Lake format enables data engineers to update big data sets with guaranteed ACID consistency. It is an important part of the big data ecosystem that many enterprises rely on to build their cloud analytics solutions.
Azure Synapse has had preview-level support for serverless SQL pools querying the Delta Lake format. This enables BI and reporting tools to access data in Delta Lake format through standard T-SQL. With this latest update, the support is now Generally Available and can be used in production.
To read the contents of a Delta Lake folder, use the OPENROWSET function, specify the DELTA format, and provide the location of the folder.
SELECT TOP 10 *
FROM OPENROWSET(
        BULK 'https://mystorage.blob.core.windows.net/delta-lake/covid/',
        FORMAT = 'delta'
    ) AS rows;
Find more information in the article How to query Delta Lake files using serverless SQL pools.
Query multiple file paths using OPENROWSET in serverless SQL
Serverless SQL pools in Azure Synapse allow ad-hoc queries through the OPENROWSET statement. Until now, you could specify only a single file path with OPENROWSET. With this month's update, you can specify multiple file paths in the BULK parameter of the OPENROWSET statement. To make it even easier, Storage Explorer in Synapse Studio now lets you select multiple files and right-click to query them with a single SQL script.
Selecting multiple files this way generates code like the following, which demonstrates how the paths are used with the BULK parameter:
SELECT TOP 100 * FROM
OPENROWSET(
BULK (
'https://<storage_account_name>.dfs.core.windows.net/opendataset/example/month=1/data.parquet',
'https://<storage_account_name>.dfs.core.windows.net/opendataset/example/month=1/data2.parquet',
'https://<storage_account_name>.dfs.core.windows.net/opendataset/example/month=1/data3.parquet',
'https://<storage_account_name>.dfs.core.windows.net/opendataset/example/month=1/data4.parquet'
),
FORMAT = 'PARQUET'
) AS [result]
Keep in mind that the file paths do not have to be in the same folder. As the following code shows, each file can be in a different folder.
SELECT TOP 100 * FROM
OPENROWSET(
BULK (
'https://<storage_account_name>.dfs.core.windows.net/opendataset/example/month=1/data.parquet',
'https://<storage_account_name>.dfs.core.windows.net/opendataset/example/month=2/day=10/data.parquet',
'https://<storage_account_name>.dfs.core.windows.net/opendataset/example/year=2021/month=3/data.parquet',
'https://<storage_account_name>.dfs.core.windows.net/opendataset/example2/data.parquet'
),
FORMAT = 'PARQUET'
) AS [result]
To learn more read Query CSV files
Serverless SQL queries can now return up to 200GB of results
We have increased the maximum result set size for serverless SQL queries to 200GB. This is 2.5x the earlier limit of 80GB.
Handling invalid rows with OPENROWSET in serverless SQL
Querying raw files in the data lake for insights is not always easy: raw data often includes invalid rows that cause your queries to fail. You can now use OPENROWSET to reject these bad rows and place them in a separate file so you can examine them later.
In the OPENROWSET example below, the ERRORFILE_LOCATION and ERRORFILE_DATA_SOURCE options show how to control this behavior.
SELECT * FROM OPENROWSET(
BULK '/persons.txt', format='csv'
, DATA_SOURCE = 'gen2_hn_pass'
, MAXERRORS = 1
, FIRSTROW = 2
, ERRORFILE_LOCATION = 'rejectedrows'
, ERRORFILE_DATA_SOURCE = 'gen2_hn_pass'
)
WITH
(
name varchar(100),
birth_year int
) AS r
To learn more look for reject options in OPENROWSET and external tables.
Apache Spark for Synapse
Accelerate Spark workloads with NVIDIA GPU acceleration
Hardware-accelerated pools are now in public preview for Apache Spark in Synapse. With hardware-accelerated pools, you can speed up big data workloads using NVIDIA GPU-accelerated Spark pools. This can reduce the time needed to run data integration pipelines, score ML models, and more, which means less time waiting for data to process and more time identifying insights to drive better business outcomes.
GPU-accelerated Spark pools provide performance improvements for ETL and ML workloads with built-in support for NVIDIA RAPIDS and integration with Microsoft’s Hummingbird library.
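As an illustration, here is a minimal sketch (not taken from this update) of toggling the RAPIDS SQL acceleration setting in a notebook session on a GPU-accelerated pool; the storage path and column name are placeholders:
# Enable RAPIDS SQL acceleration for the current session (it is typically on by default on GPU pools).
spark.conf.set("spark.rapids.sql.enabled", "true")

# Run an ordinary Spark query; the path and column below are placeholders.
df = (spark.read.parquet("abfss://data@<storage_account>.dfs.core.windows.net/nyctaxi/")
          .groupBy("passenger_count")
          .count())

# With the RAPIDS accelerator active, the physical plan shows GPU operators
# (for example, GpuHashAggregate) in place of their CPU counterparts.
df.explain()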
To learn more:
- GPU-Accelerated Apache Spark Pools Ignite demo
- Save time on Apache Spark workloads on Azure Synapse with NVIDIA GPU acceleration
- GPU-Accelerated Apache Spark Pools - Azure Synapse Analytics
Mount remote storage to a Synapse Spark pool
Mounting remote storage is a common task for developers working with Spark. Previously, there was no direct way to do this in Synapse. Now you can mount and unmount remote storage (Blob, ADLS Gen2, Azure File Share) with new APIs in the mssparkutils package. The storage is mounted on all nodes of the Synapse Spark pool, including the driver and worker nodes.
The remote storage mount feature lets you seamlessly access and manage data without entering credentials every time. Once the mount succeeds, you can navigate the data as if it were a local file system.
Below is an example of how to mount a file system while taking advantage of linked services in Synapse, so that authentication details are not embedded in the mounting code.
mssparkutils.fs.mount(
"abfss://mycontainer@storageacct1.dfs.core.windows.net",
"/test",
{"linkedService":"mygen2account"}
)
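Once the mount succeeds, you can work with the mounted data directly. The following is a minimal sketch that assumes the mount above; myFile.csv is a placeholder file name:
# Get the local file system path of the mount point for use with local file APIs.
mount_path = mssparkutils.fs.getMountPath("/test")

# Read the mounted data with Spark through the synfs scheme, which is scoped to the current job.
job_id = mssparkutils.env.getJobId()
df = spark.read.load(f"synfs:/{job_id}/test/myFile.csv", format="csv")

# Unmount when you no longer need the storage.
mssparkutils.fs.unmount("/test")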
Learn more by reading Introduction to file mount in synapse - Azure Synapse Analytics
Natively read & write data in ADLS with Pandas
Pandas is an open-source Python package widely used for data science and data analysis tasks. It is the most-used DataFrame API in Python and has emerged as a de facto standard among Python developers and data scientists. The Pandas APIs enable data processing and analysis, starting with reading data in formats such as CSV, TSV, JSON, Excel, and Parquet from a wide range of sources.
In this month's update, we have added native support for Azure Storage to Pandas. This is a fundamental capability for any Python-based data science or data processing workload in Synapse: you can now easily read and write data to and from ADLS Gen2 with Pandas.
We have added support for read/write using Azure URL formats and FSSPEC URL formats in Azure Storage. This works with primary storage (the storage account attached to the Synapse workspace by default) as well as non-primary storage (any other Azure storage account).
Authentication is handled automatically for primary storage, so you can use Pandas without providing any credentials explicitly. For non-primary storage, we support two authentication approaches:
- Using a linked service (with authentication options such as storage account key, service principal, managed service identity, and credentials).
- Using storage options to directly pass a client ID and secret, SAS key, storage account key, or connection string.
BEFORE – read and write require special coding
# To read data
import fsspec
import pandas
adls_account_name = '' #Provide exact ADLS account name
adls_account_key = '' #Provide exact ADLS account key
fsspec_handle = fsspec.open('abfs://<container>/<path-to-file>', account_name=adls_account_name, account_key=adls_account_key)
with fsspec_handle.open() as f:
    df = pandas.read_csv(f)
# To write data
import fsspec
import pandas
adls_account_name = '' #Provide exact ADLS account name
adls_account_key = '' #Provide exact ADLS account key
data = pandas.DataFrame({'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]})
fsspec_handle = fsspec.open('abfs://<container>/<path-to-file>', account_name=adls_account_name, account_key=adls_account_key, mode="wt")
with fsspec_handle.open() as f:
    data.to_csv(f)
AFTER – read and write become natural one-liners with dataframes
#Read Operation
df = pandas.read_csv('abfss://files@synapseamlprodstorage.dfs.core.windows.net/Testdata.csv', storage_options = {'linked_service' : 'AzureDataLakeStorage'})
#Write Operation
data = pandas.DataFrame({'Name':['Azure', 'Bing', 'Office', 'Xbox'], 'Age':[20, 21, 19, 18]})
data.to_csv('abfss://files@synapseamlprodstorage.dfs.core.windows.net/pandas/write/test_2.csv', storage_options = {'linked_service' : 'AzureDataLakeStorage'})
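For non-primary storage, here is a minimal sketch of the storage-options approach using an account key; the account, container, path, and key values are placeholders:
# Read from a non-primary ADLS Gen2 account by passing credentials via storage_options.
import pandas

df = pandas.read_csv(
    'abfss://<container>@<storage_account>.dfs.core.windows.net/<path-to-file>.csv',
    storage_options = {'account_key': '<account-key>'}
)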
To learn more read Use Pandas to read/write Azure Data Lake Storage Gen2 data in serverless Apache Spark pool in Synapse Analytics
Dynamic allocation of executors for Spark
Dynamic allocation of executors for Apache Spark improves capacity utilization efficiency. With this release, you can enable dynamic allocation of executors at the pool, Spark job, or notebook session level. Dynamic allocation lets you customize how your clusters scale based on workload.
Enabling dynamic allocation allows the job to scale the number of executors within the minimum and the maximum number of executors specified.
You can enable dynamic allocation of executors for a pool in the pool's configuration settings. For notebooks, you can enable it via code in the session configuration:
%%configure -f
{
    "conf": {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "2",
        "spark.dynamicAllocation.maxExecutors": "6"
    }
}
To learn more read Automatically scale Azure Synapse Analytics Apache Spark pools.
Machine Learning
The Synapse ML library
With this update, we have released the Synapse ML library (previously known as MMLSpark). The Synapse ML library is an open-source library that simplifies the creation of massively scalable ML pipelines. Building production-ready distributed machine learning (ML) pipelines can be challenging, even for the most seasoned researcher or engineer. Composing tools from different ecosystems often requires considerable “glue” code, and many frameworks aren’t designed with thousand-machine elastic clusters in mind. The library unifies several existing ML frameworks and new Microsoft algorithms into a single, scalable API that’s usable across Python, R, Scala, and Java.
Developers can now build large-scale ML pipelines using the Microsoft Cognitive Services, LightGBM, ONNX, and other selected Synapse ML library features. It even includes templates to help users quickly prototype distributed ML systems, such as visual search engines, predictive maintenance pipelines, document translation, and more.
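To give a flavor of the API, here is a minimal sketch (not taken from this update) of training a distributed LightGBM classifier with the Synapse ML library; train_df and test_df are assumed to be existing Spark dataframes with a "features" vector column and a "label" column:
# Train a distributed LightGBM model using the same estimator/transformer pattern as Spark ML.
from synapse.ml.lightgbm import LightGBMClassifier

model = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    numIterations=100
)

trained = model.fit(train_df)
predictions = trained.transform(test_df)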
To learn more, read:
- Streamline collaboration and insights with simplified machine learning on Azure Synapse Analytics.
- Synapse ML Library
Getting started with state-of-the-art pre-built intelligent models
The Synapse ML library provides simple APIs for pre-built intelligent services, such as the Microsoft Cognitive Services, to quickly solve business and research challenges on large datasets. It enables developers to embed over 45 different state-of-the-art ML services directly into their systems and databases.
In the latest release, we added support for distributed form recognition, conversation transcription, and translation. These ready-to-use algorithms can parse a wide variety of documents, transcribe multi-speaker conversations in real time, and translate text to over 100 different languages.
To help you understand how to use it, we’ll walk through an example of using form recognition.
Let's start with a form saved as a JPG file in your data lake. To analyze this form, you would use PySpark code like the following:
from pyspark.sql.functions import col, flatten, regexp_replace, explode, create_map, lit
# AnalyzeLayout is one of the Cognitive Services transformers in the Synapse ML library
from synapse.ml.cognitive import AnalyzeLayout

# A dataframe with the path(s) of the form(s) to analyze
imageDf = spark.createDataFrame([
    ("<replace with your file path>/layout.jpg",)
], ["source",])

# cognitive_service_name is the name of the Cognitive Services linked service in your workspace
analyzeLayout = (AnalyzeLayout()
    .setLinkedService(cognitive_service_name)
    .setImageUrlCol("source")
    .setOutputCol("layout")
    .setConcurrency(5))

display(analyzeLayout
    .transform(imageDf)
    .withColumn("lines", flatten(col("layout.analyzeResult.readResults.lines")))
    .withColumn("readLayout", col("lines.text"))
    .withColumn("tables", flatten(col("layout.analyzeResult.pageResults.tables")))
    .withColumn("cells", flatten(col("tables.cells")))
    .withColumn("pageLayout", col("cells.text"))
    .select("source", "readLayout", "pageLayout"))
This code produces a dataframe containing the recognized text lines (readLayout) and table cell text (pageLayout) for each source document.
To learn more read Form Recognizer with Applied AI Service.
Building responsible AI systems with the Synapse ML library
After building a model, it's imperative that researchers and engineers understand its limitations and behavior before deployment. The Synapse ML library helps developers and researchers build responsible AI systems by introducing new tools that reveal why models make certain predictions and how to improve the training dataset to eliminate biases. The library includes scalable methods that can explain the behavior of any trained machine learning model.
In addition to supervised model explainability, the library also introduces several new capabilities for unsupervised responsible AI. With our new tools for understanding dataset imbalance, researchers can detect whether sensitive dataset features, such as race or gender, are over- or under-represented and take steps to improve model fairness.
Finally, Azure Synapse Analytics users can take advantage of the private preview of a distributed implementation of Explainable Boosting Machines, which combines the modeling power of gradient-boosted trees with the interpretability of linear additive models. Explainable Boosting Machines allow data scientists to learn high-quality nonlinear models without sacrificing their ability to understand a model's predictions.
STEP 1: Import feature balancer transformer
from synapse.ml.exploratory import FeatureBalanceMeasure
STEP 2: Load dataset and define features of interest
import pyspark.sql.functions as F
features = ["race", "sex"]
label = "income"
df = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/AdultCensusIncome.parquet")
# Convert the "income" column from {<=50K, >50K} to {0, 1} to represent our binary classification label column
df = df.withColumn(label, F.when(F.col(label).contains("<=50K"), F.lit(0)).otherwise(F.lit(1)))
STEP 3: Create a FeatureBalanceMeasure transformer. Use setSensitiveCols to set the list of sensitive features. Use setLabelCol to set the binary label column. Then, call the transform method with your dataset. Finally, visualize the resulting dataframe with the show method.
feature_balance_measures = (
FeatureBalanceMeasure()
.setSensitiveCols(features)
.setLabelCol(label)
.transform(df)
)
feature_balance_measures.show(truncate=False)
PREDICT is now GA for Synapse Dedicated SQL pools
The PREDICT function in Azure Synapse dedicated SQL pools empowers data professionals to easily deploy machine learning models and collaborate with data scientists. Synapse users can score machine learning models using the familiar T-SQL language without moving data outside secure data warehouse boundaries. PREDICT takes an ONNX (Open Neural Network Exchange) model and data as inputs. It is fully integrated into the T-SQL language and can be called multiple times in a single statement or as part of a data modification statement (INSERT, UPDATE, DELETE, or MERGE).
To try it out and learn more read Machine learning model scoring wizard (preview) for dedicated SQL pools.
Simple & scalable scoring with PREDICT and MLFlow for Apache Spark for Synapse
We are simplifying batch scoring at scale in Azure Synapse. The PREDICT keyword on Spark, now in public preview, drastically simplifies the handoff between an ML model producer and a person operationalizing the model for batch scoring (the model consumer). It does this by allowing you to point to MLFlow packaged and registered models in Azure Machine Learning directly from your Synapse workspace. Users can easily run predictions at large scale on Spark on Synapse using a variety of MLFlow model flavors to streamline and simplify the batch scoring process. PREDICT also supports referencing models stored in Azure Data Lake Storage Gen2.
The model flavors supported in this public preview are Sklearn, PyTorch, ONNX, and TensorFlow, and we are continuously working to expand this list. PyFunc models can also be used with PREDICT. As a user, all you need to do is point to your model and your data within the secured boundaries of Azure Synapse. Information such as the model type and the expected inputs and outputs is picked up from the MLFlow packaging format, which the model producer captured when training the model.
To learn more read Score machine learning models with PREDICT in serverless Apache Spark pools.
Retail AI solutions
Database templates in Azure Synapse are industry-specific schema definitions that provide a quick method of creating a database known as a lake database. After the data is loaded into your lake database, you can take advantage of pre-built AI and ML models that understand your data based on the lake database template schema. An example is the Retail - Product recommendation solution in the Knowledge Center gallery. When you use this solution, it creates a notebook you can start using immediately.
Knowing the shape of the data allows us to provide pre-built industry AI solutions. The AI solution for Retail Product Recommendation provides a robust and scalable recommendation engine for out-of-the-box development in Synapse. No additional data shaping is needed; the solution works on the data out of the box. This accelerates the productivity of existing or emerging data scientists solving a specific problem in the retail domain.
To learn more about the Retail AI solution, use this quick start guide.
Security
User-Assigned managed identities now supported in Synapse Pipelines in preview
Previously, linked services used for authentication in Synapse pipelines only supported system-assigned managed identity.
Now you can use user-assigned managed identities in linked services for authentication in Synapse pipelines and data flows. Unlike system-assigned managed identities, user-assigned managed identities are not tied to the lifecycle of a Synapse workspace. This allows organizations to use a single user-assigned managed identity across multiple Synapse workspaces and data sources, reducing the administrative overhead associated with granting and auditing access across multiple system-assigned managed identities.
In order to use user-assigned managed identity in a linked service, you must first create a credential for a user-assigned managed identity. Then create or edit a linked service and choose User-Assigned Managed Identity (Preview) as the authentication method, and select the newly created credential from the Credentials drop-down.
Step 1 – create a credential for a user-assigned managed identity
Step 2 – create a linked service and provide the credential
For more about using User-assigned managed identities (UAMI) and how to correctly configure your UAMI for use in linked services, see Using Credentials in Azure Synapse.
Browse ADLS Gen2 folders in an Azure Synapse Analytics workspace in preview
Directly browsing an ADLS Gen 2 account makes it easy to start analyzing data in the lake. With this update, we’ve added another way to do this. You can now browse an Azure Data Lake Storage Gen2 (ADLS Gen2) container or folder in your Azure Synapse Analytics workspace by connecting to a specific container or folder in Synapse Studio.
This is useful if your organization does not grant users the Storage Blob Data Contributor role (Azure RBAC) on the storage account, but instead grants POSIX-like access control lists (ACLs) on the container or specific folders to users' Azure AD identities. In this case, you can use this preview feature to browse the ADLS Gen2 container or folder.
To use this feature, first connect the ADLS Gen2 container or folder to your Synapse workspace. Right-click Azure Data Lake Storage Gen2 and select Connect to Azure Storage (preview). Once you provide the necessary details and select Connect, you can browse the contents of the container or folder under Attached Containers.
Step 1 – Connect to the Azure Data Lake Storage Gen2 container or folder
Step 2 – Browse the files and folders in the connected storage container or folder.
Read more about the capability and its prerequisites at Browse ADLS Gen2 folders (preview) in an Azure Synapse Analytics workspace.
Data Integration
Pipeline Fail activity
When orchestrating data processing jobs in Azure Synapse pipelines, you might occasionally want to throw an error in a pipeline intentionally. For example, a Lookup activity might return no matching data, or a Synapse Spark Notebook activity might finish with an error. Whatever the reason might be, now you can use a Fail activity in a pipeline and customize both its error message and error code.
Within your pipeline logic, you can add a Fail activity with your own custom message and error code that will be returned to the pipeline monitoring backend as a failure. Now, you can view the return code and message from the pipeline monitoring view.
To learn more read Execute a Fail activity in Azure Data Factory and Synapse Analytics
Mapping Data Flow gets new native connectors
Data flows allow data transformation using a visual designer instead of writing code. We've added two new native mapping data flow connectors for Synapse Analytics: you can now connect directly to your AWS S3 buckets and your Azure Data Explorer clusters for data transformations.
With these additional connectors, you can build ETL patterns at Spark scale in a code-free design environment without ever touching the Spark compute. Azure Integration Runtimes allow you to define the Spark environment and provide a serverless Spark compute for your data transformation pipelines.
For example, you can use the ADX connector as a source in a data flow to read NYC taxi data.
To learn more read Mapping Data Flow gets new native connectors.
Synapse Link
Synapse Link enables near real-time analytics over your operational data with a simple experience and no ETL to manage. By bringing your operational data to Synapse, it unlocks limitless cloud analytics capabilities and eliminates barriers to doing analytics over operational and business application data.
Synapse Link for Dataverse
Synapse Link for Dataverse accelerates time to insights from your valuable business applications data using Azure Synapse Analytics. Previously available in Preview, Synapse Link for Dataverse is now Generally Available.
To learn more read Create an Azure Synapse Link for Dataverse with your Azure Synapse Workspace.
Custom partitions for Synapse Link for Azure Cosmos DB in preview
Synapse Link for Azure Cosmos DB uses the Azure Cosmos DB analytical store to let Synapse gain insights from Cosmos DB data. Getting the right partitioning for analytical data is critical to getting the best performance, but until now there has been no way to control the partitioning used by the analytical store. With this update, you can improve execution times for your Spark queries by creating custom partitions based on fields frequently used in your queries.
To learn more read Support for custom partitioning in Azure Synapse Link for Azure Cosmos DB.
Additional capabilities released recently for Synapse Link for Cosmos DB:
- Synapse Link for serverless accounts
- Full-fidelity schema for Core (SQL) API accounts.
Synapse Link for SQL Server
Synapse Link is all about accelerating time to insight from the data you already have. Synapse Link for SQL Server extends this to transactional data residing on-premises. You can now apply for the preview of Synapse Link for SQL Server 2022.