Data virtualization now generally available in Azure SQL Managed Instance

Microsoft

Sep 12, 2022

[Edit: Video with SQL MI Data virtualization use cases and demo.]

We are excited to announce the general availability (GA) of data virtualization capabilities in Azure SQL Managed Instance, with improved query performance and managed identity as a new supported option for authenticating to storage accounts.

The data virtualization enables you to execute Transact-SQL (T-SQL) queries on files storing data in common data formats in Azure Data Lake Storage Gen2 or Azure Blob Storage and combine it with relational data stored locally in the managed instance using logical joins. This way you can transparently access external data while keeping it in its original format and location. There is no data duplication or need to run and maintain ETL processes, which means that you can extract and deliver insights faster. The supported file formats are Parquet, CSV, and JSON.

What’s new

The feature is now enabled by default on all the existing and newly created instances. No need to run sp_configure command to enable it before running queries.

Managed identity authentication is now available for authenticating to storage account hosting the files. Managed Identity - also known as MSI - is a feature of Azure Active Directory (Azure AD) that provides instances of Azure services - like Azure SQL managed instance - with an automatically managed identity in Azure AD. This identity can be used to authorize the request for data access in non-public storage accounts. For comparison of the managed identity authentication and Storage Access Key (SAS) authentication, check the data virtualization feature documentation.

Significant query performance improvements, especially for data sets partitioned across subfolders. By providing filter predicates aligned with partition columns – typically year, month, day and similar - you can instruct managed instance to query only particular folders and files. This is known as partition pruning or partition elimination. Doing so reduces the number of files and the amount of data the query needs to read and process, resulting in a better performance.

When to use data virtualization

Challenges with data are common across industries: there’s more and more data; it needs to be analyzed faster than ever; and it’s coming in an increasing variety of formats. To respond to the challenges, you might be looking for cost-efficient and reasonably fast ways to modernize your data platform and make it more flexible and future ready. In many cases the preferred approach is the one allowing you to achieve the goals while using the data engine you’re familiar with and the language, frameworks, and tools of your choice.

Typical use cases for data virtualization include:

Providing always up-to-date relational abstraction on top of your raw or disparate data without relocating or transforming it. This way any application capable of running T-SQL queries can consume the data: from BI solutions like Power BI, to line of business applications, to client tools like SQL Server Management Studio or Azure Data Studios. This is an easy and elegant way to expand the list of data sources for your operational reporting solutions.
Reducing managed instance storage consumption and total cost of ownership (TCO) by archiving cold data to Azure Data Lake Storage, keeping it still within the reach of interactive queries and joins
Exploratory data analysis of data sets stored in the most common file formats. This technic is typically used by data scientists and data analysts to collect insights about the data set, from basic ones like number and structure of records, to extracting important variables, detecting outliers and anomalies, and testing underlying assumptions

There are also scenarios that don’t fall into data virtualization category but can benefit from the ability to query files using T-SQL. For example, inserting results of an external query to a local table can simplify existing ETL pipelines or even remove the need for dedicated data integration tools.

Data virtualization provides two ways of querying files intended for different sets of scenarios:

OPENROWSET syntax – optimized for ad-hoc querying of files. Typically used to quickly explore the content and the structure of a new set of files.
External tables – optimized for repetitive querying of files using identical syntax as if data were stored locally in the database. External tables require several preparation steps compared to the OPENROWSET syntax but allow for more control over data access. External tables are typically used for analytical workloads and reporting.

Getting started

If you are familiar with PolyBase feature of SQL Server, you may have already recognized the scenarios and underlying concepts. Data virtualization capabilities of Azure SQL Managed Instance use the same syntax as PolyBase and enrich it further with new options. PolyBase queries running on your SQL Server instance and targeting files stored in Azure Data Lake Storage or Blob Storage will continue working on your managed instance with minimal intervention to specify location prefix corresponding to the type of external source and endpoint, like abs:// instead of the generic https:// location prefix.

For simplicity, we are going to use publicly available Bing COVID-19 dataset that allows anonymous access. To quickly start exploring the data set, you can simply run the following query:

--Number of confirmed Covid-19 cases per country/territory during the first pandemic year:
SELECT countries_and_territories, sum(cases) FROM
	OPENROWSET(
        BULK 'abs://public@pandemicdatalake.blob.core.windows.net/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
        FORMAT='PARQUET'
    ) AS [CovidCaseExplorer]
WHERE year = '2020'
group by countries_and_territories
order by sum(cases) desc

For more complex use cases with non-public data sets and advanced capabilities of data virtualization with Azure SQL Managed Instance, explore the sample queries in the feature documentation referenced at the end of the article.

Summary

Data virtualization capability - now generally available in Azure SQL Managed Instance - introduces a data abstraction concept enabling you to transparently query data stored in files in Azure Data Lake Storage and Azure Blob Storage. It enables any application capable of running T-SQL queries to consume that data by simply connecting to managed instance. It can help you optimize the TCO thanks to lower storage costs compared to storing data in a database. Finally, it improves time-to-insight by offering instant access to up-to-date data without the need to transform or relocate it. General availability brings a new, more secure way of authentication to storage account using Managed Identity, and significant performance improvements for data sets partitioned across subfolders. The feature is now enabled by default on all the existing and newly created instances.