The familiar concept of Data Virtualization in SQL Managed Instance offers the capability to read remote Parquet or CSV files through OPENROWSET or External Tables, which you can then join with local data.
With the Create External Table As Select (CETAS) functionality for Azure SQL Managed Instance, generally available as of May 18, 2023, you can now export local data from your Managed Instance as Parquet or CSV files to Azure Storage accounts and query it as an External Table. There are two main scenarios this functionality facilitates:
- Offloading data from the local storage of SQL Managed Instance to cheaper external storage, to reduce costs when the offloaded data is expected to be rarely used and no longer updated.
- Exporting local data from SQL Managed Instance into Azure storage accounts, to be picked up by analytics solutions such as Azure Synapse Analytics. This is useful when quick exports are needed and setting up an ETL pipeline would be too cumbersome.
Scenario: Offloading "cold" data into external storage
There are many situations where your operational database grows large and contains data that is rarely accessed anymore: sales orders from a few years ago, audit or application logs, and so on.
This data may only be occasionally requested by the database's clients, yet it takes up a lot of space. If we can't simply delete it from the database, we can export it to cheaper storage (Azure Blob Storage or Azure Data Lake Storage Gen2), create an external table on top of it, and then decide to remove it from the database's local storage. The data can still be queried with T-SQL from its remote location. We could go a step further and create a view that unions the external table with the remaining data in the local table, making it almost transparent to our application workloads that part of the underlying data has changed location. Important to know when planning for this scenario: external tables created via CETAS are read-only, so you can't insert into them.
A derivative scenario: using Azure storage to distribute a single shared data set to multiple consumers (for example, in multi-tenant application setups). A "master" database, not belonging to any specific tenant, could export the relevant data into the commonly used storage for the tenants to consume.
Scenario: Exports for analytics purposes avoiding ETL setups
Making informed business decisions has become the norm for any person or business striving to be successful. Being able to effectively analyze the data generated by the various systems that power the business, and to plan next steps based on that analysis, is an important advantage.
Using data for business decisions is not a new concept, but making it happen is getting increasingly hard: data volumes are growing at an unprecedented pace, and the sources are scattered across more and more operational databases. The larger the context, the better the chance for competitive insights; however, building a larger context requires extracting data from more sources.
Especially when quick experiments are needed to validate assumptions, adding new sources to an existing reporting setup should be as easy as possible, to support agility and innovation. Traditional ETL pipelines are often too slow and expensive to establish and run, reducing this agility.
As a transactional database service, Azure SQL Managed Instance is one such data source for analytics scenarios, and the CETAS functionality is aimed at helping companies export their transactional data towards their analytics solutions in an agile fashion.
Using the feature
Below are the steps required to start using this feature; almost all of them pertain to security:
- Enable the feature
- Prepare the destination storage account
- Ensure network connectivity between your SQL MI and the target storage account
- Ensure security/authorization for your MI to be able to write to the storage account
- Use the feature
Let’s break these down into further details.
1. Enable the feature
Unlike the related Data Virtualization feature, which is ready to use out of the box on any Azure SQL Managed Instance, CETAS is purposely turned off by default: it is fine to be able to query remote data (through Data Virtualization), but exporting data needs to be under firmer control. It therefore can't be enabled through T-SQL, and a user with special permissions is required to enable the feature. (This ensures separation of duties, typically a DBA versus a security-oriented role.) The principal enabling the feature needs to belong to the Security Admin, Contributor, or Owner RBAC role. Alternatively, a custom role can be used, in which case it needs Read and Write actions on Microsoft.Sql/managedInstances/serverConfigurationOptions.
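If you go the custom-role route, a PowerShell sketch like the one below can create such a role. The role name and description are illustrative, and the action strings are assumed to follow the resource path above; verify them against the current Azure RBAC operations list before use.
# Sketch: a minimal custom role for enabling CETAS (names are illustrative).
$role = Get-AzRoleDefinition -Name "Reader"   # clone an existing definition as a template
$role.Id = $null
$role.Name = "SQL MI Server Configuration Operator"
$role.Description = "Can read and write SQL MI server configuration options."
$role.Actions.Clear()
$role.Actions.Add("Microsoft.Sql/managedInstances/serverConfigurationOptions/read")
$role.Actions.Add("Microsoft.Sql/managedInstances/serverConfigurationOptions/write")
$role.AssignableScopes.Clear()
$role.AssignableScopes.Add("/subscriptions/<YourSubscriptionID>")
New-AzRoleDefinition -Role $role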
CETAS can be enabled via PowerShell or the CLI by setting the configuration option allowPolybaseExport to 1.
The example below shows how to achieve this with PowerShell. Note that this code requires at least version 4.6.0 of the Az.Sql module; you can install the latest version by running Install-Module -Name Az.Sql.
Adjust the example with your Azure subscription ID, resource group name, and managed instance name:
Connect-AzAccount
$SubscriptionID = "<YourSubscriptionIDHere>"
Select-AzSubscription -SubscriptionId $SubscriptionID
# Set ServerConfigurationOption with name "allowPolybaseExport" to 1
Set-AzSqlServerConfigurationOption -ResourceGroupName "resource_group_name" -InstanceName "ManagedInstanceName" -Name "allowPolybaseExport" -Value 1
If you prefer Azure CLI, the same can be achieved as shown below. This requires Az version 9.7.0 and Azure CLI version 2.48.0; if you are running an earlier version of Azure CLI, consider running az upgrade to get to the latest version.
Again, adjust the subscription ID, resource group, and managed instance name:
az login
# The variable syntax below assumes a PowerShell session; in bash, use subscriptionId="..."
$subscriptionId = "<your subscription ID>"
az account set --subscription $subscriptionId
# Set server-configuration-option with name "allowPolybaseExport" to 1
az sql mi server-configuration-option set --resource-group 'resource_group_name' --managed-instance-name 'ManagedInstanceName' --name 'allowPolybaseExport' --value '1'
As mentioned earlier, setting this option through T-SQL with the sp_configure stored procedure is not allowed. Due to separation of duties, it needs to be set via PowerShell, the CLI, or the REST API.
Also bear in mind that the principal executing the script needs to be in an appropriate RBAC role, as previously explained.
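For completeness, here is a hedged sketch of the REST route using Invoke-AzRestMethod. The resource path follows the Microsoft.Sql/managedInstances/serverConfigurationOptions type named earlier, but the exact path shape and api-version are assumptions; verify them against the REST API reference before relying on this.
# Sketch only: the resource path and api-version are assumptions; verify against the REST docs.
Invoke-AzRestMethod -Method PUT `
    -Path "/subscriptions/<YourSubscriptionID>/resourceGroups/resource_group_name/providers/Microsoft.Sql/managedInstances/ManagedInstanceName/serverConfigurationOptions/allowPolybaseExport?api-version=2022-08-01-preview" `
    -Payload '{ "properties": { "value": 1 } }'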
The easiest way to check whether CETAS is enabled is to run the following T-SQL query on your managed instance:
SELECT * FROM sys.configurations WHERE name ='allow polybase export';
Alternatively, you can check the documentation of the feature, to see how to achieve the same using PowerShell, Azure CLI or API calls.
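For example, with PowerShell, reading the option back should look roughly like this. This is a sketch, assuming the Get-AzSqlServerConfigurationOption cmdlet from the same Az.Sql module version used above:
# Read back the current value of "allowPolybaseExport" (sketch; cmdlet assumed from Az.Sql 4.6.0+)
Get-AzSqlServerConfigurationOption -ResourceGroupName "resource_group_name" `
    -InstanceName "ManagedInstanceName" -Name "allowPolybaseExport"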
2. Prepare the storage account
CETAS can currently write data into Azure storage accounts only. The supported types are Azure Blob Storage and Azure Data Lake Storage Gen2.
Ensure that you have a storage account with a container created.
Ensure that the container is empty, or that the target path for writing the files is empty at the level of the bottom-most folder (the one where the Parquet/CSV files will be written). Otherwise, creating the external table will fail when SQL MI attempts to write into the non-empty folder.
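If you're starting from scratch, a container can be created with the Azure CLI; the account and container names below are placeholders:
# Create a blob container in an existing storage account (names are placeholders)
az storage container create --account-name <storageaccount> --name <container> --auth-mode login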
3. Network connectivity
Your SQL Managed Instance needs network access to your storage account. Normally, a SQL MI comes pre-configured with "blanket allow" access to Azure Cloud (including Storage), but security-conscious organizations often lock this down to varying extents, which may mean you need to configure network access from your MI to your specific storage account.
There are a few ways to achieve this, including Private Endpoints and Service Endpoints. If you use Service Endpoints, check whether any Service Endpoint Policies (SEPs) are present in your Managed Instance's VNet, and if so, make sure your destination storage account appears in their list of allowed resources.
If you fail to set up connectivity, or forget to take care of the SEP where one is needed, you will get connection errors when using CETAS.
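As one concrete example, if you go the Service Endpoint route, enabling the Microsoft.Storage endpoint on the MI subnet looks roughly like the sketch below; the VNet and subnet names are placeholders:
# Enable the Microsoft.Storage service endpoint on the managed instance subnet (sketch).
# Note: --service-endpoints overwrites the subnet's existing endpoint list, so include
# any endpoints that are already configured.
az network vnet subnet update --resource-group 'resource_group_name' `
    --vnet-name '<MIVnetName>' --name '<MISubnetName>' `
    --service-endpoints Microsoft.Storage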
Further details about networking and connectivity are outside the scope of this article, but the guidance on practicing security when connecting your MI to storage accounts is a good place to read more.
4. Authorization
Your SQL Managed Instance needs to be allowed to write into the designated storage account.
Depending on whether you use Managed Identity or SAS tokens, the minimum permission levels are as follows (a PowerShell sketch for the Managed Identity route follows the list):
- Managed Identity authorization requires at minimum the Storage Blob Data Contributor RBAC role to be assigned to your MI's managed identity principal.
- SAS token authorization requires that your SAS token has Read, Write, and List permissions. Technically, you could get away without the List permission: you would still be able to create the external table and export data to the storage account, but you would not be able to query the external table you just created.
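For the Managed Identity route, assigning the role can look like this sketch; the resource names are placeholders, and the MI is assumed to have a system-assigned identity:
# Grant the MI's system-assigned identity write access to the storage account (sketch).
$mi = Get-AzSqlInstance -ResourceGroupName "resource_group_name" -Name "ManagedInstanceName"
New-AzRoleAssignment -ObjectId $mi.Identity.PrincipalId `
    -RoleDefinitionName "Storage Blob Data Contributor" `
    -Scope "/subscriptions/<YourSubscriptionID>/resourceGroups/resource_group_name/providers/Microsoft.Storage/storageAccounts/<storageaccount>"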
Tip: if the permission set is not adequate, regardless of authorization type, your CETAS query will fail with HRESULT 0x8007000d.
5. CETAS
Once you have prepared everything following the steps above, you can use the feature by executing T-SQL code. Let's walk through an example, broken down into a series of statements building up to the final CETAS statement. We will use the AdventureWorks2019 database, but it takes almost no effort to adjust this example to any database you want to test with.
In order to complete this example, we will do the following:
- Create credentials for accessing storage account
- Create external data source
- Define the designated file format
- Create the external table and select data into it
1. Credentials
First, we need to prepare our credentials. Choose the flavor of credential creation based on your preference, and replace the content <within angled brackets> with your own values:
USE [AdventureWorks2019];
GO
-- A database master key is required to protect the credential secrets
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Strong Password Here>';
GO
-- If you prefer Managed Identity authentication:
CREATE DATABASE SCOPED CREDENTIAL [CETASDemoCredential]
WITH IDENTITY = 'Managed Identity'; -- this is a string constant, do not replace
GO
-- If you prefer SAS token authentication:
CREATE DATABASE SCOPED CREDENTIAL [CETASDemoSASCredential]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE', -- this is a string constant, do not replace
SECRET = '<Your SAS Token here>';
GO
2. External data source
The data source tells the CETAS command which credential to use and where to put the data.
This step is where mistakes happen most often! The URL pattern for this feature is different from the Azure Storage URL patterns you may be used to: if you simply copy the path to your folder from Azure Storage Explorer and paste it unmodified into your T-SQL statements, you are going to get errors. You need to adjust the path, as in the example below.
Observe these rules when constructing the location URL:
- The location URL starts with abs:// rather than https://. This tells SQL MI to use the latest Data Virtualization tech stack; do not use HTTPS in the URL.
- The location URL starts with the container name, followed by an @ sign, then the storage account name, then .blob.core.windows.net, and finally any path you want to point to within the container.
-- URL pattern to use: abs://container@storageaccount.blob.core.windows.net/YourFolders/
CREATE EXTERNAL DATA SOURCE [CETASDemoDataSource]
WITH (
    LOCATION = 'abs://<container>@<storageaccount>.blob.core.windows.net/',
    CREDENTIAL = [CETASDemoCredential] -- or [CETASDemoSASCredential] if you prefer SAS
);
GO
3. Designated file format
The External File Format tells CETAS how to output the data.
For SQL Managed Instance, the supported output formats are Parquet and CSV, and providing a file format is mandatory. If you wish to use CSV, use the DELIMITEDTEXT constant instead of PARQUET, as indicated in the comments below (a CSV variant is also sketched after the example).
The file format is also where you control various other options, such as compression, field separators, and date formats; see the CREATE EXTERNAL FILE FORMAT documentation for details.
CREATE EXTERNAL FILE FORMAT [CETASDemoFileFormat]
WITH (
    FORMAT_TYPE = PARQUET, -- use DELIMITEDTEXT for CSV
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec' -- compression is optional
);
GO
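For reference, a minimal CSV variant could look like the sketch below; the format name is illustrative, and the format options shown are common ones to adjust for your data:
-- Sketch of a CSV (DELIMITEDTEXT) file format; name and options are illustrative.
CREATE EXTERNAL FILE FORMAT [CETASDemoCsvFormat]
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',', -- column separator in the output files
        STRING_DELIMITER = '"', -- quote character around string values
        ENCODING = 'UTF8'
    )
);
GO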
4. Create external table as select
In this step, we wire everything together by executing the CETAS command. We will select the SalesOrderHeader rows from 2011 into the external table. You can construct your own SELECT statement here if you are not following the AdventureWorks2019 example.
CREATE EXTERNAL TABLE SalesOrdersExternal
WITH (
    LOCATION = 'OrdersData/', -- appended to the data source path; important to end this string with /
    DATA_SOURCE = [CETASDemoDataSource],
    FILE_FORMAT = [CETASDemoFileFormat]
)
AS
SELECT *
FROM [Sales].[SalesOrderHeader]
WHERE YEAR(OrderDate) = 2011;
The statement exported 1607 rows into the Parquet file.
We can now verify by querying our external table:
SELECT COUNT(*) FROM SalesOrdersExternal;
We can inspect the execution plan to make sure we are indeed querying an external table; notice the Remote Scan operator in the plan.
As you can see, the external table is queryable like any other object. You could use it in joins, views, stored procedures, etc.
Applying this in a business context
Continuing the scenario above, at this point we could create a view that does a UNION ALL of the SalesOrderHeader table with SalesOrdersExternal, and then delete the exported records from the SalesOrderHeader table. With this, we have effectively archived our sales data from 2011, and we can repeat the process for each subsequent sales year, reducing the growth of the SQL MI's local storage for this table. A sketch of the pattern follows.
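Here is a minimal sketch of that pattern; the view name SalesOrderHeaderAll is hypothetical, the external table is assumed to live in the dbo schema, and you would want to verify the exported data before deleting anything:
-- Expose local and archived rows through a single view (view name is hypothetical).
CREATE VIEW [Sales].[SalesOrderHeaderAll]
AS
SELECT * FROM [Sales].[SalesOrderHeader] -- current rows in local storage
UNION ALL
SELECT * FROM [dbo].[SalesOrdersExternal]; -- archived rows in Azure Storage
GO
-- Only after verifying the exported data, remove the archived rows locally:
DELETE FROM [Sales].[SalesOrderHeader]
WHERE YEAR(OrderDate) = 2011;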
Conclusion
CETAS is an important building block in the broader Data Virtualization story.
We covered a couple of typical use cases; this article aims to introduce the feature, explain the concept, and help you try it without hitting errors along the way.
There are more advanced topics to consider, such as performance optimization, partitioned external tables, and troubleshooting with Extended Events (XEvents). We will cover these aspects in follow-up articles and link them here.
Let us know in the comments what interests you about this feature, how you are using it, and what you'd like to read about next. If you want to suggest a feature improvement, we actively monitor the ideas portal for your input and requirements.
Thanks!
P.S. We've also prepared a CETAS demo video for you.