Public Preview: Separation of scan levels for Azure SQL Database and Snowflake in Microsoft Purview

Henry_Shi · ‎Dec 27 2023

Scanning is a key function that captures metadata from data sources and brings it to Microsoft Purview. In Microsoft Purview Data Map terminology, there are three different levels of scanning based on the metadata scope and functionalities:

L1 scan: Extracts basic information and metadata like file name, size, and fully qualified name
L2 scan: Extracts schema for structured file types and database tables
L3 scan: Extracts schema where applicable and subjects the sampled file to the system and custom classification rules

Until now, the scan levels have been integral parts of the scanning process that could not be separated. For example, users could not disable data sampling and classification on data sources that already support classification in Microsoft Purview. To support more usage scenarios and increase flexibility, we are starting to add support for separating scan levels for different data sources, and we are pleased to announce the public preview of this feature for scanning Azure SQL Database and Snowflake. Below, we use scanning Azure SQL Database as an example to introduce the feature and show some sample scenarios.

Overview of the feature

When data source administrators set up a new scan or edit an existing scan for Azure SQL Database, they will find a new setting that lets them customize the scan level.

By default, “Auto detect” is selected, which means Microsoft Purview applies the highest scan level available for the data source. For Azure SQL Database, “Auto detect” resolves to “Level-3” when the scan is executed, because this source already supports classification. The scan level in the scan run detail shows the actual level applied.

For all scan runs in the scan history that completed before the feature was introduced, the scan level is set and displayed as “Auto detect” by default.

When a higher scan level becomes available for a data source, the saved or scheduled scans that have scan level set to “Auto detect” will automatically apply the new scan level. For example, if classification as a new feature is enabled for a given data source, all existing scans on this data source will apply classification automatically.
The scan level setting will show in the scan monitoring interface for each scan run.
If “Level-1” is selected, scanning will only return basic technical metadata such as asset name, asset size, and modified timestamp, depending on which metadata a specific data source makes available. For Azure SQL Database, asset entities like tables will be created in Microsoft Purview Data Map but without table schema extraction. (Note: users can still see the table schema via live view if they have the necessary permissions in the source system.)
If “Level-2” is selected, scanning will return table schemas as well as basic technical metadata, but data sampling and classification will not be performed. For Azure SQL Database, table asset entities will have the table schema captured without classification information.
If “Level-3” is selected, scanning will perform data sampling and classification. This was the standard behavior for Azure SQL Database scanning before the scan level setting was introduced.
If a scheduled scan is set to a lower scan level and later modified to a higher scan level, the next scan run will automatically perform a full scan and all existing data assets from the data source will be updated with metadata introduced by a higher scan level setting. For example, when a scheduled scan set with “Level-2” on an Azure SQL Database is changed to “Level-3”, the next scan run will be a full scan and all existing Azure SQL Database table/view assets will be updated with classification information, and all scans thereafter will resume as incremental scans set with “Level-3”.
If a scheduled scan is set to a higher scan level and later modified to a lower scan level, the next scan run will continue to perform an incremental scan and all new data assets from the data source will only have metadata introduced by a lower scan level setting. For example, when a scheduled scan set with “Level-3” on an Azure SQL Database is changed to “Level-2”, the next scan run will be an incremental scan and all new Azure SQL Database table/view assets added in Microsoft Purview Data Map will have no classification information. All existing data assets will still keep the classification information generated from the previous scan set with “Level-3”.
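
The rules above can be summarized in a short Python sketch. This is only an illustrative model of the behavior described in this post, not Microsoft Purview code or its API: the names ScanLevel, resolve_level, and next_run_type are hypothetical, and "Auto detect" is represented by None as an assumption made for this example.

```python
from enum import IntEnum
from typing import Optional


class ScanLevel(IntEnum):
    """Hypothetical model of the three scan levels described in this post."""
    LEVEL_1 = 1  # basic technical metadata only
    LEVEL_2 = 2  # basic metadata plus schema
    LEVEL_3 = 3  # schema plus data sampling and classification


def resolve_level(configured: Optional[ScanLevel],
                  highest_supported: ScanLevel) -> ScanLevel:
    """Return the level actually applied to a scan run.

    Assumption for this sketch: None stands for the "Auto detect" setting,
    which resolves to the highest level the data source supports.
    """
    if configured is None:
        return highest_supported
    return configured


def next_run_type(previous: ScanLevel, current: ScanLevel) -> str:
    """Return the type of the next run after the scan level setting is edited.

    Raising the level triggers a full scan so existing assets pick up the
    additional metadata. Lowering it (or leaving it unchanged) keeps the
    scheduled scan incremental.
    """
    return "full" if current > previous else "incremental"


# Azure SQL Database supports classification, so "Auto detect" resolves to Level-3.
applied = resolve_level(None, highest_supported=ScanLevel.LEVEL_3)

# Changing a scheduled scan from Level-2 to Level-3 leads to one full scan.
upgrade = next_run_type(ScanLevel.LEVEL_2, ScanLevel.LEVEL_3)

# Changing from Level-3 to Level-2 keeps the next run incremental.
downgrade = next_run_type(ScanLevel.LEVEL_3, ScanLevel.LEVEL_2)
```

The sketch only covers an explicit change to the scan level setting. It does not model what Microsoft Purview does with metadata already captured, which, as noted above, is kept when the level is lowered.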

Sample usage scenarios

1. Quickly profile a data source like an Azure SQL Database

Users want to scan a data source to develop a general understanding of the data assets in the source data system. For example, data source administrators can set the scan level to “Level-1” when scanning an Azure SQL Database to get information such as the number of tables and how tables are distributed across schemas. A scan run set to “Level-1” runs faster and saves time and cost on a large data source because no schema extraction, data sampling, or classification is performed.

2. Capture table schema from Azure SQL Database for data discovery

Users want to enable data discovery based on the table/view schemas without the need to identify sensitive information. For example, data source administrators can set the scan level to “Level-2” when scanning an Azure SQL Database. The table/view schema information will be captured, and data consumers can discover, annotate, and manually classify the table/view columns in the Microsoft Purview Data Catalog.

3. Reduce the workload in source data system introduced by classification

Users want to reduce the workload that classification introduces in their source data systems, because classification samples data in the source, for example by sampling Azure SQL Database tables/views. Data source administrators can schedule and run a scan set to “Level-2”, which has a lower workload impact, and later either change the scan level to “Level-3” or run a separate scan set to “Level-3” to perform classification as needed.

4. Enable classification to detect sensitive information and govern the data

This is the most common scanning scenario applied for all data sources with classification support in Microsoft Purview today.

Available resources and limitations

Currently, this feature is only available for Azure SQL Database and Snowflake on Azure IR and Managed VNet IR v2. Support for more sources and integration runtimes will come in the future.
