How Azure Synapse Analytics enables you to run T-SQL queries over your Data Lake files

JovanPop · ‎Jun 08 2020

One of the important features of Azure Synapse Analytics is the ability to analyze the files placed on Azure Data Lake Storage (ADLS) and Azure Blob Storage using Transact-SQL language. Azure Synapse provides provisioned and serverless SQL endpoints that enable you to query large amounts of data using well-known Transact-SQL language. In this article you will learn how to leverage Synapse SQL serverless endpoint to explore your files, build a relational layer over your storage files, and use SQL serverless features to analyze your files and implement security mechanism that controls access to your data.

Querying files on Azure storage

There are two main scenarios that you will use to analyze the files on Azure storage:

Ad-hoc data exploration and analytics where you can read the content of files or explore file schema using OPENROWSET function.
Designing logical data warehouse (LDW) that represents a thin logical relational layer over your Azure storage. LDW enables you to create external tables and define data access rules on top of your storage files. LDW is a logical adapter for any tool that can use Transact-SQL language and that need to access the files instead of database tables.

Tools and applications that can execute Transact-SQL queries can read the content of files without noticing the difference between externally stored and managed data. OPENROWSET function, external tables, and views represent abstractions on top of physical files that provide expected relational interface on externally stored data.

You can use Azure Synapse Studio to query your files using web interface, or desktop tools such as Azure Data Studio or SQL Server Management Studio. The tools like Power BI workspace or Power BI Desktop that use Transact-SQL language supported in serverless SQL endpoint in Azure Synapse analytics can read content of files and create the reports on files.

You can analyze various file formats such as CSV, PARQUET (including support for nested types), and JSON. You can either read single file, specify complex file pattern with wildcards to read data from multiple files within a storage, and even optimize performance using partition elimination with built-in file metadata functions FILEPATH and FILENAME. These are new functionalities introduced in Transact-SQL language that you can easily integrate with other standard SQL concepts such as views.

Securing the access to your files

You can use advanced security mechanisms to control who can access your files and control access to your workspace, database objects, and data. The first security layer in any mission critical system is network security control. You can configure IP firewall rules to control what are the IP addresses that can access your service, or setup Managed private endpoints that provide private IP addresses that can be accessed from within your VNet.

Once you define your network security strategy, you can allow users to authenticate in order to run the queries. User can authenticate using Azure AD authentication with their own Azure identities or SQL authentication using simple username and password. Azure AD authentication is advanced authentication option that provides centralized user management and advanced security features such as Multi-factor authentication.

You can define advanced permission rules and control what external tables and views can be used to access the storage. You can use granular permission model to GRANT or DENY access to some external tables, the groups of tables organized in schemas, or entire database. You can also GRANT or DENY access to individual users or the groups of users organized in database roles. This is flexible and proven enterprise-grade security model that is used in Azure SQL Database and SQL Server database engines. You can also enable ad-hoc storage data analysts to directly read content of files by granting them permission to execute OPENROWSET function on particular storage accounts.

Third layer of security is storage access control where you can define impersonation mechanism that database users can use to access the files on storage. The recommended impersonation mechanism is Azure AD passthrough authentication where the identity of Azure AD user is directly provided to Azure storage. This mechanism enables your Storage administrators to define who can access files on storage layer by assigning Azure storage RBAC roles such as Storage Data Reader role. As an alternative, your Storage administrator can provide you Shared Access Signature that define time-limited access to some files in storage. The third option is to allow workspace to access files and impersonate the database user as using Azure Synapse workspace identity. This option enables your Storage owner to allow access to single identity and let you fully control access using database permission model.

Serverless consumption model

Serverless SQL queries in Azure Synapse Analytics are charged using pay-per-use consumption model. There is no pre-provisioning of resources, so you don’t need to scale up or down your compute depending on your workload. Azure Synapse will dynamically allocate compute needed to process the data depending on your query complexity and data size, and you are paying only for processed data. See the pricing page for more details. Metadata queries and actions where you setup credentials and tables are not charged. Serverless querying leverages full potential of cloud elasticity in Azure and delivers the main promise of cloud data computing where you can forget about the resource constrains and just use your service when you need it.

Conclusion

The SQL serverless query engine in Azure Synapse Analytics is an excellent choice in the scenarios where you need to combine Transact-SQL language and big data analytics. Transact-SQL enables data analysts who are already familiar with SQL language to explore the Azure storage files with minimal required learning curve. Logical data warehouse on top of Azure storage is a relational interface that enables the tools and applications that can query databases to transparently access Azure storage using the proven security model. In addition, serverless consumption model enables you to forget about resource provisioning, manual scaling, and start querying the data and paying only for the resources you used. You can learn more about this service on Azure documentation and create your first workspace to see how it fits in your analytical solution.

Products (50)

Special Topics (27)

Video Hub (462)

Most Active Hubs