Azure Synapse Analytics is a limitless data analytics service that enables you to analyze data on Azure Data Lake storage. It provides managed Apache Spark and T-SQL engines (provisioned and serverless) for analyzing data.
In this article, you will see how you can create a table that references data on external Azure Data Lake storage in order to enable the client applications such as Power BI to create reports on Data Lake information. The data set is placed on Azure storage and formatted as parquet, but client applications can access these data as any other table without need to know where is stored actual data.
To try this sample, you need to have Azure Synapse Analytics workspace. If you don’t have one you can easily deploy it using Azure portal or this Deployment template. Workspace automatically deploys one serverless Synapse SQL endpoint that is everything we need for this kind of analysis. With the workspace you are also getting easy-to-use Web UI called Synapse Studio that enables you to start analyzing the files directly from your browser.
NOTE: You need Synapse SQL serverless (on-demand) query endpoint to execute the code in this article.
As a first step you need to configure data source and specify file format of remotely stored files.
CREATE EXTERNAL DATA SOURCE ecdc_cases WITH (
LOCATION = 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/'
CREATE EXTERNAL FILE FORMAT ParquetFormat WITH ( FORMAT_TYPE = PARQUET );
First statement creates data source that references ECDC COVID data set, while the second specifies parquet file format.
Exploring file schema
Now you need to determine what are the columns in the external files and what are their types. In the previous article you might have seen that OPENROWSET function enables you to quickly explore data in the files placed on Azure storage. We can also use sp_describe_first_result_set function to identify the schema that will be used for the table by providing the OPENROWSET data exploration query to this procedure:
The most important columns are name and system_type_name that we can use to create schema of external table that references this file. The easiest way to use these information to create a table is to export data in Excel, hide the columns between name and system_type_name and copy the values:
Creating an external table
Now we have all elements required to create a table. I will create one schema (this is optional but recommended), paste the results that I copied from Excel, and reference data_source/file_format that I created in the previous step:
The only additional information that I need to provide is the location of the files that this external table references.
IMPORTANT: You need to run this statement in some database other than master! Master database don’t allow you to create objects (schemas, tables).
Now, we can read the content of external files using this table:
The query is executed via Synapse Studio, but I can use any other tool like PowerBI, Excel, SSMS, ADS to query my external table.
External tables in Azure Synapse SQL query engine represent logical relational adapter created on top of externally stored files that can be used by any application that use TSQL to query data. This way you can build a Logical Data Warehouse on top of your data stored in Azure Data Lake without need to load data in standard relational table. Azure Synapse SQL Logical Data Warehouse enables you to represent external data sources as standard tables and let you analytic/reporting applications access any data without need to know where the data is place and hot to parse the data structure.