Author(s): Ryan Adams is a Program Manager on the Azure Synapse Customer Success Engineering (CSE) team.
There are two ways to read data in your data lake using the Synapse Serverless SQL engine. In this article, we'll look at the second method, which uses an external table to query a path within the lake. To learn how to use the OPENROWSET command, check out part 1 of this article series.
Synapse is a collection of tools with four different analytical engines (Dedicated Pool, Spark Pool, Serverless Pool, Data Explorer Pool). This gives you a lot of options for ingesting, transforming, storing, and querying your data. Here you will use the Synapse Serverless Pool to query the data in your ADLS account.
Create a data source
Go to the Develop Hub and click the ellipsis that appears when you hover your cursor over the SQL Scripts heading. That will open a drop-down menu where you can select “New SQL Script”.
There are three major components we need to create before we can query our table: a Data Source, a File Format, and the External Table itself. Enter the code below into your script, and make sure the "Connect to" box shows "Built-in" so that the serverless pool is used. You also need to select, in the "Use database" box, the database where the metadata for everything we are about to create will be stored. This first part creates the Data Source, which defines the location in the data lake where the files to be queried are stored.
/* Create an external data source. You only do this once. */
CREATE EXTERNAL DATA SOURCE adls_temperature
WITH (
    LOCATION = 'https://MyLake.dfs.core.windows.net/iot/bronze/'
);
Create file format
Next, we need to create a file format, which tells the engine what format to expect for our files. In this example the files are in Parquet format, but you could also use the DELIMITEDTEXT format for CSV-style files. Quite a few more options are required for CSV files, so check out the full documentation for creating an external file format.
/* Create an external file format. You only do this once. */
CREATE EXTERNAL FILE FORMAT parquet_file_format
WITH (
    FORMAT_TYPE = PARQUET
);
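For comparison, a file format for CSV-style files might look like the sketch below. The format name and option values here are illustrative assumptions, not part of the walkthrough above; see the documentation for the full list of DELIMITEDTEXT options.

```sql
/* Hypothetical example: an external file format for CSV files.
   The name and option values are illustrative. */
CREATE EXTERNAL FILE FORMAT csv_file_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',   -- column separator
        STRING_DELIMITER = '"',   -- quote character for strings
        FIRST_ROW = 2             -- skip a header row
    )
);
```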
Create external table
Now we can finally create our table using the CREATE EXTERNAL TABLE command. In addition to defining our columns as we normally would when creating a table, we also have a WITH clause. Here we supply the names we used above for the Data Source and File Format, along with the Location. The root (starting) folder is what we defined in the Data Source, so the Location is a path relative to that root folder. The two asterisks at the end of the location path indicate that every subfolder in that path should be included. If you do not want to include subfolders, simply remove the asterisks.
/* Create an external table. You only do this once.
   The column types below are illustrative; match them to your data. */
CREATE EXTERNAL TABLE Temperature (
    temp1 FLOAT,
    temp2 FLOAT,
    [ArrivalTime] DATETIME2
)
WITH (
    LOCATION = 'temperature_raw/**',
    DATA_SOURCE = adls_temperature,
    FILE_FORMAT = parquet_file_format
);
Query External Table
Our setup is now complete, and our table is ready to query. Note that the Data Source can be reused with a different relative path for other folders under that root. The File Format can also be reused for Parquet files. Now it’s time to query our new External Table.
/* Query the new external table you created. All data is still stored in ADLS */
SELECT temp1, temp2, [ArrivalTime] FROM Temperature
ORDER BY [ArrivalTime] DESC;
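To illustrate the reuse mentioned above, here is a sketch of a second external table that shares the same Data Source and File Format but points at a different relative path. The folder name and columns are hypothetical assumptions for illustration only.

```sql
/* Hypothetical example: reuse the existing data source and file format
   for another folder under the same root. The folder, table, and
   column names are illustrative. */
CREATE EXTERNAL TABLE Humidity (
    humidity1 FLOAT,
    [ArrivalTime] DATETIME2
)
WITH (
    LOCATION = 'humidity_raw/**',       -- different relative path, same root
    DATA_SOURCE = adls_temperature,     -- reused data source
    FILE_FORMAT = parquet_file_format   -- reused file format
);
```

Because the metadata objects are shared, each additional table only needs its own column list and relative path.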
In this article we learned how to use the Synapse Serverless engine to query data in our data lake. This is one of two ways to perform this task. The first method uses the OPENROWSET command and is great for ad hoc queries. The method described in this article persists the configuration metadata for easier reuse: you no longer need to remember the path or format and can simply query the table like any other table in your database. To continue learning about serverless, check out our documentation on Best Practices for Serverless SQL Pool.
Our team publishes blogs regularly; you can find all of them here: https://aka.ms/synapsecseblog
For a deeper understanding of Synapse implementation best practices, check out the Success by Design (SBD) site: https://aka.ms/Synapse-Success-By-Design