The purpose of this post is to discuss managed and external tables when querying them from SQL on-demand, also known as serverless SQL pool.
Thanks to my colleague Dibakar Dharchoudhury for the really nice discussion related to this subject.
From the docs: Shared metadata tables - Azure Synapse Analytics | Microsoft Docs
Spark provides many options for how to store data in managed tables, such as TEXT, CSV, JSON, JDBC, PARQUET, ORC, HIVE, DELTA, and LIBSVM. These files are normally stored in the warehouse directory where managed table data is stored.
Spark also provides ways to create external tables over existing data, either by providing the LOCATION option or using the Hive format. Such external tables can be over a variety of data formats, including Parquet.
Azure Synapse currently only shares managed and external Spark tables that store their data in Parquet format with the SQL engines.
Note "The Spark created, managed, and external tables are also made available as external tables with the same name in the corresponding synchronized database in serverless SQL pool."
The following are examples of managed and external tables created on Spark:
1) Configure access to the storage account through the linked service:

blob_account_name = "StorageAccount"
blob_container_name = "ContainerName"

from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()

# Retrieve the SAS token for the storage account from the linked service
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
blob_sas_token = token_library.getConnectionString("LinkedServiceName")

# Register the SAS token so Spark can read from the blob container
spark.conf.set(
    'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
    blob_sas_token)
Note my linked service configuration:
2) Managed table (I updated this information after the original post):
spark.sql('CREATE DATABASE IF NOT EXISTS SeverlessDB')
# Read the source data and save it as a managed Spark table.
# 'filepath' is a placeholder for the path to your source Parquet files,
# and the table name below is illustrative.
df = spark.read.load(filepath, format='parquet')
df.write.mode("overwrite").saveAsTable("SeverlessDB.managedtable")
3) I can also create a managed table stored as Parquet, using the same dataset as in the previous step, as follows:
# Managed table stored as Parquet
df.write.format("parquet").saveAsTable("SeverlessDB.myparquettable")
Query it from the serverless SQL pool:
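As a minimal sketch of that query, assuming the metadata has already synchronized to the serverless SQL pool (the Spark database shows up there under the same name, and the table is exposed under the dbo schema):

```sql
-- Run from the serverless SQL endpoint, not the Spark pool
USE SeverlessDB;
SELECT TOP 10 * FROM dbo.myparquettable;
```

Synchronization is not instantaneous, so a table created in Spark may take a short while to appear on the serverless side.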
Following the documentation, this is another way to achieve the same result for a managed table; however, in this case the table will be empty:
CREATE TABLE SeverlessDB.myparquettable(id int, name string, birthdate date) USING Parquet
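Since a table created this way starts empty, it can be populated with a regular Spark SQL INSERT; a quick sketch (the values are illustrative):

```sql
INSERT INTO SeverlessDB.myparquettable VALUES
    (1, 'John', DATE '1990-01-01');
```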
4) External table:
CREATE TABLE SeverlessDB.myexternalparquettable
USING Parquet
LOCATION '/path/to/parquet/folder' -- placeholder: point this to your existing Parquet data
These are the commands, per the documentation, for creating managed and external tables on Spark that can then be queried from SQL serverless.
If you check the path where your managed table was created, you will see it in the data lake as follows. For example, my workspace name is synapseworkspace12:
5) You can use the DESCRIBE FORMATTED command to check the type of your tables (the Type field shows MANAGED or EXTERNAL). For example:
DESCRIBE FORMATTED SeverlessDB.myexternalparquettable
If you want to clean up this lab, run the following Spark SQL:
-- Drop the database and its tables
DROP DATABASE SeverlessDB CASCADE
That is it!