Serverless SQL pools enable you to read data from CosmosDB or Dataverse analytical storage using Synapse Link. Analytical storage is an internal, read-only copy of CosmosDB/Dataverse data designed for analytics purposes.
The analytical store is a cost-effective solution for near-real-time analytics: CosmosDB and Dataverse continuously export a copy of the operational data into internal columnar storage, and you read from that copy instead of the transactional store. Only some services, such as serverless SQL pools and Apache Spark pools in a Synapse workspace, can use Synapse Link to directly access CosmosDB/Dataverse analytical storage.
At the time of writing this article, the dedicated SQL pool cannot read data from CosmosDB/Dataverse using Synapse Link. There are scenarios where you need CosmosDB data in your dedicated SQL pool, so you need another way to load it. In theory, you could create an ADF pipeline that reads data from CosmosDB or Dataverse and stores it in the dedicated SQL pool as a target. This can be a problem if your pipeline reads directly from the CosmosDB account, because it might impact both operational workload performance and cost. The analytical store is the recommended location to fetch all data from CosmosDB/Dataverse.
In this post, I will describe a two-step approach: export your data via Synapse Link into Azure Data Lake storage using the serverless SQL pool, and then load the exported data into a dedicated SQL pool table.
There are two main steps in this process:
1. Export: the serverless SQL pool reads CosmosDB/Dataverse analytical storage via Synapse Link and writes the result set to Azure Data Lake storage as Parquet files (using CETAS).
2. Import: the dedicated SQL pool loads the exported Parquet files using the COPY INTO statement.
In this article, I will show how to copy data from CosmosDB analytical storage, but a similar script could be used for Dataverse.
As a first step, we need to read data using Synapse Link.
The T-SQL query is shown below; it relies on a server-level credential for the CosmosDB account, which is created in the scripts section at the end of this post:
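SELECT TOP 10 *
FROM OPENROWSET(PROVIDER = 'CosmosDB',
                CONNECTION = 'Account=synapselink-cosmosdb-sqlsample;Database=covid',
                OBJECT = 'Ecdc',
                SERVER_CREDENTIAL = 'synapselink-cosmosdb-sqlsample'
) AS [Ecdc]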
If you can fetch 10 records via Synapse Link, you are ready to export the data to ADLS.
Now you can use the CETAS (CREATE EXTERNAL TABLE AS SELECT) statement to export the results of the SELECT query to the Azure Data Lake Storage account.
The sample script is shown below. It assumes that the external data source (LinkExportLocation) and the Parquet file format (ParquetFF) have already been created; their definitions are included in the scripts section at the end of this post:
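CREATE EXTERNAL TABLE [dbo].[CovidCETAS] WITH (
    LOCATION = 'cosmosdb-covid/',
    DATA_SOURCE = LinkExportLocation,
    FILE_FORMAT = [ParquetFF]
) AS
SELECT * FROM OPENROWSET(PROVIDER = 'CosmosDB',
    CONNECTION = 'Account=synapselink-cosmosdb-sqlsample;Database=covid',
    OBJECT = 'Ecdc',
    SERVER_CREDENTIAL = 'synapselink-cosmosdb-sqlsample') AS [Ecdc]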
The CETAS statement will read all rows from the CosmosDB container and store the results as Parquet files with snappy compression.
The last step is to switch to your dedicated SQL pool and load the data from the folder created in the previous step. If you are using Synapse Studio, you can easily generate the bulk load statement:
Just select the ADLS folder where you exported the data, generate the T-SQL script, and add the pool, database, and table information. If you load data using a SQL script, you will see that this action generated a COPY INTO command similar to the one in the following sample:
COPY INTO dbo.covid
(_rid 1, _etag 2, _ts 3, month 4, day 5, date_rep 6, year 7, countries_and_territories 8, load_date 9, cases 10, deaths 11, geo_id 12, country_territory_code 13, continent_exp 14, iso_country 15, id 16)
FROM 'https://<storage account name>.dfs.core.windows.net/cetas/cosmosdb-covid'
WITH
(
FILE_TYPE = 'PARQUET'
,MAXERRORS = 0
,COMPRESSION = 'snappy'
)
This action will analyze the schema of the exported Parquet files, create the target table (with a clustered columnstore index, if you selected that option), and then generate the COPY INTO statement that loads data into the table.
As a final step, you can read data from the dedicated SQL pool table where you loaded it and verify the load. A simple verification query (also included in the scripts section) is shown below:
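SELECT TOP 100 * FROM dbo.covid
From this point, you can use the data loaded from the CosmosDB container like any other table in the dedicated SQL pool.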
Although the dedicated SQL pools cannot access CosmosDB and Dataverse using Synapse link, you can use serverless SQL pools (or Spark) to get the data via Synapse Link, export it into Azure Data Lake storage, and import it into the dedicated pool.
In the following section you can find the T-SQL scripts used in this post. Most of the scripts were automatically generated by Synapse Studio.
-- Create a server-level credential that stores the CosmosDB account key
-- (the key shown here is the read-only key of the public sample account used in the Microsoft samples)
CREATE CREDENTIAL [synapselink-cosmosdb-sqlsample]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 's5zarR2pT0JWH9k8roipnWxUYBegOuFGjJpSjGlR36y86cW0GQ6RaaG8kGjsRAQoWMw1QKTkkX8HQtFpJjC8Hg=='
GO
-- Smoke test: read a sample of rows from the CosmosDB analytical store via Synapse Link
SELECT TOP 100 *
FROM OPENROWSET(PROVIDER = 'CosmosDB',
CONNECTION = 'Account=synapselink-cosmosdb-sqlsample;Database=covid',
OBJECT = 'Ecdc',
SERVER_CREDENTIAL = 'synapselink-cosmosdb-sqlsample'
) AS [Ecdc]
GO
-- External data source pointing to the ADLS container where CETAS will write the exported files
CREATE EXTERNAL DATA SOURCE LinkExportLocation WITH (
LOCATION = 'https://<storage account name>.dfs.core.windows.net/cetas/'
--, CREDENTIAL = --> optionally, create a credential with write permission
);
GO
-- Parquet file format with Snappy compression, used by the CETAS statement
CREATE EXTERNAL FILE FORMAT [ParquetFF] WITH (
FORMAT_TYPE = PARQUET,
DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
GO
-- CETAS: read all rows from the CosmosDB container and export them as Parquet files
CREATE EXTERNAL TABLE [dbo].[CovidCETAS] WITH (
LOCATION = 'cosmosdb-covid/',
DATA_SOURCE = LinkExportLocation,
FILE_FORMAT = [ParquetFF]
) AS
SELECT * FROM OPENROWSET( PROVIDER = 'CosmosDB',
CONNECTION = 'Account=synapselink-cosmosdb-sqlsample;Database=covid',
OBJECT = 'Ecdc',
SERVER_CREDENTIAL = 'synapselink-cosmosdb-sqlsample' ) AS [Ecdc]
GO
-- Drop the external table metadata after the export; the exported Parquet files remain in ADLS
DROP EXTERNAL TABLE [dbo].[CovidCETAS]
GO
-- The scripts below run in the dedicated SQL pool: create the target table if it does not exist
IF NOT EXISTS (SELECT * FROM sys.objects O JOIN sys.schemas S ON O.schema_id = S.schema_id WHERE O.NAME = 'covid' AND O.TYPE = 'U' AND S.NAME = 'dbo')
CREATE TABLE dbo.covid
(
[_rid] nvarchar(4000),
[_etag] nvarchar(4000),
[_ts] bigint,
[month] bigint,
[day] bigint,
[date_rep] nvarchar(4000),
[year] bigint,
[countries_and_territories] nvarchar(4000),
[load_date] nvarchar(4000),
[cases] bigint,
[deaths] bigint,
[geo_id] nvarchar(4000),
[country_territory_code] nvarchar(4000),
[continent_exp] nvarchar(4000),
[iso_country] nvarchar(4000),
[id] nvarchar(4000)
)
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
-- HEAP
)
GO
--Uncomment the 4 lines below to create a stored procedure for data pipeline orchestration
--CREATE PROC bulk_load_covid
--AS
--BEGIN
COPY INTO dbo.covid
(_rid 1, _etag 2, _ts 3, month 4, day 5, date_rep 6, year 7, countries_and_territories 8, load_date 9, cases 10, deaths 11, geo_id 12, country_territory_code 13, continent_exp 14, iso_country 15, id 16)
FROM 'https://<storage account name>.dfs.core.windows.net/cetas/cosmosdb-covid'
WITH
(
FILE_TYPE = 'PARQUET'
,MAXERRORS = 0
,COMPRESSION = 'snappy'
)
--END
GO
-- Verify the loaded data
SELECT TOP 100 * FROM dbo.covid
GO