Loading CosmosDB and Dataverse data into a dedicated SQL pool (DW) using the Synapse Link
Published Feb 02 2022

Serverless SQL pools enable you to read data from CosmosDB or Dataverse analytical storage using the Synapse Link. Analytical storage is an internal read-only copy of CosmosDB/Dataverse data designed for analytics purposes.

[Image: Synapse Link — serverless SQL and Spark pools reading the CosmosDB/Dataverse analytical store]

 

The analytical store is a cost-effective solution for near-real-time analytics that enables you to read data from the internal columnar storage where CosmosDB and Dataverse are continuously exporting a copy of the operational data. Only some services, such as serverless SQL pools and Apache Spark pools in Synapse workspace, can use Synapse Link to directly access CosmosDB/Dataverse analytical storage.

At the time of writing this article, the dedicated SQL pool doesn’t have the ability to read data from CosmosDB/Dataverse using the Synapse Link. There are scenarios where you need to use CosmosDB data in your dedicated SQL pool, so you need to find a way to load that data. In theory, you could create an ADF pipeline that reads data from CosmosDB or Dataverse and stores it in the dedicated SQL pool as a target. This might be a problem if your pipeline reads data directly from the CosmosDB account, because it might impact both operational workload performance and cost. The analytical storage is the recommended location that you should use to fetch all data from CosmosDB/Dataverse.

In this post, I will describe how to use a two-step approach where you export your data using the serverless SQL pool via Synapse link into Azure Data Lake storage, and then load data into the dedicated SQL pool table. This process is shown in the following figure:

 

[Image: two-step process — export from CosmosDB/Dataverse to Azure Data Lake Storage with the serverless SQL pool, then load into the dedicated SQL pool]

There are two main paths in this process:

  1. Exporting data from CosmosDB or Dataverse into Azure Data Lake Storage account using the CETAS functionality in the serverless SQL pool. Note that you can do the same action with Apache Spark pools in Synapse Analytics.
  2. Loading exported data from ADLS into a dedicated pool using COPY INTO command. As an alternative, you could use ADF to load the exported files from the storage into the dedicated pool.

In this article, I will show how to copy data from CosmosDB analytical storage, but a similar script could be used for Dataverse.

 

Reading CosmosDB data using the serverless SQL pool

As a first step, we need to read data using the Synapse Link.

  1. Create a server-level or database-scoped credential that contains the CosmosDB read-only key that will be used to read data.
  2. Read sample data from CosmosDB using the OPENROWSET function.

The T-SQL code is shown in the following picture:

[Image: T-SQL script that creates the credential and reads sample CosmosDB data with OPENROWSET]
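In text form, that script looks roughly like the one below (the read-only key is a placeholder here; the full script with the sample key is listed at the end of this post):

CREATE CREDENTIAL [synapselink-cosmosdb-sqlsample]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = '<CosmosDB read-only key>'
GO

SELECT TOP 10 *
FROM OPENROWSET(PROVIDER = 'CosmosDB',
                CONNECTION = 'Account=synapselink-cosmosdb-sqlsample;Database=covid',
                OBJECT = 'Ecdc',
                SERVER_CREDENTIAL = 'synapselink-cosmosdb-sqlsample') AS [Ecdc]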

 

If you can fetch 10 records via the Synapse Link, you are ready to export the data to ADLS.

 

Exporting CosmosDB into Azure Data Lake storage

Now you can use the CETAS statement to export the results of the SELECT query to the Azure Data Lake Storage account.

  1. Make sure that your account has permission to write into ADLS or get a SAS key that enables writing data into ADLS.
  2. Create a Parquet file format that will be used to specify how to export the data (for example, Parquet with Snappy compression).
  3. Create an external data source that is pointing to the location where you want to export the results. Don’t pre-create the folder – just reference the folder location where CETAS will create the leaf folder with the exported results.
    1. The data source should have a credential if you cannot use your Azure AD identity to write to storage.
  4. Execute the CETAS command that will store the results of the query that reads data from the Analytical storage on the external data source location.

The sample script is shown in the following picture:

[Image: T-SQL script that exports the CosmosDB data to ADLS with CETAS]
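In text form, the export consists of the external data source, the Parquet file format, and the CETAS statement itself (the complete script, including the cleanup, is listed at the end of this post). The cetas container and the cosmosdb-covid folder are just the names used in this sample:

CREATE EXTERNAL DATA SOURCE LinkExportLocation WITH (
    LOCATION = 'https://<storage account name>.dfs.core.windows.net/cetas/'
    --, CREDENTIAL = --> optionally, a credential with write permission
);
GO
CREATE EXTERNAL FILE FORMAT [ParquetFF] WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
GO
CREATE EXTERNAL TABLE [dbo].[CovidCETAS] WITH (
    LOCATION = 'cosmosdb-covid/',
    DATA_SOURCE = LinkExportLocation,
    FILE_FORMAT = [ParquetFF]
) AS
SELECT * FROM OPENROWSET(PROVIDER = 'CosmosDB',
                         CONNECTION = 'Account=synapselink-cosmosdb-sqlsample;Database=covid',
                         OBJECT = 'Ecdc',
                         SERVER_CREDENTIAL = 'synapselink-cosmosdb-sqlsample') AS [Ecdc]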

The CETAS statement will read all rows from the CosmosDB container and store the results as Parquet files with snappy compression.

 

Loading ADLS data into the dedicated pool

As a final step, you need to go to your dedicated SQL pool and load data from the folder that you created in the previous step. If you are using Synapse Studio, you can easily generate the Bulk load statement:

[Image: generating the bulk load statement for the exported folder in Synapse Studio]

 

Just select the ADLS folder where you exported the data, generate the T-SQL script, and add the pool, database, and table information. If you load the data using a SQL script, you will see that this action generates a COPY INTO command similar to the one in the following sample:

 

 

COPY INTO dbo.covid
(_rid 1, _etag 2, _ts 3, month 4, day 5, date_rep 6, year 7, countries_and_territories 8, load_date 9, cases 10, deaths 11, geo_id 12, country_territory_code 13, continent_exp 14, iso_country 15, id 16)
FROM 'https://<storage account name>.dfs.core.windows.net/cetas/cosmosdb-covid'
WITH
(
    FILE_TYPE = 'PARQUET'
    ,MAXERRORS = 0
    ,COMPRESSION = 'snappy'
)

 

This action will analyze the schema of the exported Parquet files, create the target table (with a clustered columnstore index if you selected this option), and then generate the COPY INTO statement that loads data into the table.
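In this sample, the generated table definition looks like the one below (shortened here; the full column list is in the scripts at the end of this post):

CREATE TABLE dbo.covid
(
    [_rid] nvarchar(4000),
    [_etag] nvarchar(4000),
    [_ts] bigint,
    [date_rep] nvarchar(4000),
    [cases] bigint,
    [deaths] bigint,
    -- ...remaining columns omitted here; see the full script below
    [id] nvarchar(4000)
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
GO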

As a final step, you can read data from the dedicated SQL pool table where you loaded the data.

[Image: querying the loaded data in the dedicated SQL pool table]
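The query behind that picture is just a quick check against the loaded table:

SELECT TOP 100 * FROM dbo.covid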

 

From this point you can use the data that you loaded from the CosmosDB container in the dedicated SQL pool table.

 

Conclusion

Although the dedicated SQL pool cannot access CosmosDB and Dataverse using the Synapse Link, you can use serverless SQL pools (or Spark) to get the data via the Synapse Link, export it into Azure Data Lake storage, and import it into the dedicated pool.

In the following sections you can find the T-SQL scripts used in this post. Most of the scripts were automatically generated using Synapse Studio.

Exporting into Azure Data Lake storage

 

-- Server-level credential containing the CosmosDB read-only key
CREATE CREDENTIAL [synapselink-cosmosdb-sqlsample]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
SECRET = 's5zarR2pT0JWH9k8roipnWxUYBegOuFGjJpSjGlR36y86cW0GQ6RaaG8kGjsRAQoWMw1QKTkkX8HQtFpJjC8Hg=='

-- Read sample rows from the CosmosDB analytical store via the Synapse Link
SELECT TOP 100 *
FROM OPENROWSET(PROVIDER = 'CosmosDB',
                CONNECTION = 'Account=synapselink-cosmosdb-sqlsample;Database=covid',
                OBJECT = 'Ecdc',
                SERVER_CREDENTIAL = 'synapselink-cosmosdb-sqlsample'
) AS [Ecdc]
GO



-- External data source pointing to the ADLS export location
CREATE EXTERNAL DATA SOURCE LinkExportLocation WITH (
    LOCATION = 'https://<storage account name>.dfs.core.windows.net/cetas/'
    --, CREDENTIAL = --> optionally, create a credential with write permission
);
GO
-- Parquet file format with Snappy compression used for the export
CREATE EXTERNAL FILE FORMAT [ParquetFF] WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);
GO
-- CETAS: export the CosmosDB container into the cosmosdb-covid folder
CREATE EXTERNAL TABLE [dbo].[CovidCETAS] WITH (
        LOCATION = 'cosmosdb-covid/',
        DATA_SOURCE = LinkExportLocation,
        FILE_FORMAT = [ParquetFF]
) AS
SELECT * FROM OPENROWSET(PROVIDER = 'CosmosDB',
                         CONNECTION = 'Account=synapselink-cosmosdb-sqlsample;Database=covid',
                         OBJECT = 'Ecdc',
                         SERVER_CREDENTIAL = 'synapselink-cosmosdb-sqlsample' ) AS [Ecdc]
GO
-- Drop the external table metadata; the exported files remain on the storage
DROP EXTERNAL TABLE [dbo].[CovidCETAS]


 

 

 

Loading into dedicated SQL pool

 

IF NOT EXISTS (SELECT * FROM sys.objects O JOIN sys.schemas S ON O.schema_id = S.schema_id WHERE O.NAME = 'covid' AND O.TYPE = 'U' AND S.NAME = 'dbo')
CREATE TABLE dbo.covid
    (
     [_rid] nvarchar(4000),
     [_etag] nvarchar(4000),
     [_ts] bigint,
     [month] bigint,
     [day] bigint,
     [date_rep] nvarchar(4000),
     [year] bigint,
     [countries_and_territories] nvarchar(4000),
     [load_date] nvarchar(4000),
     [cases] bigint,
     [deaths] bigint,
     [geo_id] nvarchar(4000),
     [country_territory_code] nvarchar(4000),
     [continent_exp] nvarchar(4000),
     [iso_country] nvarchar(4000),
     [id] nvarchar(4000)
    )
WITH
    (
     DISTRIBUTION = ROUND_ROBIN,
     CLUSTERED COLUMNSTORE INDEX
     -- HEAP
    )
GO

--Uncomment the 4 lines below to create a stored procedure for data pipeline orchestration
--CREATE PROC bulk_load_covid
--AS
--BEGIN
COPY INTO dbo.covid
(_rid 1, _etag 2, _ts 3, month 4, day 5, date_rep 6, year 7, countries_and_territories 8, load_date 9, cases 10, deaths 11, geo_id 12, country_territory_code 13, continent_exp 14, iso_country 15, id 16)
FROM 'https://<storage account name>.dfs.core.windows.net/cetas/cosmosdb-covid'
WITH
(
    FILE_TYPE = 'PARQUET'
    ,MAXERRORS = 0
    ,COMPRESSION = 'snappy'
)
--END
GO

SELECT TOP 100 * FROM dbo.covid
GO

 



 
