Best practices for integrating serverless SQL pool with Cosmos DB Analytical store via Synapse Link
Published Mar 16 2022

Serverless SQL pools enable you to implement near real-time analytics solutions on top of your Cosmos DB data. Combined with Synapse Link, they provide a cost-effective way to analyze NoSQL data stored in Cosmos DB: you can run heavy analytical queries on a serverless SQL pool without affecting the workload or the price of your main Cosmos DB transactional store. Serverless SQL pools let you use the T-SQL query language for analytics, so you can connect reporting and analytics tools (such as Power BI or Analysis Services) from the large ecosystem of tools that work with SQL Server or Azure SQL Database.

When you integrate serverless SQL pools into your solution, you need to apply some best practices. There are general best practices for serverless SQL pools in a Synapse Analytics workspace, but some of them do not apply to the Cosmos DB scenario, so you will likely use only a subset of them. This post covers only the best practices that you should apply in your Cosmos DB solution, plus some additional hints that can help you optimize it.

 

Collocate your resources

In your solution, you will have at least one Cosmos DB account with the analytical store enabled, a serverless SQL pool in a Synapse workspace, and clients such as Power BI or Analysis Services.


 

Make sure that your analytical store, serverless SQL pool, and clients (Power BI, Analysis Services) are placed in the same region. Placing some of these components in different regions causes cross-region data transfer with high latency. If your clients run on-premises or on your local computer, make sure that you have enough network bandwidth to transfer the data.

Note that Synapse Studio runs in your browser, so that client is on your local computer. Do not expect excellent performance from Synapse Studio – it is an easy-to-use IDE, but it is not designed for returning large amounts of data and displaying them in the browser.

 

Use native tools for retrieving a large amount of data

If you are reading a large amount of data from Cosmos DB, make sure that you have good bandwidth (ideally from a VM in Azure) and a tool that uses the latest native driver. SQL Server Management Studio (SSMS) or Azure Data Studio (ADS) are the preferred client tools for retrieving large amounts of data (Synapse Studio is easy to use, but retrieves data over the HTTP protocol on a standard internet connection).

 

Use UTF8 collation

Strings in Cosmos DB are stored as Unicode UTF-8 encoded strings. Therefore, you should always return strings as VARCHAR types with a UTF8 collation. Avoid the NVARCHAR type for strings, because every UTF-8 string would then have to be explicitly converted to UTF-16. The recommended representation of string values stored in Cosmos DB is the VARCHAR type with a UTF8 collation. Serverless SQL pools offer a choice of UTF8 collations, so you can control whether string comparison is case sensitive or insensitive, which language rules are used for sorting and comparison (Latin, German, Serbian), etc.

If you filter data on a string column, consider using the Latin1_General_100_BIN2_UTF8 collation on that column. This collation can boost your filters because it enables aggressive filtering and file/segment elimination in the analytical store. The downside is that it applies a case-sensitive binary comparison and ignores language-specific comparison rules (accents, equivalent digraphs, etc.).
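As a minimal sketch of both options (the account, container, and property names here are hypothetical, and Key is a placeholder for your account key):

-- Make a UTF-8 collation the default for the current serverless SQL database,
-- so that varchar columns without an explicit collation pick it up:
ALTER DATABASE CURRENT COLLATE Latin1_General_100_CI_AS_SC_UTF8;

-- Or specify the collation per column in the WITH clause; the BIN2_UTF8
-- collation on the filter column enables file/segment elimination:
SELECT *
FROM OPENROWSET(
        'CosmosDB',
        'Account=myaccount;Database=mydb;Key=<account key>',
        MyContainer
     ) WITH ( title  varchar(200) COLLATE Latin1_General_100_CI_AS_SC_UTF8 '$.title',
              city   varchar(50)  COLLATE Latin1_General_100_BIN2_UTF8     '$.city' ) AS docs
WHERE city = 'Oslo';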

Learn more about the UTF-8 collations in serverless SQL pools here.

 

Optimize your database schema

Creating an optimal database model with proper column types is crucial in every database. Serverless SQL pools work with remote data, so any improvement or mistake in the database design is amplified, which makes applying these best practices even more important than in a traditional database.

Make sure that you are using an optimal schema and types. Organize your tables/views in a star/snowflake model, pre-aggregate measures, avoid the VARCHAR(MAX) type for smaller strings, don't use NVARCHAR when you are working with UTF-8 encoded data, and use a BIN2_UTF8 collation on the string columns that you filter on.
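For illustration, here is a hedged sketch of a view that fixes a minimal, explicit schema over the analytical store (the account, container, and property names are hypothetical):

-- Client tools can query dbo.Orders like a regular table:
CREATE VIEW dbo.Orders AS
SELECT *
FROM OPENROWSET(
        'CosmosDB',
        'Account=myaccount;Database=mydb;Key=<account key>',
        Orders
     ) WITH ( order_id  varchar(36) '$.id',        -- exact sizes instead of varchar(8000)
              quantity  int         '$.quantity',
              price     float       '$.price',
              city      varchar(50) COLLATE Latin1_General_100_BIN2_UTF8 '$.city' ) AS rows;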

You can easily identify potential improvements or problems in your database schema using the Query Performance Insight (QPI) T-SQL library. You can install this library by executing this script. Once the script finishes, you can query a set of views that provide recommendations on how to optimize your schema or troubleshoot issues. You can find the recommendations by querying the qpi.recommendations view:
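-- Lists recommendations for optimizing your schema:
SELECT * FROM qpi.recommendations;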


 

Learn more about schema optimization in serverless SQL pools in this article, where you can also learn how to interpret the results of this query.

 

Optimize your queries

Make sure that your queries are optimized. Always use the WITH clause of the OPENROWSET function when you query Cosmos DB data. Without the WITH clause, OPENROWSET guesses the types and uses sub-optimal ones for Cosmos DB data – for example, the VARCHAR(8000) type for strings. Always explicitly specify the schema in the WITH clause with minimal types.

If you have properties with nested object types, use paths in the WITH clause instead of returning the objects as JSON and applying the JSON_VALUE function to extract the fields.

 

SELECT *
FROM OPENROWSET( … )
     WITH ( title        varchar(200)   '$.metadata.title',
            description  varchar(1000)  '$.metadata.description',
            author       varchar(100)   '$.metadata.author' ) AS docs;

 

Learn how to query nested objects with a serverless SQL pool here.

If you have a complex document, you can create T-SQL queries from a sample document – see generate Cosmos DB query by example.


 

On that page, you can simply paste one of the documents that you want to analyze, and it will propose an OPENROWSET function with a WITH clause that can parse the sample document.

If you use this page to generate queries, make sure that you optimize the generated types. For example, every string in the generated T-SQL query is represented as a VARCHAR(256) type; you will probably need to find the proper size for each property.

If your documents are complex and you need to parse nested array types, you will need to flatten the nested arrays using the CROSS/OUTER APPLY OPENJSON syntax.

 

SELECT *
FROM OPENROWSET( … ) WITH ( authors varchar(max) ) AS docs
     OUTER APPLY OPENJSON ( authors )
                 WITH ( name varchar(200), email varchar(200) ) AS authors;

 

You should return the Cosmos DB array as a varchar(max) column and then pass that column to the OPENJSON function. In the WITH clause of the OPENJSON function, you specify the fields of the array elements.

Paste a document with a nested array into the page that generates a Cosmos DB query by example to see how the query would look in this case. Note that every CROSS APPLY OPENJSON step adds performance overhead to query execution, so applying multiple OPENJSON calls in the same query might impact performance.
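For illustration, here is a hedged sketch of flattening a doubly-nested array (a hypothetical books array inside each author element) with two chained OPENJSON calls, each adding its own overhead:

SELECT *
FROM OPENROWSET( … ) WITH ( authors varchar(max) ) AS docs
     -- first level: one row per author
     OUTER APPLY OPENJSON ( authors )
                 WITH ( name  varchar(200),
                        books nvarchar(max) '$.books' AS JSON ) AS author
     -- second level: one row per book of each author
     OUTER APPLY OPENJSON ( author.books )
                 WITH ( title varchar(200) ) AS book;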

Sometimes you might need to work with arrays and return the index of each array element. Do not use something like the ROW_NUMBER function for this – the OPENJSON function can return the index of every array element.

 

SELECT *
FROM OPENROWSET( … ) WITH ( authors varchar(max) ) AS docs
     OUTER APPLY OPENJSON ( authors )
                 WITH ( author_id int '$.sql:identity()',
                        name varchar(200), email varchar(200) ) AS authors;

 

The column with the $.sql:identity() path after the column type will contain an autogenerated, zero-based index for every element of the array that the OPENJSON function parses.

See more details in this post.

 

Scale out if needed

A serverless SQL pool automatically assigns the compute power required for query execution. However, it has a compute limit per workspace that cannot be controlled. This might impact the performance of your concurrent queries, because some queries running in parallel might need to wait for others to finish and release compute before they can start. There are also workspace constraints, such as the amount of data that all concurrent queries can return at the same time – find more info on the Synapse documentation page. If you have a heavy parallel workload using the serverless pool in a workspace, it might hit these limits.

Note that the best practices explained in this post will maximize the performance of your workspace and enable a serverless SQL pool to successfully handle most queries. If you optimize the queries and minimize the types, your workload will execute with less compute power and less result streaming. However, even with the best possible performance, you can still send a number of queries too large to be handled by the compute power assigned to a single workspace.

You cannot manually scale up a serverless SQL pool to add more compute power or result-set size, so the only solution is to scale out by adding more workspaces.


 

 

Note that there can be only one serverless SQL pool per Synapse workspace. Therefore, you would need to create separate workspaces and explicitly configure your client applications to use specific workspaces. Note that the number of workspaces per subscription is limited – you can find more information on the Synapse documentation page.

There is no additional cost for new workspaces. A workspace is only charged when a query is executed, so there is no cost difference between executing five queries on a single workspace and spreading them across three workspaces.

 

Conclusion

Synapse Link for Cosmos DB is a cost-effective solution for near real-time analytics on Cosmos DB data. It enables any reporting/analytics tool in the T-SQL ecosystem to connect to Cosmos DB data and build reports without affecting the cost of the main Cosmos DB transactional database.

 

By following the best practices described in this article, you will optimize the performance of your queries and improve the experience in your tools.

A more exhaustive list of best practices can be found here, but the ones described in this article are the most important optimizations to apply. If you still experience performance issues, learn how to troubleshoot performance issues on serverless SQL pools.
