Learn about STRING_AGG, OFFSET/FETCH, and PIVOT/UNPIVOT and other features in serverless SQL pool

Published Dec 15 2020 08:58 AM 2,844 Views
Microsoft

Serverless Synapse SQL pools in Azure Synapse Analytics have a new set of features that will enable you to analyze your Azure data more efficiently. The new Transact-SQL (T-SQL) language features that you can use in serverless SQL pools are STRING_AGG, OFFSET/FETCH, PIVOT/UNPIVOT, SESSION_CONTEXT, and CONTEXT_INFO.

In this article, you can learn about the scenarios where you can use these features to analyze your Azure data. We will use the European Centre for Disease Prevention and Control (ECDC) COVID-19 Cases data set to showcase these features. The ECDC data set is placed on a publicly available Azure Data Lake storage.

 

Easy pagination with OFFSET/FETCH

Pagination is one common feature that must be implemented in various reports and applications. The T-SQL language has the OFFSET and FETCH clauses that enable you to limit the number of rows returned by a query. The following query reads data from the ECDC data set, ignores the first 30 rows, and returns 10 rows on the 4th 10-row page:

select countries_and_territories, year, month, day, cases
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
    format = 'parquet') as rows
order by year, month, day 
OFFSET 30 ROWS
FETCH NEXT 10 ROWS ONLY;

The OFFSET and FETCH clauses can simplify the queries that you need to write to create paginated reports.

 

Concatenates the values in a group using STRING_AGG

The STRING_AGG aggregate enables you to concatenate the values that you grouped by some criterion and return them as a list. The values in the list are separated using a separator character that is defined in the STRING_AGG aggregate.

The following query will group information from the ECDC data set by country, year, and month, and then return a comma-separated list of the cases reported within each group:

select top 10  geo_id, year, month, cases = string_agg(cases, ',') 
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
    format = 'parquet') as rows
group by geo_id, year, month;

JovanPop_0-1608051022986.png

 

You can combine STRING_AGG with other functions to build some custom output format. As an example, the following query returns the daily cases formatted as JSON (Java Script Object Notation) documents:

select top 10 geo_id, year, month, cases = '{' + string_agg(concat('"',day,'":',cases), ',')
                        within group (order by day asc) + '}'
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
    format = 'parquet') as rows
group by geo_id, year, month;

JovanPop_1-1608051023002.png

 

Transform rows using pivot/unpivot

The PIVOT and UNPIVOT operators transpose rows and columns from a result set. They might merge a set of rows into a single row or expand a single row as a set of rows.

The PIVOT operator converts every row from a result set into multiple rows with a reduced number of columns. As an example, the PIVOT operator can be applied on a result set that returns the daily reported COVID cases from the ECDC data set. In this data set, each row holds the number of new cases reported per day and per country:

JovanPop_2-1608051023011.png

Let’s imagine that we want to transform the groups of rows that hold data in the first 7 days in a month and return them as single rows per group. The output rows should have these values placed in seven separate columns. The values that are identical across the rows in the group (countries_and_territories, year, and month columns) will be returned as the cells in a single output row. Seven values in the cases column will be returned as seven additional columns in the output row. The names of the columns will match the values in day column (in this example - [1], [2], [3], [4], [5], [6], and [7]):

with cases as ( 
    select countries_and_territories, year, month, day, cases
    from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
    format = 'parquet') as rows
)
select top 10 *
from cases
PIVOT (
      SUM (cases) FOR day IN ( [1], [2], [3], [4], [5], [6], [7])  
) AS months  

JovanPop_3-1608051023017.png

 

The UNPIVOT operator could be applied on a result set where each row has multiple columns like the number of cases and the number of deaths in the ECDC data set:

JovanPop_4-1608051023024.png

Let’s imagine that we don’t want to return information about the cases and deaths in the separate columns of a single row. Instead, we would like to have separate rows where one holds the number of reported cases and another the number of fatal cases. The UNPIVOT operator will take each row and return two rows where the column names will be placed in a single column that represents property names and the cell values will be placed in the separate columns:

with cases as ( 
    select countries_and_territories, year, month, day, cases, deaths
    from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
    format = 'parquet') as rows
    where continent_exp = 'Europe'
    and year = 2020
)
select top 100 countries_and_territories, year, month, day, property, value
from cases
UNPIVOT (
      value FOR property IN ( [cases], [deaths] )  
) AS unpivoted  

JovanPop_5-1608051023036.png

Instead of a set of columns per each row in the original result set, there are two new columns:

  • Property column that contains the name of the cell in the original row.
  • Value column that contains the value of the cell in the original row.

 

Session context

Session context system function enables you to store key:value pairs within the SQL session. These values can be used in the predicate of some views or stored procedures:

CREATE OR ALTER VIEW cases
AS SELECT * FROM openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet',
    format = 'parquet') as rows
    WHERE continent_exp = CAST(SESSION_CONTEXT(N'continent') AS VARCHAR(8000))

    

To get the cases reported in Europe, you can set the continent value to ‘Europe’ using the following example:

exec sp_set_session_context 'continent', 'Europe';

SELECT TOP 10 * FROM cases

To return the cases reported in Asia you can change the values of the session context variable and read data without changing the view:

exec sp_set_session_context 'continent', 'Asia' 

SELECT TOP 10 * FROM cases

The view will return different results based on the context that the caller set. CONTEXT_INFO is a similar session-level variable that can contain a single value.

Conclusion

Serverless Synapse SQL pools in Azure Synapse Analytics provide a rich surface area and enables data analysts with SQL skills to easily analyze Azure data. New features that are available in T-SQL language enables you to analyze data and create more powerful reports more easily. You can find more information about the features that are available in Synapse SQL pools in this article.

Version history
Last update:
‎Dec 15 2020 08:58 AM
Updated by: