Azure Synapse Analytics Blog

6 MIN READ

Linear regression analysis on COVID data using SQL endpoint in Azure Synapse Analytics

Microsoft

Jun 16, 2020

Azure Synapse Analytics is a big data analytics service that enables you to analyze data on Azure storage using SQL language. You can use SQL endpoints (provisioned and serverless) to analyze your data using SQL language.

SQL is very powerful analytic language that can be used to implement performant analysis on big data. As an example, you can write a SQL query that creates a linear regression model on your datasets. Phil Factor described how to implement simple linear regression using SQL language.

In this article, you will see how you can create simple linear regression model to determine COVID mortality rate using Azure Open Dataset and Synapse SQL endpoint in Synapse Analytics.

Prerequisite

To try this sample, you need to have Azure Synapse Analytics workspace. If you don’t have one you can easily deploy it using Azure portal or this Deployment template. Synapse workspace automatically deploys one serverless Synapse SQL endpoint that is everything we need for this kind of analysis. With the workspace you are also getting easy-to-use web UI called Synapse Studio that enables you to start analyzing files in your browser.

NOTE: You need Synapse SQL serverless (on-demand) query endpoint to execute the code in this article. The functionalities used in this article are still not available in provisioned  endpoint.

COVID dataset

In this sample is used the latest available public data on COVID-19 cases worldwide from the European Center for Disease Prevention and Control (ECDC). Each row/entry contains the number of new cases reported per day and per country. For more information about this dataset, see  here. Dataset is updated on daily basis and placed as a part of Azure Open Dataset.

OPENROWSET function

Synapse SQL endpoint in Synapse Analytics enables you to use OPENROWSET T-SQL function to analyze data on Azure Data Lake storage. This is easy to use function where you need to provide URL of some publicly available file on Azure storage and you will get the content of the file. Example of OPENROWSET function that reads the content of a parquet file placed on Azure storage is shown in the following example:

select * 
from openrowset(bulk 'https://****.blob.core.windows.net/.../cases.parquet', 
                format='parquet') as cases

If you provide a file URL (for example https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet) after the BULK keyword and specify that you are using parquet format, you will get the content of file.

This file is publicly available, so you don’t need some credential to read the content of the file. If the file is protected using Azure AD authentication or SAS token, follow the instructions in this article to setup permissions.

Simple linear regression

Linear regression is simple, yet effective prediction model used to estimate the value of variable Y based on the known value of variable X under the assumption that there is a linear dependency between them. The formula for simple linear regression is:

Y = a + b*X

Mathematically, we say that Y is proportional to X . There is another value called R factor that describes how reliable is the model. The models with low R factor are not reliable.

Synapse SQL can easily determine the coefficients of linear regression model based on a set of (X, Y) values that you select from some file. If we use the simple linear regression sample that Phil Factor described in his article, we need to create a query that selects some values X and Y using some SQL query and put it into the data part of the following query:

with data as ( 
--- insert query here: 
    select x = , y =  
    from ...... 
--- end of query 
) 
 SELECT  
    ((Sy * Sxx) - (Sx * Sxy)) 
    / ((N * (Sxx)) - (Sx * Sx)) AS a, 
    ((N * Sxy) - (Sx * Sy)) 
    / ((N * Sxx) - (Sx * Sx)) AS b, 
    ((N * Sxy) - (Sx * Sy)) 
    / SQRT( 
        (((N * Sxx) - (Sx * Sx)) 
         * ((N * Syy - (Sy * Sy))))) AS r 
    FROM 
      ( 
      SELECT SUM(data.x) AS Sx, SUM(data.y) AS Sy, 
        SUM(data.x * data.x) AS Sxx, 
        SUM(data.x * data.y) AS Sxy, 
        SUM(data.y * data.y) AS Syy, 
        COUNT(*) AS N 
        FROM data 
      ) sums;

The query in data part will provide the result set that will be used for linear regression analysis and the rest of the query will return a, b, and r parameters of the linear regression model applied on that result set. You can find more info about this query in this article.

The important thing that you need to notice here is that this is linear regression model is not created on client side. This query can be executed on large dataset and build prediction model based on large number of records without transferring data to the client side.

Applying linear regression on COVID dataset

Now it is time to use SQL endpoint to perform linear regression on the COVID dataset. Probably the easiest tool that you can use to analyze file in Synapse Studio – Web UI where you can write a T-SQL query and see results in browser.

We can begin analysis by exploring the columns in dataset on first 10 records:

In this set we can see that we are getting new daily entries for the reported COVID cases in each country. The interesting columns are number of new cases (cases), number of fatal cases (deaths), id of the country (geo_id), and the population of the country recorded 2018 (pop_data_2018).

Let’s test a hypothesis that number of fatal cases (deaths) is proportional to the number of total cases. First let’s get the number of total and fatal cases per each country using the following query:

select geo_id, cases = sum(CASES), deaths = sum(DEATHS) 
    from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet', 
                format='parquet') as a 
    group by geo_id 
order by cases;

If we browse the results of this query using Synapse Studio, we will see that we have many countries with small number of cases and some countries with many fatal cases.

There are some countries that avoided pandemic with small number of cases and other big countries where pandemic escalated like in USA, Brazil, Italy, and Spain. Therefore, we will exclude countries with less than 100 cases and focus on the mid-size countries with the population between 50K and 11 million. We can easily add this filter condition in the having clause of the query:

select geo_id, cases = sum(CASES), deaths = sum(DEATHS) 
    from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet', 
                format='parquet') as a 
    group by geo_id 
    having AVG(CAST(POP_DATA_2018 AS BIGINT)) between 50000 and 11000000 
    AND sum(CASES) > 100 
    and sum(DEATHS) > 100 
order by cases;

Now it is time to add the result of this query into the linear regression query. We will rename the columns to X and Y, cast them to float to avoid arithmetic overflow, remove order by clause because it is unnecessary and run the following query:

with  data as ( 
--- insert query here: 
    select x = CAST(sum(CASES) AS FLOAT), y = CAST(sum(DEATHS) AS FLOAT) 
    from openrowset(bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.parquet', 
                format='parquet') as a 
    group by geo_id 
    having AVG(CAST(POP_DATA_2018 AS BIGINT)) between 50000 and 11000000 
    AND sum(CASES) > 100 
    and sum(DEATHS) > 100 
--- end of query 
) 
 SELECT  
    ((Sy * Sxx) - (Sx * Sxy)) 
    / ((N * (Sxx)) - (Sx * Sx)) AS a, 
    ((N * Sxy) - (Sx * Sy)) 
    / ((N * Sxx) - (Sx * Sx)) AS b, 
    ((N * Sxy) - (Sx * Sy)) 
    / SQRT( 
        (((N * Sxx) - (Sx * Sx)) 
         * ((N * Syy - (Sy * Sy))))) AS r 
    FROM 
      ( 
      SELECT SUM(data.x) AS Sx, SUM(data.y) AS Sy, 
        SUM(data.x * data.x) AS Sxx, 
        SUM(data.x * data.y) AS Sxy, 
        SUM(data.y * data.y) AS Syy, 
        COUNT(*) AS N 
        FROM data 
      ) sums;

The result of this query will give us A and B parameters of the linear regression model that estimates mortality rate of COVID:

A	B	R
-24.7405767572961	0.0356209053608031	0.57349202459482

Using these parameters, we might assume that the following formula will estimate number of fatal cases:

deaths = 0.0356209053608031 * cases -24.7405767572961

We might conclude that around 3.6% of total cases is fatal, but by looking at the R factor of this model 0.57349202459482 we can see that this is not reliable estimate. In this case, we can conclude moderate correlation between number of total cases and number of fatal cases in this dataset. For more reliable analysis, a larger data sample is needed.

However, the approach used in this example shows how to apply linear regression model on any other dataset.

Conclusion

Synapse SQL endpoint and Synapse Studio in Azure Synapse Analytics are tools for easy data exploration that enable you to easily read the content of a file and do basic data prediction on very large datasets. In this example, we have seen how to leverage T-SQL language to create simple linear regression model on a COVID dataset. Although the model created on the dataset is not reliable enough, we have seen the approach that you can use to create your own simple linear regression models on any other dataset.

You can find more information about querying files in Azure Data Lakes Storage in this article.

Updated Jun 16, 2020

Version 4.0

Synapse SQL

synapse studio

JovanPop

Microsoft

Joined March 07, 2019

View Profile

Azure Synapse Analytics Blog

Follow this blog board to get notified when there's new activity