In this blog, you will learn how to use Azure Synapse Analytics to query data collected into Azure Log Analytics and was exported to Azure Data Lake storage. Log analytics new “Data export” feature allows you to also send the collected logs to Azure Data Lake Storage. Exported data can be retained for very long periods of time relatively cheap.
This capability can be used in several scenarios:
First, you are looking to run machine learning or batch learning models on top of data collected to a Log Analytics workspace.
Another option, you need to export Log Analytics data to another warm storage, might be for auditing purposes, this will also allow you to perform data manipulation\filtering in the process.
Last, you might want to retain data also externally to Log Analytics workspace, and you want to reduce costs, by converting the data to Parquet format, again after performing data manipulation\filtering in the process.
This blog will review couple of options to analyze the exported data. Using Azure Synapse, a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics. Synapse provides different tools and approaches to work with the collected data, including T-SQL scripts, spark code, and machine learning models.
In this blog, two features of Azure Synapse Analytics service will be used:
Spark model is designed for machine learning or complex (batch) analytics, while serverless database is especially useful for auditing scenario, for users that are familiar with T-SQL.
Setting up the environment
To setup the environment, you will first need to configure “data export” from your Log Analytics workspace to an ADLS gen2 (Azure data lake storage). Next step, you will create an Azure Synapse environment and connect the ADLS gen2 to this environment.
The first step, setting up "Data export" on the Log Analytics workspace. By setting up Log Analytics data export, you will be able to export specific Log Analytics tables to the ADLS Gen2 storage as JSON files. The data is exported to your storage account as hourly append blobs. The JSON files are created in a hierarchical folder structure, with a date/time prefixes in file or blob names. This allows ad hoc queries and hourly/daily jobs to filter blobs by time range.
Folder structure: WorkspaceResourceId=/subscriptions/{subscriptionId}/resourcegroups/{ResourceGroupName}/providers/microsoft.operationalinsights/workspaces/{WorkspaceName}/y=YYYY/m=MM/d=dd/h=hh/m=mm/PT1H.json
Next, let's spin up a Synapse environment, and link it to your ADLS storage.
Follow the instructions to Create and setup a Synapse workspace.
Make sure to set the right permissions for any user that will access Synapse studio:
For details on Azure Synapse Access control, see this article.
There are multiple ways to process the exported data, this blog will review two of them. First, you will learn to use the Spark engine to query the exported data. In the second scenario, you will first learn to convert the exported data to Parquet file, then learn how to create external tables based on the Parquet files, using Synapse’s serverless database engine.
First scenario – Query the exported data using Spark engine
This scenario will use Spark engine, this engine is usually more oriented for users from the machine learning space.
Before we can start to work, let’s configure the Spark environment:
Select the "Manage" icon --> "Apache spark pools" --> "New"
Name the pool "DefaultPool", Node size "small", Auto-scale "Disabled"\ number of nodes 3.
--> click "Create".
As the simplest example, let's try to count how many records have been exported to the Perf table.
Note: First run will take longer, as the spark pool needs to spin up.
Second scenario – create external data table using converted Parquet files
In the first scenario, the data was kept in the original JSON format. In the following steps, you will learn how to convert the JSON files to Parquet files and create an external table using Synapse’s serverless database engine. This approach will allow you to get better performance both in querying speed and lower storage costs, though you will need to keep the Synapse environment running.
This method can also allow you to perform data manipulation, for example: filter rows, remove columns, and even join between different tables.
Parquet is a popular column oriented binary format and has some useful advantages over the JSON formatted files, specifically:
Couple of notes, Serverless will be slower compared to dedicated pool. You might want to consider loading the data to dedicated clusters at a later stage, also there is a restriction that serverless does not support time series, geo-spatial data.
Shared metadata tables - Azure Synapse Analytics | Microsoft Docs
Note: you can still use spark as in the first scenario, but load from the Parquet files, and still benefit from the same fast loading time.
First step, you will create a ‘copy activity’, the activity will convert the exported container logs JSON files to Parquet format:
Now, the copy task will be deployed, this process might take a while.
You will see in the "Integrate" tab on the left, under “Pipelines” your new pipeline should be listed. When clicking on the pipeline, a canvas with "Copy data" task will open.
To test the pipeline, first trigger the task to see that it is working as expected, on the top side of the canvas, Click "Trigger" --> "Trigger now" --> Click "Ok"
* Here we can add any transformation task and connecting this as the sync to the copy task.
Data flows - Azure Synapse Analytics | Microsoft Docs
Managing the mapping data flow graph - Azure Data Factory | Microsoft Docs
After we confirmed that the copy task works, we have two options, either we will trigger the task each time we want to query the data, this might be useful for scenarios where the data set is small, we are not running many operations on the data, and we are willing to “pay” for the extra operation time.
The other option is to create a scheduled trigger, this will continuedly convert the JSON file to parquet files.
Create tumbling window triggers in Azure Data Factory - Azure Data Factory | Microsoft Docs
Pipeline execution and triggers in Azure Data Factory - Azure Data Factory | Microsoft Docs
Second step, create an External table on-top of the Parquet files:
Last step, query the data:
Conclusion
In this blog, we covered two possible methods for analyzing data exported from Azure Log Analytics using Azure synapse. Both methods suggested are simple, quick to deploy and effective.
For example, in the second scenario described in this blog, we are using Synapse's serverless database to create external table, using this option allows us to load the data from the files converted from JSON to Parquet format, this way we also save storage space and get better query performance in Synapse. We should keep in mind, that using dedicated SQL Pool will give better performance but will cost more.
We do recommend to continue and explore the many features Azure Synapse provides and find the best flow for your particular business need.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.