Sep 19 2022 08:51 PM
* There are graphs in this post, but I can't seem to upload/insert them; please visit the link in each part to see the picture.
Data prioritization is an issue that any SIEM or data gathering and analysis solution must consider. The log that we collect to SIEM is typically security-related and capable of directly creating alerts based on the event of that log, such as EDR alerts. However, not all logs are equally weighted. For example, the proxy log only contains the connections of internal users, which is very useful for investigation, but it does not directly create alerts and has a very high volume. To demonstrate this, we categorize the log into the primary log and secondary log based on its security value and volume.
The metadata and context of what was discovered are frequently contained in the primary log sources used for detection. However, secondary log sources are sometimes required to present a complete picture of a security incident or breach. Unfortunately, many of these secondary log sources are high-volume verbose logs with little relevance for security detection. They aren’t useful unless a security issue or threat search requires them. On the current traditional on-premise solution, we will use SIEM alongside a data lake to store secondary logs for later use.
Because we have complete control over everything, we can use any technology or solution, making it simple to set up (Eg. Qradar for SIEM and ELK for data lake). However, for cloud-naive SIEM, this becomes more difficult, particularly with Microsoft Sentinel. Microsoft Sentinel is a cloud-native security information and event manager (SIEM) platform that includes artificial intelligence (AI) to help with data analysis across an enterprise. To store and analyze everything for Sentinel, we typically use Log Analytics with the Analytics Logs data plan. However, this is prohibitively expensive, costing between $2.00 and $2.50 per GB ingested per day depending on the Azure region used.
To store these secondary data, the present approach uses Blob Storage. Blob storage is designed to hold large volumes of unstructured data, which implies it does not follow a certain data model or specification, such as text or binary data. This is a low-cost option for storing large amounts of data. The architecture for this solution is as follows:
However, Blob Storage has a limitation that is hard to ignore. The data in Blob Storage is not searchable. We can circumvent this by using as demonstrated in Search over Azure Blob Storage content, but this adds another layer of complexity and pricing that we would prefer to avoid. The alternative option is to use KQL externaldata, but this is designed to obtain small amounts of data (up to 100 MB) from an external storage device, not massive amounts of data.
Our solution used Basic Logs to tackle this problem. Basic Logs is a less expensive option for importing large amounts of verbose log data into your Log Analytics workspace. The Basic log also supports a subset of KQL, making it searchable. To get the log into the Basic Log, We need to use a Custom table generated with the Data Collection Rule (DCR)-based logs ingestion API. The structure is as follows:
In our experiment, we use the following component for the architecture:
Component | Solution | Description |
Source Data | VMware Carbon Black EDR | Carbon Black EDR is an endpoint activity data capture and retention solution that allows security professionals to chase attacks in real-time and observe the whole attack kill chain. This means that it captures not only data for alerting, but also data that is informative, such as binary or host information. |
Data Processor | Cribl Stream | Cribl helps process machine data in real-time - logs, instrumentation data, application data, metrics, and so on - and delivers it to a preferred analysis platform. It supports sending logs to Log Analytics, but only with the Analytics plan. |
To send the log to the Basic plan, we need to set up a data collection endpoint and rule, please see Logs ingestion API in Azure Monitor (Preview) for additional information on how to set this up. And we also use a Logic App as a webhook to collect the log and send it to the Data collection endpoint.
The environment we use for log generation is as follows:
The number of logs we collected for our test environment are:
The cost is based on the East US region, the currency is the USD, and the Pay-As-You-Go Tier was used to determine the number saved using the generated data with 1000 hosts and 30 days retention period.
The calculation using only Analytic Log
Table | Ingestion Volume (GB) | Cost per GB (USD) | Total cost per day (USD) | Total cost per retention period (USD) | Host number | Retention (Days) |
Cb_logs_Raw_CL | 2.16 | 2.3 | 4.96 | 148.84 | 1000 | 30 |
Cb_logs_alert_CL | 1.19 | 2.3 | 2.73 | 81.81 | 1000 | 30 |
|
| Total | 7.69 | 230.66 |
|
|
If we use Analytic Log with Storage Account
Table | Ingestion Volume (GB) | Cost per GB (USD) | Total cost per day (USD) | Total cost per retention period (USD) | Host number | Retention (Days) |
Cb_logs_Raw_CL | 2.16 | 0.02 | 0.04 | 1.29 | 1000 | 30 |
Cb_logs_alert_CL | 1.19 | 2.3 | 2.73 | 81.81 | 1000 | 30 |
|
| Total | 2.77 | 83.11 |
|
|
If we use Analytic Log with Basic Log
Table | Ingestion Volume (GB) | Cost per GB (USD) | Total cost per day (USD) | Total cost per retention period (USD) | Host number | Retention (Days) |
Cb_logs_Raw_CL | 2.16 | 0.5 | 1.08 | 32.36 | 1000 | 30 |
Cb_logs_alert_CL | 1.19 | 2.3 | 2.73 | 81.81 | 1000 | 30 |
|
| Total | 3.81 | 114.17 |
|
|
Now let’s compare these 3 solutions together and get an overall look altogether.
| Only Analytic Log | Analytic Log with Storage Account | Analytic Log with Basic Log |
Cost calculated | $230.66 | $83.11 | $114.17 |
Searchable | Yes | No | Yes but cost $0.005 per GB |
Retention | Up to 2,556 days (7 years) | 146,000 days (400 years) | Up to 2,556 days (7 years) |
Even though the Basic Log is an excellent choice for ingesting hot data, it does have some limitations that are difficult to overlook:
This is the first post in this Sentinel Cost Optimization series. I hope this helps you have another choice to consider when setting up and sending your custom log to Sentinel.