Scaling Up Syslog CEF Collection
Published Feb 20 2020 04:13 PM

This blog post is authored by Nicholas DiCola

 

In the last few months working on Microsoft Sentinel, many customers have asked me about scaling up syslog CEF collection for getting data into Microsoft Sentinel.  I have created two sample architectures with deployment code for this purpose.  The samples are available at:

 

CEF-VMSS is for deploying native Microsoft Sentinel CEF collection by sending syslog CEF messages to rsyslog, which then sends the messages to the Log Analytics Agent.

 

Logstash-VMSS is for deploying Logstash on the VMs to do message manipulation; Logstash then sends the messages to the Log Analytics Agent.  You may also want to use this architecture and change the input to a source like Kafka.

 

I will not deep dive into every topic in this architecture; you can research each on your own.  Instead, I will focus on an overview of the architecture.

 

Virtual Machine Scale Set

The architecture starts with a VMSS, which lets you create and manage a group of virtual machines.  A VMSS can automatically add and remove instances based on a schedule or on demand.  The sample uses autoscale settings that scale the VMSS out and in based on CPU load, which tracks the volume of messages being sent (see the CLI sketch below for tuning the rules).
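If you later want to tune the thresholds without redeploying, the autoscale rules can also be managed from the Azure CLI.  This is only a rough sketch, not a command from the samples; the resource group and autoscale setting names are placeholders:

# Add a rule that scales out by one instance when average CPU over 5 minutes exceeds 75%
az monitor autoscale rule create \
  --resource-group <rg-name> \
  --autoscale-name <autoscale-setting-name> \
  --condition "Percentage CPU > 75 avg 5m" \
  --scale out 1

A matching scale-in rule (for example, --scale in 1 when CPU drops below a lower threshold) keeps the instance count from ratcheting up permanently.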

 

I have included a Load Balancer in front of the VMSS, which allows you to configure one destination IP address (the public IP address); the Load Balancer then spreads the incoming messages across the running instances.

 

There are two ARM templates, one for Red Hat and one for Ubuntu.  The templates deploy everything needed for the architecture.  One key part of the ARM templates is the use of cloud-init to configure the VMSS instances as they are created.  Below are the Ubuntu cloud-init files.

 

Cloud Init for CEF-VMSS:

#cloud-config
package_upgrade: true
runcmd:
  - sudo apt-get update
  - sudo wget https://raw.githubusercontent.com/Azure/Azure-Sentinel/master/DataConnectors/CEF/cef_installer.py && sudo python cef_installer.py

As you can see from the cloud-init file, it installs updates and the Log Analytics Agent using the Microsoft Sentinel CEF script.  The ARM template appends the workspace ID and workspace key to the last line so that the agent connects to the right workspace.
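For illustration, assuming the installer takes the workspace ID and primary key as positional arguments (the bracketed values are placeholders, not real credentials), the completed last line would look something like this:

  - sudo wget https://raw.githubusercontent.com/Azure/Azure-Sentinel/master/DataConnectors/CEF/cef_installer.py && sudo python cef_installer.py <workspace-id> <workspace-primary-key>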

 

Cloud Init for Logstash-VMSS:

#cloud-config
package_upgrade: true
packages:
  - default-jre
runcmd:
  - wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
  - sudo apt-get install -y apt-transport-https
  - echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
  - sudo apt-get update
  - sudo apt-get install -y logstash
  - sudo /usr/share/logstash/bin/logstash-plugin install logstash-output-syslog
  - sudo /usr/share/logstash/bin/logstash-plugin update
  - wget -q https://raw.githubusercontent.com/Azure/Azure-Sentinel/master/DataConnectors/Logstash-VMSS/logstash.config -O /etc/logstash/conf.d/logstash.conf
  - echo "update this line with wget -q https://sourceURL -O /etc/logstash/pipelines.yml if you have a custom pipelines file"
  - sudo systemctl start logstash.service
  - sudo wget https://raw.githubusercontent.com/Azure/Azure-Sentinel/master/DataConnectors/CEF/cef_installer.py && sudo python cef_installer.py

It installs Java, Logstash, the Logstash syslog output plugin, and the Log Analytics Agent using the Microsoft Sentinel CEF script.  The ARM template appends the workspace ID and workspace key to the last line so that the agent connects to the right workspace.

CEF

CEF is our default way to collect logs from external solutions like firewalls and proxies.  The CEF install script installs the Log Analytics agent, configures rsyslog, and configures the agent for CEF collection.

Logstash

Logstash dynamically ingests, transforms, and ships your data regardless of format or complexity.  It has many input, filter, and output plugins.  These let you collect data from many sources, manipulate the event data, and output it to the Log Analytics Agent locally on the machine.  Because there are so many input plugins, it is easy to connect other sources like Kafka, as sketched below.
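For instance, swapping the syslog input for Kafka only requires replacing the input block.  Here is a minimal sketch; the broker address and topic name are placeholder assumptions, not values from the sample:

input {
  kafka {
    bootstrap_servers => "kafka-broker:9092"   # placeholder broker address
    topics => ["cef-events"]                   # placeholder topic name
    codec => cef
  }
}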

 

Here is the sample logstash.conf file that is used in the sample architecture:

input {
  tcp {
    port => 5514
    type => syslog
    codec => cef
  }
  udp {
    port => 5514
    type => syslog
    codec => cef
  }
}

filter {
  geoip {
    source => "src"
    target => "srcGeoIP"
    add_field => { "sourceLongitude" => "%{[srcGeoIP][longitude]}" }
    add_field => { "sourceLatitude" => "%{[srcGeoIP][latitude]}" }
  }
  geoip {
    source => "dst"
    target => "dstGeoIP"
    add_field => { "destinationLongitude" => "%{[dstGeoIP][longitude]}" }
    add_field => { "destinationLatitude" => "%{[dstGeoIP][latitude]}" }
  }
  mutate {
    add_field => { "agentReceiptTime" => "%{@timestamp}" }
  }
}

output {
  syslog {
    host => "127.0.0.1"
    port => 25226
    protocol => "tcp"
    codec => cef {
      reverse_mapping => true
      delimiter => "\r\n"
      vendor      => "%{deviceVendor}"
      product     => "%{deviceProduct}"
      version     => "%{deviceVersion}"
      signature   => "%{deviceEventClassId}"
      name        => "%{name}"
      severity    => "%{severity}"
    fields => [
      "deviceAction",
      "applicationProtocol",
      "deviceCustomIPv6Address1",
      "deviceCustomIPv6Address1Label",
      "deviceCustomIPv6Address2",
      "deviceCustomIPv6Address2Label",
      "deviceCustomIPv6Address3",
      "deviceCustomIPv6Address3Label",
      "deviceCustomIPv6Address4",
      "deviceCustomIPv6Address4Label",
      "deviceEventCategory",
      "deviceCustomFloatingPoint1",
      "deviceCustomFloatingPoint1Label",
      "deviceCustomFloatingPoint2",
      "deviceCustomFloatingPoint2Label",
      "deviceCustomFloatingPoint3",
      "deviceCustomFloatingPoint3Label",
      "deviceCustomFloatingPoint4",
      "deviceCustomFloatingPoint4Label",
      "deviceCustomNumber1",
      "deviceCustomNumber1Label",
      "deviceCustomNumber2",
      "deviceCustomNumber2Label",
      "deviceCustomNumber3",
      "deviceCustomNumber3Label",
      "baseEventCount",
      "deviceCustomString1",
      "deviceCustomString1Label",
      "deviceCustomString2",
      "deviceCustomString2Label",
      "deviceCustomString3",
      "deviceCustomString3Label",
      "deviceCustomString4",
      "deviceCustomString4Label",
      "deviceCustomString5",
      "deviceCustomString5Label",
      "deviceCustomString6",
      "deviceCustomString6Label",
      "destinationHostName",
      "destinationMacAddress",
      "destinationNtDomain",
      "destinationProcessId",
      "destinationUserPrivileges",
      "destinationProcessName",
      "destinationPort",
      "destinationAddress",
      "destinationUserId",
      "destinationUserName",
      "deviceAddress",
      "deviceHostName",
      "deviceProcessId",
      "endTime",
      "fileName",
      "fileSize",
      "bytesIn",
      "bytesOut",
      "eventOutcome",
      "transportProtocol",
      "requestUrl",
      "deviceReceiptTime",
      "sourceHostName",
      "sourceMacAddress",
      "sourceNtDomain",
      "sourceProcessId",
      "sourceUserPrivileges",
      "sourceProcessName",
     "sourcePort",
      "sourceAddress",
      "startTime",
      "sourceUserId",
      "sourceUserName",
      "agentHostName",
      "agentReceiptTime",
      "agentType",
      "agentId",
      "cefVersion",
      "agentAddress",
      "agentVersion",
      "agentTimeZone",
      "destinationTimeZone",
      "sourceLongitude",
      "sourceLatitude",
      "destinationLongitude",
      "destinationLatitude",
      "categoryDeviceType",
      "managerReceiptTime",
      "agentMacAddress"
      ]
    }
  }
}

The inputs accept both TCP and UDP on port 5514.  I used 5514 because Logstash runs as non-root and requires special configuration to use port 514, and I decided to keep it simple.  On input, Logstash expects CEF format via “codec => cef” and tags the event as syslog.
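If you want to sanity-check the listener end to end, you can fire a hand-crafted CEF message at the load balancer.  The message content and the address below are illustrative placeholders, not values from the templates:

# Send one fake CEF event over UDP to the load balancer's public IP
echo 'CEF:0|TestVendor|TestProduct|1.0|100|Test Event|5|src=10.1.2.3 dst=10.4.5.6' | nc -u -w1 <load-balancer-ip> 5514

The event should then show up in CommonSecurityLog after the normal ingestion delay.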

 

Once an event is accepted, a few filters run.  The first uses the GeoIP plugin, which uses the local GeoLite2 database to look up the source and destination IP addresses.  The results land in a custom field, and to align with the standard CEF field names I then use add_field to bring the latitude and longitude into proper fields.  I also use mutate to copy the message received time into agentReceiptTime.  This is important because Logstash will send the message to Log Analytics using its own time, which ends up as TimeGenerated in Log Analytics.  Doing this lets you see both the original send time and the time Logstash sent it; a simple comparison shows how long processing takes.

 

In the output section, I use the syslog plugin to output the message to the agent, which listens on TCP 127.0.0.1:25226.  I set the output plugin to the CEF codec again, and there are a couple of important settings.  “reverse_mapping => true” ensures that the message is sent using the short names (src vs. sourceAddress), which is required by Log Analytics.  The fields list must name every field you want to send; I have included all fields the CEF codec supports.  If a field doesn’t exist in the event, it won’t be sent.
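To verify on an instance that the agent really is listening on that port, standard Linux tooling is enough (nothing here is specific to the sample):

# List listening TCP sockets and filter for the agent's CEF port
sudo ss -ltn | grep 25226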

Microsoft Sentinel

Once the data is sent to the agent, it follows the normal CEF collection and ingestion process and ends up in CommonSecurityLog.  You can monitor the VMSS events per second (EPS) using the following query:

CommonSecurityLog
| where _TimeReceived > ago(20m)
| summarize count() by bin(_TimeReceived, 1m), _ResourceId
| extend count = count_ / 60
| sort by _TimeReceived desc

This gets all logs from the last 20 minutes and summarizes them by _TimeReceived and _ResourceId.  That yields the number of events per minute, so the query creates a count column equal to count_ divided by 60 seconds.  Now you can see the EPS per VMSS instance.
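If you only want a single overall EPS figure across all instances rather than a per-instance breakdown, a simpler variant of the query (same standard columns, same 20-minute window) is:

CommonSecurityLog
| where _TimeReceived > ago(20m)
| summarize EventsPerSecond = count() / (20 * 60.0)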

If you have performance issues, I recommend looking into rsyslog performance or Logstash performance tuning.

 

Some future improvements I might add:

  • Implement impstats for rsyslog and send data to Log Analytics.  This would allow performance monitoring of rsyslog (dashboarding, queries)
  • Implement GeoIP in rsyslog
  • Implement Logstash monitoring APIs and send data to Log Analytics.  This would allow performance monitoring of Logstash (dashboarding, queries)
  • Create additional sample using fluentd
  • Create additional sample using syslog-ng

Thanks for reading!

14 Comments
Copper Contributor

I moved this feedback to the github repo

Copper Contributor
Hi Nicholas, I was wondering if you know how I can daisy chain collectors like in the timed YouTube link (https://youtu.be/_mm3GNwPBHU?list=PLOhMGpMOPKRHPHCvzia3EE5OY5EkQRCuH&t=896) presented by one of Microsoft's Sentinel guys called Ofer. I was able to install the CEF-Syslog Ubuntu server on-prem but I am trying to do the Syslog-Collector proxy. I believe there must be some configs that need to change on both the nodes. Simply, the architecture I want to put in place is:

1. Install a CEF-Syslog on-prem Ubuntu machine to collect all CEF and Syslog sources
2. Forward the data from the CEF-Syslog machine to another Syslog-Collector-Proxy in another segment of my network to forward to Azure to Sentinel

Can you help? Thank you, Egal
Copper Contributor

I see the Standard_B16ms SKU is used in the VMSS.

Was this a deliberate choice or just for the sample?

Hi @eciruam I actually need to update the template.  I worked with the agent team and they did some testing.  With an F4s_v2 VM size they were able to achieve higher EPS and recommended that as the size.

Copper Contributor

Nicholas,

 

Do you have any concerns about scaling down a VM scale set without knowing if its queue is empty?  Isn't there some risk of lost messages?

 

I am toying with the idea of adapting my syslog-ng front-ends to push messages into a service bus queue (or multiple queues), and then using a custom input into Logstash to peek a message from the queue, process it through Logstash, and then dequeue the message after processing.

 

That would remove the need for load balancers, and allow the scale sets to expand/contract based on load (or ideally service bus queue sizes), and guarantee delivery of the messages.

 

thoughts?

@Justin Ainsworth 

Theoretically, yeah, I guess that could happen, but you could set autoscale settings to check network out instead of CPU: https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-autosca...

 

Yeah, you could do some kind of bus in Azure.  I went the simple route to solve a customer request.

Copper Contributor

Hi Nicholas,

 

If someone's organisation policy does not allow SSH and a public IP on the load balancer, what is the workaround for this situation?

Copper Contributor

Hi Nicholas DiCola (SECURITY JEDI) 

 

I have deployed the solution using the template, thank you for putting the work in to make it happen.

 

I am using the Ubuntu script.

 

I note that when a VM is created, the syslog service is not listening.  If I execute cef_installer.py with the workspace ID and key from the root folder, it reinstalls/reconfigures and the syslog service is listening.

 

I am doing nothing different from the installer but am getting a different outcome.

 

Can you suggest anything so that this does not require human intervention?

 

Thank you,

George

Copper Contributor

Hi Nicholas,

 

My questions might be somewhat general.

1.) What types of data formats (syslog, CEF, custom) can be taken up by Logstash, for which it has some kind of mapping mechanism to ingest the data into a Log Analytics workspace?

2.) Does Logstash have any mapping mechanism to convert data from any data source into CEF or syslog, which I suppose are the preferred choices for Sentinel?

3.) Do we really need the Log Analytics Agent between Logstash and the Log Analytics workspace of Sentinel?

4.) Which would be the best choice of data format (syslog, CEF, or custom) out of Logstash to ingest into the Log Analytics workspace of Sentinel?

 

5.) Can the Logstash-VMSS be deployed on-prem?

 

6.) Could you please suggest the best choices of data intake format into Logstash from any data source (I do understand various data sources may have their own data formats), and of data output format from Logstash to the Log Analytics workspace of Sentinel?

 

Regards,

Simranjeet

 

 

 

 

1.) What types of data formats (syslog, CEF, custom) can be taken up by Logstash, for which it has some kind of mapping mechanism to ingest the data into a Log Analytics workspace?

Logstash is an event pipeline system.  It has many input plugins: https://www.elastic.co/guide/en/logstash/current/input-plugins.html

It can also transform data during parsing.  So, to answer: many data formats can be changed to match CEF output.  This VMSS solution was designed specifically to solve a customer need to get syslog (CEF format) messages, add GeoIP information, then send them to Log Analytics.

2.) Does Logstash have any mapping mechanism to convert data from any data source into CEF or syslog, which I suppose are the preferred choices for Sentinel?

Yes, please review the Logstash documentation.

3.) Do we really need the Log Analytics Agent between Logstash and the Log Analytics workspace of Sentinel?

For this scenario, yes.  CommonSecurityLog data comes from the agent only.  You could change the output to custom logs using the Logstash output plugin for Log Analytics.

4.) Which would be the best choice of data format (syslog, CEF, or custom) out of Logstash to ingest into the Log Analytics workspace of Sentinel?

There is no simple answer; it depends on the data source.

5.) Can the Logstash-VMSS be deployed on-prem?

Logstash can be deployed anywhere, and so can the Log A Agent.  The VMSS is an Azure-specific resource, so you would need to write your own installer scripts for on-prem.

6.) Could you please suggest the best choices of data intake format into Logstash from any data source (I do understand various data sources may have their own data formats), and of data output format from Logstash to the Log Analytics workspace of Sentinel?

It depends on your data source.  If it is a network appliance like a firewall, normally that’s CEF.  Windows event logs should go to the Windows event or security event tables.

Copper Contributor

Hi Nicholas,

I am grateful for the replies to my previous question.

You have told me that Logstash and the Log A Agent can be deployed anywhere, and I have seen on this page https://docs.microsoft.com/en-us/azure/sentinel/connect-logstash how to configure Logstash and send logs to a Log Analytics workspace using a Log A output plugin.

I wish to know: 

 

1.) In the above link, there is no mention of the Log A Agent, which should be there in the case of Logstash-VMSS's architecture. As per my understanding, the Log A output plugin takes logs from Logstash and ingests them into the Log A workspace in a custom table, which I suppose is neither syslog nor CEF format. Am I correct on this point?

 

2.) Can we build such a model where the whole of Microsoft's grand list of data sources (https://techcommunity.microsoft.com/t5/azure-sentinel/azure-sentinel-the-connectors-grand-cef-syslog...) can be ingested into Sentinel in one single common format (CEF/syslog, not custom) using Logstash?

 

3.) I read somewhere that Sentinel gives better monitoring, analytics, correlation, incident generation, etc. if data from all data sources is ingested into Sentinel in CEF/syslog format. There can be quality issues if every data source has its own custom data format and table in Sentinel, which will not allow Sentinel to do better analytics on the data because of the randomness of data field names. Am I correct on this point?

 

4.) Can Sentinel perform data correlation and analytics if there are N custom tables present for different data source security appliances?

 

Nicholas, I am sorry if I am picking your brain too much.

 

Regards,

Simran

 

1.) In the above link, there is no mention of the Log A Agent, which should be there in the case of Logstash-VMSS's architecture. As per my understanding, the Log A output plugin takes logs from Logstash and ingests them into the Log A workspace in a custom table, which I suppose is neither syslog nor CEF format. Am I correct on this point?

The document covers how to connect Logstash to send custom logs.  The VMSS sends syslog to the CEF (CommonSecurityLog) table in Log A.  You can use either option, but the VMSS was built very specifically to get CEF logs into the CEF table.

 

2.) Can we build such a model where the whole of Microsoft's grand list of data sources (https://techcommunity.microsoft.com/t5/azure-sentinel/azure-sentinel-the-connectors-grand-cef-syslog...) can be ingested into Sentinel in one single common format (CEF/syslog, not custom) using Logstash?

Yes you could.

 

3.) I read somewhere that Sentinel gives better monitoring, analytics, correlation, incident generation, etc. if data from all data sources is ingested into Sentinel in CEF/syslog format. There can be quality issues if every data source has its own custom data format and table in Sentinel, which will not allow Sentinel to do better analytics on the data because of the randomness of data field names. Am I correct on this point?

Correct.  CEF is a standard format, so running queries is much easier when all syslog data is in the same format.  If each source has its own custom log, then you need to write queries for each custom source.

 

4.) Can Sentinel perform data correlation and analytics if there are N custom tables present for different data source security appliances?

Yes but it requires a more complex query.  Hence using CEF makes it easier.

Copper Contributor

Does it make sense to configure the same syslog collector used for Microsoft Sentinel to also serve Defender for Cloud Apps on a CentOS 7 box? Microsoft Sentinel syslog is already working, but I'm thinking of running the Cloud Apps collector so they work simultaneously on the same box.

Hi @ENEMIESENEMY 

No.  Both listen on syslog port 514, so they would conflict.
