Detect Masqueraded Process Name Anomalies using an ML notebook

Former Employee

Aug 11, 2022

Process masquerading is an extremely common attack technique. It occurs when the name or location of a legitimate process is manipulated so that malicious behavior avoids detection. It can include altering metadata, names, or the paths from which processes run. Windows processes have expected characteristics such as names, parent processes, and the paths from which they are expected to run. In this blog, we will look at process name masquerading, in which attackers give their payloads names similar to those of known, normal processes.

Adversaries like to take advantage of the fact that analysts may not always have the proper tools, data, or context to investigate threats thoroughly. A common technique is to slightly modify the name of a legitimate process and execute a payload under that name. A regularly abused process is the Windows Service Host (svchost.exe). When security analysts must dive into large corpora of security data, these names are needles in a haystack: they are easy to miss on visual inspection, so attackers can succeed at running malicious code on your machine.

Looking at the following list of words, it is not easy to spot the odd one out, and these are just 30 processes in total. Security analysts must go over large sets of data where they look not only at the names of these processes, but also at context: the parent processes that trigger them and the locations from which they run. At that scale it is very easy to gloss over a malicious process and mistake it for a good one. This is one of the main reasons why this kind of problem is well suited to statistics-based rules and associations.

In this blog, we will find these small differences using an edit distance between two strings. To find small differences between the good and the malicious processes, instead of finding two process names that are A% similar, we look for process names that are X deviations away from the original process name. Because we count deviations, the length of the process names does not distort the statistics or skew the results. We will use the Damerau-Levenshtein distance, which counts the character substitutions, transpositions, deletions, and insertions it takes to turn one string into another, to identify the potentially malicious processes. This function is taken from the 'fastDamerauLevenshtein' package.
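To make the idea of counting deviations concrete, here is a minimal pure-Python sketch of an edit distance that allows substitutions, insertions, deletions, and swaps of adjacent characters. It is a stand-in written for illustration, not the packaged function the notebook calls, and the name osa_distance is our own. It implements the restricted ("optimal string alignment") variant, which is an assumption on our part.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions, deletions and
    adjacent transpositions (optimal string alignment variant).
    Illustrative stand-in, not the notebook's packaged function."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the number of edits needed to turn a[:i] into b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Two adjacent characters swapped counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("svchost.exe", "scvhost.exe"))   # 1 (v and c swapped)
print(osa_distance("svchost.exe", "svch0st.exe"))   # 1 (o replaced by 0)
print(osa_distance("svchost.exe", "svchosts.exe"))  # 1 (extra s inserted)
```

Each of the three look-alike names is a single edit away from svchost.exe, which is exactly the kind of near miss that is hard to see by eye.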

Using the BYOML notebook to find masquerading processes

Customers can now use their Sentinel workspace to ingest large datasets and leverage ready-to-use BYOML notebooks to implement big data analytics, powerful statistical functionality, and ML techniques in a simple, straightforward way. Sentinel can integrate with and provision a variety of resources to make your development much easier. These include Azure ML (a cloud service to manage your ML project lifecycle), Azure Synapse (a powerful analytical service for working on large datasets), and a variety of data connectors and Log Analytics export mechanisms for ETL operations. Together they let you extract insights from big security data. Data scientists and security analysts can take advantage of this to work through large security datasets and perform their analysis quickly.

This notebook leverages Sentinel and its capabilities to apply statistics and big data technologies so that you can sift fairly quickly through large datasets in which the potentially malicious processes are a small fraction. It provides anomalous results along with context for further analysis. The notebook primarily uses SecurityEvent data with EventID=4688, which are process creation events from your Windows machines. It can be run via Azure Machine Learning Studio or directly using Azure Synapse. Details of setting up both environments and ingesting data are available inside the notebook as well as on the Microsoft tech community blog. To launch the notebook, find it in the Notebooks tab of Microsoft Sentinel, save it, and click launch notebook; the instructions in our documentation walk through these steps.

The notebook is also available in our Sentinel Notebooks GitHub repository and a demo of this notebook can be found below.

Some packages are required, and the notebook explains how to download and install them on your cluster. You can run the notebook through AzureML or, for more seasoned coders, directly in Azure Synapse. Most of the instructions are the same, but AzureML requires a bit more setup because it is the outside framework that needs to communicate with the Synapse clusters.

The data that will be ingested is process creation events from your Windows machines. Since you already have a Sentinel workspace, you will need to set up the Windows data connectors to capture the process creation events and send them to your Log Analytics workspace. From there, they need to be exported to a storage account that this notebook reads. This export is explained in the notebook and more details can be found at this link: LA export mechanism. The Configuration notebook explains in a simple manner how to configure the environments correctly. You will also have to create your Log Analytics workspace and a Key Vault to store all secrets, and then add that Key Vault as a linked service to your Azure Synapse workspace.

Algorithm Details

There are a couple of helper functions defined in the notebook. We add synthetically generated events for known and malicious processes so that you can see the types of events this notebook catches, along with examples of popular processes that can potentially be masqueraded. Common examples include svchost.exe, winlogon.exe, and services.exe.

Next, we ingest the data from your workspace, select the columns we are interested in, and add the synthetically generated events to it. This is our data corpus.
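The column selection can be pictured with a small sketch. It assumes each event is a plain dictionary carrying the SecurityEvent fields EventID, NewProcessName, and Computer. The function name and the output keys are hypothetical, and the notebook itself runs on Synapse and may structure this step differently.

```python
import ntpath  # Windows path handling that works on any platform


def to_process_records(events):
    """Keep process creation events (EventID 4688) and split each full
    process path into a name and a directory. Hypothetical helper."""
    records = []
    for event in events:
        if event.get("EventID") != 4688:
            continue  # not a process creation event
        full_path = event.get("NewProcessName") or ""
        if not full_path:
            continue  # nothing to compare without a process path
        directory, name = ntpath.split(full_path)
        records.append({
            "Computer": event.get("Computer"),
            # Windows file names are case-insensitive, so normalize case.
            "ProcessName": name.lower(),
            "ProcessPath": directory.lower(),
        })
    return records


# Made-up sample events for illustration only.
sample_events = [
    {"EventID": 4688, "Computer": "host1",
     "NewProcessName": r"C:\Windows\System32\svchost.exe"},
    {"EventID": 4624, "Computer": "host1", "NewProcessName": ""},
    {"EventID": 4688, "Computer": "host2",
     "NewProcessName": r"C:\Users\Public\scvhost.exe"},
]
for record in to_process_records(sample_events):
    print(record["ProcessName"], record["ProcessPath"])
```

Keeping the directory alongside the name matters later: the path is the context an analyst uses to judge whether a flagged name is really suspicious.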

The logic of the algorithm is to compare how frequently different processes occur, categorize them by frequency, and then judge how suspicious the rare ones are using the edit-distance library. Processes occurring more than the 'frequentThreshold' percentile of the time are considered normal, and those occurring less than the 'infrequentThreshold' percentile of the time are considered potentially malicious. You can customize these thresholds based on how your data looks and how you use it. Processes in the middle range are excluded from the analysis because they fall in a grey area: relatively popular, but below the first threshold.
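The frequency split can be sketched as follows. This is a minimal illustration rather than the notebook's code: it assumes the percentiles are taken over per-process event counts using a nearest-rank rule, and it treats the cut-offs as inclusive. The function names are hypothetical, and the two parameters merely mirror 'frequentThreshold' and 'infrequentThreshold'.

```python
import math
from collections import Counter


def percentile_cutoff(sorted_counts, pct):
    """Nearest-rank percentile (pct in 0-100) of an ascending list."""
    rank = max(1, math.ceil(pct * len(sorted_counts) / 100))
    return sorted_counts[rank - 1]


def split_by_frequency(process_names, frequent_threshold, infrequent_threshold):
    """Return (frequent, infrequent) sets of process names.
    Names that fall between the two cut-offs are left out of both sets."""
    counts = Counter(process_names)
    if not counts:
        return set(), set()
    ordered = sorted(counts.values())
    high_cut = percentile_cutoff(ordered, frequent_threshold)
    low_cut = percentile_cutoff(ordered, infrequent_threshold)
    # Assumption: cut-offs are inclusive so that neither set ends up empty.
    frequent = {name for name, c in counts.items() if c >= high_cut}
    infrequent = {
        name for name, c in counts.items()
        if c <= low_cut and name not in frequent
    }
    return frequent, infrequent


# Made-up corpus: one process name per event.
corpus = (
    ["svchost.exe"] * 100 + ["explorer.exe"] * 80 + ["chrome.exe"] * 50
    + ["notepad.exe"] * 20 + ["scvhost.exe"]
)
frequent, infrequent = split_by_frequency(corpus, 80, 20)
print(sorted(frequent))    # ['explorer.exe', 'svchost.exe']
print(sorted(infrequent))  # ['scvhost.exe']
```

In this toy corpus, chrome.exe and notepad.exe land in the middle range and are dropped, which leaves a small set of rare names to compare against a small set of common ones.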

At the end, we compare the frequent and infrequent processes using the edit distance and cap the distance at a chosen value, so that only the closest and most useful potentially malicious process names are reported. As can be seen in the image below, the distance not only depends upon the number of characters which are different, but also the length of the strings.
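The comparison step can be sketched like this. It is an illustration under stated assumptions, not the notebook's implementation: the distance function is a pure-Python stand-in for the packaged one, included so the block runs on its own, and the function names and the default cap of 2 are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Stand-in edit distance with adjacent transpositions
    (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def find_lookalikes(frequent, infrequent, max_distance=2):
    """Pair each rare name with the common names it nearly matches.
    Returns (rare_name, common_name, distance) tuples, closest first."""
    matches = []
    for rare in sorted(infrequent):
        for common in sorted(frequent):
            dist = edit_distance(rare, common)
            # Distance 0 is the same name; above the cap is unrelated.
            if 0 < dist <= max_distance:
                matches.append((rare, common, dist))
    return sorted(matches, key=lambda match: match[2])


frequent = {"svchost.exe", "winlogon.exe", "services.exe"}
infrequent = {"scvhost.exe", "winlogan.exe", "notepad.exe"}
for rare, common, dist in find_lookalikes(frequent, infrequent):
    print(f"{rare} looks like {common} (distance {dist})")
```

A rare but unrelated name such as notepad.exe is far from every common name and is filtered out by the cap, while the two look-alikes are reported. Raising the cap catches more heavily altered names at the cost of more false positives.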

We then export the anomalous results along with the path information, for context, to your Log Analytics workspace for further analysis.

Summary

This is a very simple notebook with powerful capabilities to help data scientists and security analysts catch and combat one of the most common attack techniques: masqueraded process names. It also demonstrates the use of the Azure Synapse integration for Sentinel to facilitate highly scalable advanced analytics against large security log datasets directly from a Sentinel notebook.

If you are new to Sentinel Notebooks, here are some materials for further reading: