Microsoft Sentinel Blog


Hunting for Low and Slow Password Sprays Using Machine Learning

AmritpalSingh

Microsoft

Aug 11, 2022

AmritpalSingh & TomMcElroy (Microsoft Threat Intelligence Center)

With special thanks to Chi_Nguyen.

Microsoft’s threat intelligence teams are observing increasing use of password sprays as an attack vector. As sign-in protections have improved, the “low and slow” variant has become more common; in many instances, performing a password spray attack very slowly is necessary to avoid account lockout or detection. Tools to perform low and slow sprays are also more readily available in open source, and most of them can be configured to use free or paid proxy services, further amplifying the issue.

We have just released a new guided hunting notebook for Microsoft Sentinel which leverages machine learning to tackle the difficult problem of detecting low and slow password spray campaigns. (This augments the broader-scoped password spray detection already provided via Microsoft’s Azure AD Identity Protection integration for Sentinel; see Advancing Password Spray Attack Detection - Microsoft Tech Community.)

Using the Azure Synapse integration for Sentinel, we can run this notebook on auto-scaling Apache Spark pools, enabling security ML and analytics at a scale that would otherwise be infeasible.

Low and Slow Password Sprays

Low and slow sprays are a variant of traditional password spray attacks that is increasingly used by sophisticated adversaries such as NOBELIUM, STRONTIUM and HOLMIUM. These adversaries can randomize client fields between sign-in attempts, including IP addresses, user agents and client applications. Some adversaries are willing to let password spray campaigns run at a very low frequency over months or years, which makes detection challenging. Several open-source tools exist to automate low and slow password spray attacks, and these often support randomizing request parameters and the time between requests.

Difficulties in Detecting Low and Slow Sprays

Whilst many existing detections may catch individual malicious sign-in attempts from a low and slow password spray, it remains a challenge to correlate and cluster these events into spray campaigns.

Traditionally, when tracking identity attack campaigns, analysts seek to pivot and cluster on any client request property that may be associated with the spray campaign. For example, a password spray may use IP addresses in a specific region, or a fixed user agent that is uncommon within the environment. Generally, the properties used to cluster attempts are client-controlled, so to evade this clustering, attackers rotate these properties per request. For the purposes of this blog, this technique will be called “client field randomization”.

When attempting to detect campaigns that implement client field randomization, time-series anomaly detection can be used. This approach spots spikes in failed or anomalous sign-in attempts, either across the tenant or per user, within a given period or at a specified frequency.

The low and slow approach employed by sophisticated threat groups blends client field randomization with long periods of dormancy and request jitter. This aims to evade time-series detections by keeping the frequency of sign-in attempts very low and the time between requests pseudo-random. By combining time randomization, client field randomization and long periods of dormancy, the attacker can blend their campaign into the noise.
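To make the evasion concrete, the sketch below uses a deliberately crude whole-series z-score on daily failed sign-in counts as a stand-in for real time-series anomaly detection. The function name, the threshold of 3 and the synthetic counts are all hypothetical. A burst of 50 extra failures on one day stands out clearly, while a similar number of attempts (45) spread across three months never rises above the everyday noise.

```python
from statistics import mean, pstdev


def spike_days(daily_counts, z_threshold=3.0):
    """Return indices of days whose count is more than z_threshold standard
    deviations above the mean of the whole series (a crude stand-in for
    time-series anomaly detection)."""
    mu = mean(daily_counts)
    sigma = pstdev(daily_counts)
    if sigma == 0:
        return []
    return [day for day, count in enumerate(daily_counts)
            if (count - mu) / sigma > z_threshold]


# Hypothetical background of failed sign-ins (legitimate user error) over 90 days.
background = [10, 12, 9, 11, 10, 8, 12, 10, 9, 11] * 9

# Noisy spray: 50 extra failures on a single day.
burst = list(background)
burst[45] += 50

# Low and slow spray: one extra failure every other day (45 in total).
low_and_slow = [count + (1 if day % 2 == 0 else 0)
                for day, count in enumerate(background)]

print(spike_days(burst))
print(spike_days(low_and_slow))
```

The burst is flagged on day 45, while the low and slow series produces no spike at all, even though it contains almost as many malicious attempts.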

High-Level Approach

To hunt for potential low and slow password spray activity, we exploit some observations made by Microsoft Threat Intelligence Centre (MSTIC) analysts:

Within a single campaign, threat actors often randomize the same large set of properties simultaneously, resulting in a group of sign-ins that occur periodically over a long period of time with the same set of anomalous properties.
Randomized values for complex parameters such as user agent or IP address are often not truly random; rather, they are selected randomly from a fixed set.
Although threat actors usually add some random noise to the schedule of password spray sign-in attempts, when viewed as a whole, the time series of attempts usually still shows a distinctive uniformity.

With these observations in mind, we have developed the following broad approach to hunting low and slow sprays:

Detect anomalous fields for each failed sign-in using successful sign-ins as a baseline
Cluster failed sign-ins by the columns which were randomized/anomalous
Prune the clusters from the previous step based on knowledge of what a low and slow spray looks like; for example, by removing clusters in which sign-ins do not occur at a steady frequency over an extended period
Further analyze the candidate password spray clusters (using threat intelligence enrichments from MSTICPy, for example), to find any invariant properties within the clusters
Identify any successful sign-ins that follow the patterns observed for each cluster from the previous step and create Sentinel incidents as appropriate

Running the Sentinel ML Notebook

The guided hunting notebook can be cloned from the “Templates” tab in the Notebooks blade in Sentinel.

Load Data

The template ML notebook feeds off Sentinel data exported to Azure storage, but can be easily switched to target your own data wherever it is stored by using one of Azure Synapse’s 100+ supported sources (or by importing your sign-in data into Azure storage directly).

Azure Synapse Analytics UI

We have previously blogged about how to export both historical and ongoing Sentinel log data to Azure Data Lake Storage (ADLS) to enable data science on Azure Synapse Spark pools from within a Sentinel notebook: see Export Historical Data from Log Analytics (microsoft.com).

It is worth noting that hunting for low and slow password sprays naturally requires lots of historical log data (typically going back at least several months); this means we could easily be reading in, and performing ML/analytics on, over 100,000 log data files. Depending on your storage and compute configuration, this could be very time-consuming and expensive! However, by leveraging the ability of our Synapse Spark pool to scale horizontally on demand, this operation can be massively parallelized. The resulting reduction in time-to-results puts the focus back on the actual data science and security analytics, with the Azure ecosystem acting as a single pane of glass for SIEM, data ETL, big data analytics and ML.

Spark auto-scaling limits can be configured directly from your Sentinel notebook!

Select Features to Use

We start by carefully selecting columns from the SigninLogs that we will use as input features to the ML algorithm. We focus on columns that fall into at least one of the following categories:

Features that may be randomized by an attacker (e.g., IP addresses, location details, user agent-derived fields)
Features that an attacker may control, but whose distribution of “normal” values is hidden from the attacker, making plausible values harder to guess (e.g., operating system, browser, city)

We can also limit our data to successful sign-ins and failed sign-ins with result types 50055 (InvalidPasswordExpiredPassword) or 50126 (InvalidUserNameOrPassword) as these are the ones commonly observed in password sprays. (See Azure AD authentication & authorization error codes | Microsoft Docs for full details of the ResultType field.)
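As a rough illustration of this selection step, the sketch below keeps a handful of feature columns and splits records into successful sign-ins (ResultType "0") and failures with result type 50055 or 50126, dropping everything else. It assumes the records are plain dictionaries whose nested SigninLogs fields have already been flattened: in SigninLogs, operating system and browser live inside DeviceDetail and city inside LocationDetails, so the flat names below (and the function name) are hypothetical.

```python
# Failure codes highlighted above as common in password sprays:
# 50055 (InvalidPasswordExpiredPassword), 50126 (InvalidUserNameOrPassword).
SPRAY_FAILURE_CODES = {"50055", "50126"}

# Hypothetical flattened feature names (assumption: nested fields pre-flattened).
FEATURES = ["IPAddress", "Country", "City", "UserAgent", "OperatingSystem", "Browser"]


def split_signins(records):
    """Keep only the chosen features and split records into successful
    sign-ins and failed sign-ins with spray-typical result codes."""
    successes, failures = [], []
    for record in records:
        code = str(record.get("ResultType"))
        row = {name: record.get(name) for name in FEATURES}
        if code == "0":
            successes.append(row)
        elif code in SPRAY_FAILURE_CODES:
            failures.append(row)
        # Any other result code is dropped.
    return successes, failures


example = [
    {"ResultType": "0", "IPAddress": "203.0.113.5", "Country": "GB", "UserAgent": "UA-1"},
    {"ResultType": "50126", "IPAddress": "198.51.100.7", "Country": "NL", "UserAgent": "UA-2"},
    {"ResultType": "50076", "IPAddress": "203.0.113.9", "Country": "GB", "UserAgent": "UA-1"},
]
ok, failed = split_signins(example)
print(len(ok), len(failed))
```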

ML Deep-Dive

Identify Anomalous Features per Sign-in Attempt

We model each of our features as a categorical random variable whose categories are the set of unique values observed across all sign-in attempts (both successful and failed). We then use Bayesian parameter estimation on the set of successful sign-ins to learn the distributions for “good” (i.e. non-malicious) sign-in attempts. (Since we obviously don’t have perfect “good”/“malicious” labels for all sign-ins, we use successful sign-ins as a proxy for “good” sign-ins.)

Specifically, if $C_i$ is the random variable representing the $i$-th feature, then

$$C_i \mid \mathbf{p}_i \sim \mathrm{Categorical}(\mathbf{p}_i),$$

with the symmetric Dirichlet prior

$$\mathbf{p}_i \sim \mathrm{Dirichlet}(\alpha, \ldots, \alpha).$$

Estimating $\mathbf{p}_i$ using the posterior mean gives

$$\hat{p}_{i,c} = \frac{n_{i,c} + \alpha}{N + K_i \alpha},$$

where $\hat{p}_{i,c}$ denotes our estimate for the probability that the $i$-th feature of a legitimate sign-in attempt will take the value $c$, $n_{i,c}$ is the number of times that feature $i$ takes the value $c$ in the dataset of successful sign-ins, $N$ is the total number of successful sign-ins in the dataset, and $K_i$ is the number of available categories of feature $i$. By default, we set $\alpha = 1$, giving a uniform, “uninformative” Dirichlet prior for $\mathbf{p}_i$ (this is commonly known as Laplace smoothing).
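The posterior-mean estimate is simple to compute directly: for each value, (count in successful sign-ins + α) divided by (number of successful sign-ins + number of categories × α). The minimal sketch below does this for a single feature; the function name and toy country values are hypothetical. Note that the categories come from all sign-ins while the counts come only from successes, so a value seen only in failed sign-ins still receives a small, non-zero probability.

```python
from collections import Counter


def estimate_value_probabilities(successful_values, all_values, alpha=1.0):
    """Posterior-mean estimate of a categorical distribution under a
    symmetric Dirichlet(alpha) prior: (count + alpha) / (N + K * alpha).

    Categories are the unique values seen across all sign-ins; counts come
    only from successful sign-ins."""
    categories = set(all_values) | set(successful_values)
    counts = Counter(successful_values)  # missing values count as 0
    n = len(successful_values)
    k = len(categories)
    return {c: (counts[c] + alpha) / (n + k * alpha) for c in categories}


success_countries = ["GB", "GB", "GB", "US"]
failed_countries = ["RU", "GB"]
probs = estimate_value_probabilities(success_countries,
                                     success_countries + failed_countries)
print(probs)
```

With α = 1, N = 4 and three categories, GB, US and RU receive 4/7, 2/7 and 1/7 respectively.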

Setting an Anomaly Threshold on Probabilities

For a given sign-in, we can now determine whether each of its observed properties is anomalous by setting a threshold on the probability (i.e. if the estimated probability of a value is less than p_threshold, classify it as an anomaly).

In practice, we can't just set a static threshold on the estimated probabilities - what constitutes a good threshold will depend on the distribution of the observed values for that feature. We use an outlier detection technique to dynamically set appropriate thresholds per feature for a given dataset of sign-ins.

As an example, the plots below show geographical distributions of sign-in attempts in two different environments; the red circles/bars indicate countries which have been deemed anomalous. In environment 1, where there is one dominant country from which sign-ins are seen, a calculated threshold of 0.13 is appropriate; in environment 2, where sign-ins are more geographically distributed, a much lower threshold is appropriate (using the same threshold as in environment 1 would result in 75% of all sign-ins being classed as having an anomalous country of origin, introducing a lot of noise to our algorithm).

Figure: Sign-in country probabilities from two different environments overlaid on a map and plotted as bar charts. Circles/bars in red indicate countries from which sign-ins have been deemed anomalous (per environment).
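The post does not say which outlier-detection technique sets the per-feature thresholds, so the sketch below swaps in a simple, hypothetical heuristic purely for illustration: sort a feature's value probabilities, cut at the largest gap in log-probability, and only allow cuts where the flagged values together carry at most a chosen share of the probability mass (`max_flagged_mass`, also hypothetical). It assumes strictly positive probabilities, which Laplace smoothing guarantees. The two toy distributions mimic the environments described above: one dominated by a single country, one more evenly spread.

```python
import math


def gap_threshold(value_probs, max_flagged_mass=0.25):
    """Hypothetical per-feature threshold: cut the sorted probabilities at the
    largest gap in log-probability, considering only cuts where the values
    below the cut hold at most max_flagged_mass of the probability in total.
    Values with probability < threshold are treated as anomalous.
    Assumes all probabilities are strictly positive."""
    ps = sorted(value_probs.values(), reverse=True)
    best_gap, threshold = 0.0, 0.0  # threshold 0.0 means "flag nothing"
    for j in range(1, len(ps)):
        if sum(ps[j:]) > max_flagged_mass:
            continue  # cutting here would flag too much of the traffic
        gap = math.log(ps[j - 1]) - math.log(ps[j])
        if gap > best_gap:
            # Geometric midpoint of the probabilities either side of the gap.
            best_gap, threshold = gap, math.sqrt(ps[j - 1] * ps[j])
    return threshold


def flagged(value_probs):
    t = gap_threshold(value_probs)
    return sorted(c for c, p in value_probs.items() if p < t)


env_1 = {"GB": 0.85, "US": 0.05, "FR": 0.04, "DE": 0.03, "RU": 0.02, "CN": 0.01}
env_2 = {"GB": 0.35, "US": 0.25, "FR": 0.20, "DE": 0.12, "NL": 0.07, "RU": 0.01}
print(flagged(env_1), flagged(env_2))
```

In the first environment every country except the dominant one is flagged; in the second only the rare value is, because the mass cap rules out cuts that would flag most sign-ins.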

Cluster on Anomalous Features

From the previous step, we now have a dataset of failed sign-ins with a new set of binary features describing whether or not a given property of a sign-in attempt took an anomalous value.

We perform clustering on these failed sign-ins by using latent class analysis: we want to detect underlying classes in the data – for example, the class of failed sign-ins from legitimate user error or the class of sign-ins from a password spray campaign – which, though not directly observable, give rise to different distributions of sign-in features which we can observe.

As an example, suppose that every failed sign-in arises from one of the following: legitimate user error, “password spray campaign 1”, “brute force attempt 1” or “password spray campaign 2”. Whilst it is reasonable to expect that many failed sign-ins from each class will have some anomalous features, the patterns of simultaneously anomalous features may well provide a distinctive fingerprint for each class:

Figure: The black bars represent the distribution we can directly observe from the sign-in data; the challenge is to infer the presence of latent classes, represented here by the colored bars.

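Mathematically, we can write this as

$$P(\mathbf{x}) = \sum_{k=1}^{K} P(z = k)\, P(\mathbf{x} \mid z = k),$$

where $\mathbf{x}$ is the vector of binary anomaly indicators for a failed sign-in, $z$ is its latent class and $K$ is the number of classes.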

Of course, these latent (i.e. hidden) classes cannot be observed directly, but we can work backwards from patterns of associated anomalous features in the SigninLogs data to detect clusters of “similarly anomalous” sign-ins.

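This means using a Bernoulli mixture model to model our data and estimating its unknown parameters. For each failed sign-in $n$, with latent class $z_n \in \{1, \ldots, K\}$ and binary anomaly indicators $x_{n,d}$ for features $d = 1, \ldots, D$, the setup can be described as follows:

$$z_n \mid \boldsymbol{\pi} \sim \mathrm{Categorical}(\boldsymbol{\pi}), \qquad x_{n,d} \mid z_n = k \sim \mathrm{Bernoulli}(\theta_{k,d}),$$

with

$$\boldsymbol{\pi} \sim \mathrm{Dirichlet}(\alpha, \ldots, \alpha), \qquad \theta_{k,d} \sim \mathrm{Beta}(\gamma, \gamma)$$

(where $K$, $\alpha$ and $\gamma$ are parameters for the prior distributions that can be chosen. By default, we use $\alpha = 10^{-5}$ and $\gamma = 0.5$; the low $\alpha$ helps avoid learning spurious non-empty classes, so that we end up with up to $K$ clusters rather than exactly $K$.)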

We can represent this model using plate notation.

This Bayesian network can be represented in code using the bayespy Python package, which we use to perform variational Bayesian inference and estimate the parameters given above (by maximizing the evidence lower bound, or ELBO). When setting up the network, we typically set the number of classes, K, to the maximum number of classes we want to be able to discover; K=10 has worked well in our testing, but the best value will depend on your data.
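The notebook itself uses bayespy and variational Bayes. As a dependency-free illustration of the same kind of model, the sketch below fits a Bernoulli mixture with plain expectation-maximization (EM) instead. This is a swapped-in, maximum-likelihood technique: it has no Dirichlet or Beta priors, so it lacks the sparsity-inducing prior that lets the variational approach switch off unneeded classes. The function name, the simple deterministic initialization and the toy data are hypothetical.

```python
import math


def fit_bernoulli_mixture(X, K, n_iter=100, eps=1e-6):
    """Fit a K-component Bernoulli mixture to binary rows X with plain EM.

    Returns (weights, theta, resp): class weights, per-class probability that
    each feature is anomalous, and per-row class assignment probabilities."""
    D = len(X[0])
    # Deterministic initialisation: the first K distinct rows, softened
    # towards 0.5; any remaining classes start at 0.5 for every feature.
    distinct = []
    for row in X:
        if row not in distinct:
            distinct.append(row)
    theta = [[0.25 + 0.5 * v for v in distinct[k]] if k < len(distinct) else [0.5] * D
             for k in range(K)]
    weights = [1.0 / K] * K
    resp = []
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for row in X:
            logs = []
            for k in range(K):
                lp = math.log(max(weights[k], eps))
                for x, t in zip(row, theta[k]):
                    t = min(max(t, eps), 1 - eps)  # keep log() finite
                    lp += math.log(t) if x else math.log(1 - t)
                logs.append(lp)
            top = max(logs)
            w = [math.exp(lp - top) for lp in logs]
            total = sum(w)
            resp.append([v / total for v in w])
        # M-step: re-estimate class weights and per-class anomaly probabilities.
        for k in range(K):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(X)
            if nk > eps:
                theta[k] = [sum(r[k] * row[d] for r, row in zip(resp, X)) / nk
                            for d in range(D)]
    return weights, theta, resp


# Toy anomaly flags: mostly clean failures plus two differently anomalous groups.
X = [[0, 0, 0, 0]] * 30 + [[1, 1, 0, 0]] * 10 + [[0, 0, 1, 1]] * 10
weights, theta, resp = fit_bernoulli_mixture(X, K=3)
labels = [max(range(3), key=lambda k: r[k]) for r in resp]
print([round(w, 2) for w in weights], labels[0], labels[30], labels[40])
```

Here `theta` plays the role of the per-class anomaly probabilities shown in the Hinton diagrams, and each row of `resp` gives a failed sign-in's class assignment probabilities.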

We can examine the parameters learned by the algorithm by plotting Hinton diagrams. A plot of the class assignment probabilities makes it clear that four classes have been detected, one of which is dominant.

Figure: The areas of the squares are proportional to the cluster assignment probabilities

The per-class probabilities for each sign-in feature being anomalous can also be depicted by a Hinton diagram.

Figure: Columns represent clusters and rows represent features, so, for example, the large white square in the 9th column, 3rd row, indicates that failed sign-ins in cluster #9 are likely to have an anomalous value for "operatingSystem" (feature 3).

Here, we can see that the failed sign-ins from the dominant class only rarely have any anomalous features – we can surmise that this class likely corresponds to legitimate sign-in attempts. On the other hand, the other classes show interesting patterns in the features that are often anomalous.

Finally, we can also use our learned model to assign probabilities to each failed sign-in belonging to a given class.

Prune Clusters

The last step of the main detection algorithm is to prune the anomalous failed sign-ins within each cluster and then filter the list of clusters, leaving those that are more likely to correspond to low and slow password spray activity.

Minimum thresholds are enforced on the size of the cluster and the number of features being randomized or set to unusual values within a cluster.

Cluster summary after pruning, but before filtering
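The post does not give the actual pruning thresholds, so the sketch below is a hypothetical rule on mixture-model output, with made-up default values: keep sign-ins confidently assigned to a class (`min_prob`), then keep only classes with enough members (`min_size`) and enough features that are usually anomalous within the class (`min_features`, `feature_rate`).

```python
def prune_clusters(resp, X, min_prob=0.9, min_size=20, min_features=2, feature_rate=0.5):
    """Hypothetical pruning rule for mixture-model output.

    resp: per-sign-in class assignment probabilities (rows sum to 1).
    X: per-sign-in binary anomaly flags, in the same row order as resp.
    Returns {class index: [row indices of retained sign-ins]}."""
    kept = {}
    n_features = len(X[0])
    for k in range(len(resp[0])):
        # Prune: keep only sign-ins confidently assigned to class k.
        members = [n for n, r in enumerate(resp) if r[k] >= min_prob]
        if len(members) < min_size:
            continue  # cluster too small
        # Fraction of members with an anomalous value, per feature.
        rates = [sum(X[n][d] for n in members) / len(members) for d in range(n_features)]
        if sum(rate >= feature_rate for rate in rates) >= min_features:
            kept[k] = members
    return kept


X = [[1, 1, 0]] * 25 + [[0, 0, 0]] * 25 + [[1, 1, 1]] * 5
resp = [[0.95, 0.04, 0.01]] * 25 + [[0.02, 0.97, 0.01]] * 25 + [[0.0, 0.05, 0.95]] * 5
print(sorted(prune_clusters(resp, X)))
```

In this toy data, class 1 is dropped because its members are rarely anomalous, and class 2 because it is too small, leaving only class 0.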

We also expect low and slow password spray activity to take place over an extended period of time, with attempted sign-ins fairly uniformly distributed over this period. (Attacker tools now commonly add some random timing jitter to sign-in attempts to avoid detections that rely on very consistent periodic attempts, but this does not affect the uniformity of attempt frequency over longer periods of time.)

Plots: (1) Legitimate sign-ins over a period of several months; (2) Low and slow password spray activity over the same months; (3) Legitimate sign-in times of day, aggregated over several months of data; (4) Low and slow password spray activity times of day.

The plots above show that legitimate sign-ins follow distinctive daily and weekly working patterns. The failed sign-ins from the candidate password spray cluster, however, are noticeably more uniformly distributed. This observation lends credence to the hypotheses that these sign-in attempts represent malicious activity and that they are part of a single campaign.

We numerically express the extent to which sign-ins from a given cluster appear to be uniformly time-distributed by computing the Kolmogorov-Smirnov goodness-of-fit test statistic.

Clusters can then be discarded if they do not span a long enough time period or if the constituent sign-ins do not occur at a steady frequency (e.g., if there are several spikes of attempted sign-in activity, this is unlikely to be a low and slow spray, though the activity may be of interest in its own right).
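A minimal sketch of this filter, assuming timestamps in epoch seconds: compute the Kolmogorov-Smirnov statistic of a cluster's attempt times against a uniform distribution over its first-to-last span, and require a minimum span. Because the uniform bounds are estimated from the data itself, standard KS p-values do not strictly apply, so the statistic is used here only as a score. The function names and both thresholds (`min_span_days`, `max_ks`) are hypothetical.

```python
def ks_uniform_statistic(timestamps):
    """Kolmogorov-Smirnov statistic comparing the empirical distribution of
    timestamps with a uniform distribution over [first, last] attempt.
    Small values mean attempts are spread evenly over the period."""
    xs = sorted(timestamps)
    n = len(xs)
    lo, hi = xs[0], xs[-1]
    if hi == lo:
        return 1.0  # all attempts at one instant: treat as maximally non-uniform
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = (x - lo) / (hi - lo)  # uniform CDF at x
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def looks_low_and_slow(timestamps, min_span_days=30, max_ks=0.2):
    """Hypothetical filter: long enough span and near-uniform timing."""
    span_days = (max(timestamps) - min(timestamps)) / 86400
    return span_days >= min_span_days and ks_uniform_statistic(timestamps) <= max_ks


DAY = 86400
steady = [i * DAY // 2 for i in range(180)]  # one attempt every 12 hours for 90 days
bursty = [i * 60 for i in range(50)] + [80 * DAY + i * 60 for i in range(50)]  # two bursts
print(looks_low_and_slow(steady), looks_low_and_slow(bursty))
```

The two bursts span 80 days, so the span check alone would pass them; it is the KS statistic that rejects their uneven timing.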

Further Analysis and Visualizations

By this stage, we have managed to reduce a huge dataset of sign-ins down to a manageable set of candidate low and slow password spray campaigns.

Using the various built-in security analytics tools and visualization capabilities of Microsoft's MSTICPy Python package, we can better understand the nature of each candidate low and slow password spray cluster. (Refer to MSTICPy's data enrichment and visualization documentation for further details and example notebooks.)

This type of data exploration can be important for validating the outputs of any algorithm as well as understanding how to isolate, mitigate and block malicious sign-in attempts associated with particular clusters. For example, if sign-ins from a particular campaign use randomized IP addresses, but a static, anomalous user agent, the user agent could be used as the basis of remediation steps (see below).
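The real workflow uses MSTICPy enrichment and visualization; as a much simpler, plain-Python illustration of looking for invariant properties, the sketch below reports which fields take at most a few distinct values across a cluster's sign-ins. The function name, field names and toy data are hypothetical.

```python
def invariant_features(cluster_signins, features, max_distinct=1):
    """Return {feature: sorted distinct values} for features that take at most
    max_distinct values across a cluster's sign-ins; such near-constant fields
    are candidates for blocking or monitoring."""
    result = {}
    for name in features:
        values = {signin.get(name) for signin in cluster_signins}
        if len(values) <= max_distinct:
            result[name] = sorted(values, key=str)
    return result


# Toy cluster: rotating IPs and countries, but a single rare user agent.
cluster = [
    {"IPAddress": f"198.51.100.{i}", "UserAgent": "RareAgent/1.0", "Country": c}
    for i, c in enumerate(["NL", "DE", "FR", "NL", "SE"])
]
print(invariant_features(cluster, ["IPAddress", "UserAgent", "Country"]))
```

Raising `max_distinct` surfaces fields drawn from a small fixed set, matching the earlier observation that "randomized" values are often picked from a limited list.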

Identifying Potential Spray Successes

Once we have identified patterns in anomalous features to fingerprint potential low and slow password sprays, we can hunt for the successful sign-ins that exhibit the same pattern of anomalous features. This is an important step in understanding and addressing the full impact of any malicious activity in your environment.
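One simple way to express "the same pattern of anomalous features" is sketched below under hypothetical assumptions. A cluster's fingerprint is the set of features anomalous in at least `min_rate` of its failed sign-ins, and a successful sign-in matches if it is anomalous on every fingerprint feature. This rule-based matching is a swapped-in illustration, not necessarily how the notebook scores successes, and all names and data are made up.

```python
def cluster_fingerprint(failed_flags, min_rate=0.8):
    """Features anomalous in at least min_rate of a cluster's failed sign-ins.
    failed_flags: list of {feature: bool} anomaly flags."""
    features = failed_flags[0].keys()
    n = len(failed_flags)
    return {f for f in features if sum(row[f] for row in failed_flags) / n >= min_rate}


def matching_successes(success_flags, fingerprint):
    """Indices of successful sign-ins anomalous on every fingerprint feature."""
    if not fingerprint:
        return []
    return [i for i, row in enumerate(success_flags) if all(row[f] for f in fingerprint)]


spray_cluster = [
    {"Country": True, "UserAgent": True, "Browser": True, "City": False},
    {"Country": True, "UserAgent": True, "Browser": True, "City": True},
    {"Country": True, "UserAgent": True, "Browser": False, "City": False},
]
successes = [
    {"Country": True, "UserAgent": True, "Browser": False, "City": False},
    {"Country": False, "UserAgent": True, "Browser": False, "City": False},
]
fp = cluster_fingerprint(spray_cluster)
print(sorted(fp), matching_successes(successes, fp))
```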

Low and slow spray activity is sometimes due to an actor testing credentials stolen via infostealers, to verify that they work before selling them. This typically manifests as anomalous-looking successful sign-ins with no follow-on activity for some time; the main malicious activity on the account does not begin until a later date. This reinforces the importance of responding promptly to any apparent password-spray-like event, before stolen accounts can be used to further an attack on a target or as a staging point to target others (e.g., via phishing).

Next Steps

Customizable Sentinel Incidents

At the end of the notebook, “Potential Low & Slow Password Spray Activity” incidents are created in Sentinel, providing a summary of candidate low and slow password spray campaigns that have been detected.

Customizable incident created in Sentinel

Customizable summary data sent to Sentinel

How can you use this information to protect your organization?

Isolate and block malicious activity

Sign-ins from a particular campaign may, for example, use randomized IP addresses but a specific list of anomalous user agents. These user agents could be used as the basis of a custom Sentinel analytic rule to monitor for successful sign-ins. Alternatively, the user agents could be blocked using Conditional Access.
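As a starting point for such a rule, the hypothetical helper below builds a KQL query that lists successful sign-ins (ResultType "0") from a given list of user agents. It writes each user agent as a KQL verbatim string (`@"..."`), in which a double quote is escaped by doubling it. The function name, lookback window and projected columns are illustrative choices to adapt, not a published rule.

```python
def build_success_monitor_query(user_agents, lookback="1d"):
    """Build a KQL query listing successful sign-ins from the given user agents.
    Each user agent is emitted as a KQL verbatim string literal (@"..."),
    with embedded double quotes doubled."""
    if not user_agents:
        raise ValueError("need at least one user agent")
    literals = ", ".join('@"' + ua.replace('"', '""') + '"' for ua in user_agents)
    return "\n".join([
        "SigninLogs",
        f"| where TimeGenerated > ago({lookback})",
        '| where ResultType == "0"',
        f"| where UserAgent in ({literals})",
        "| project TimeGenerated, UserPrincipalName, IPAddress, UserAgent",
    ])


print(build_success_monitor_query(["RareAgent/1.0", 'Odd"Agent/2.0']))
```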

Configure additional monitoring

Once the user accounts being targeted are identified, watchlists can be created in Sentinel, identifying higher risk accounts for daily monitoring. Custom analytic rules could be created, making use of the watchlist to apply stricter account monitoring.

Gain further insights and issue guidance

Enumerating accounts potentially targeted by the password spray can give insight into the nature and scope of the attack. For example, if only administrative accounts are targeted, this suggests that the threat actor has done some reconnaissance prior to the attack. If a threat actor is trying seemingly random accounts, they may be attempting passwords based on a recent data leak which included employee email addresses.

Further insight into the campaign can allow advice to be more clearly targeted, including issuing reminders about password reuse, or targeted advice to administrators about ongoing threats from more motivated threat actors.

Correlate user accounts and security alerts

If a password spray campaign has been running for a long period, the targeted user accounts could be correlated with Sentinel security alerts to determine whether a successful compromise has taken place. Other Microsoft security products, such as Azure AD Identity Protection, may also raise alerts when an attack is successful (see our blog post, Advancing Password Spray Attack Detection).

Querying alerts generated by Azure AD Identity Protection in Microsoft Sentinel

Respond and Investigate

If you suspect a password spray campaign has been successful, Microsoft’s password spray investigation playbook can be used as a starting point for investigation. Further guidance on how to identify and address potential password sprays is available in previous Microsoft security blog posts.

Summary

This new Sentinel notebook implements a novel methodology using machine learning to surface potential low and slow password spray attacks. It also demonstrates the use of the Azure Synapse integration for Sentinel to facilitate highly scalable advanced analytics and ML against large log datasets directly from a Sentinel notebook.

Finally, we make use of Microsoft's MSTICPy library to quickly and easily enrich, contextualize and visualize data for the candidate password spray campaigns, allowing for deeper understanding of the ML outputs. This allows for easier validation and pruning of the clusters produced by the algorithm by security analysts before the results are used to create meaningful Sentinel incidents for further investigation and response.

Updated Aug 13, 2022

Version 2.0