This article expands on the time series analysis example given in the "Machine learning powered detections with Kusto query language in Azure Sentinel" Azure blog post.
Scenario: identify user accounts authenticating from an unexpectedly large number of locations. The intuition is that these accounts may be of security interest, and potentially compromised.
The Kusto time series tutorial discusses how to investigate change patterns in data using the make-series operator and the series_fit_line function of the Kusto query language used in Azure Log Analytics. This post describes a possible application of those techniques in a security context.
Note that, for simplicity, we do not evaluate whether one sign-in location is reachable from another. That is clearly an important consideration, and Azure Active Directory runs sophisticated analysis to raise events and alerts for such impossible travel scenarios.
For the purposes of this example we restrict ourselves to the count of distinct locations and to hunting for ‘the most unusual’ sign-in activity – even if that is below the threshold that would result in an alert.
A typical organization may have many users and many applications using Azure Active Directory for authentication. Some applications (for example, Office 365 Exchange Online) may have many more authentications than others (say, Visual Studio) and thus dominate the data. Users may also have a different location profile depending on the application: high location variability may be expected for email access, but less so for development activity associated with Visual Studio authentications. For both reasons, it may be desirable to track location variability for every user/application combination and then investigate only the most unusual cases.
Analysis
The make-series operator and the series_fit_line function allow just that. Our starting point is the Azure Active Directory sign-in logs, stored in the SigninLogs table in Azure Log Analytics:
SigninLogs
| extend locationString= strcat(tostring(LocationDetails["countryOrRegion"]), "/", tostring(LocationDetails["state"]), "/", tostring(LocationDetails["city"]), ";")
| project TimeGenerated, AppDisplayName, UserPrincipalName, locationString
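To see what the extend step produces, the following minimal sketch runs the same expression over a small inline table instead of SigninLogs. The rows, the account name and the location values are made up for illustration; only the column names match those used in the query above.

```kusto
// Hypothetical sample rows standing in for SigninLogs (illustrative values only)
datatable(TimeGenerated:datetime, AppDisplayName:string, UserPrincipalName:string, LocationDetails:dynamic)
[
    datetime(2019-01-05 09:00), "Visual Studio", "alice@contoso.com", dynamic({"countryOrRegion":"US","state":"Washington","city":"Redmond"}),
    datetime(2019-01-05 17:30), "Visual Studio", "alice@contoso.com", dynamic({"countryOrRegion":"GB","state":"England","city":"London"})
]
// Same location string construction as in the query above
| extend locationString = strcat(tostring(LocationDetails["countryOrRegion"]), "/", tostring(LocationDetails["state"]), "/", tostring(LocationDetails["city"]), ";")
| project TimeGenerated, AppDisplayName, UserPrincipalName, locationString
```

The first sample row yields the string US/Washington/Redmond; so two sign-ins count as the same location only when country, state and city all match.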
The next step turns these rows into a daily time series of the distinct location count for each user/application pair:
<previous query text>
| make-series dLocationCount = dcount(locationString)
on TimeGenerated from datetime(01-01-2019) to datetime(01-31-2019) step 1d
by UserPrincipalName, AppDisplayName
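As a rough illustration of what make-series returns, the sketch below applies the same aggregation to a handful of made-up rows for a single user/application pair. The data, the shortened date range and the explicit default = 0 are assumptions chosen to keep the example small.

```kusto
// Hypothetical sign-in rows: one user/app pair, activity on two of three days (illustrative values only)
datatable(TimeGenerated:datetime, AppDisplayName:string, UserPrincipalName:string, locationString:string)
[
    datetime(2019-01-01 08:00), "Visual Studio", "alice@contoso.com", "US/Washington/Redmond;",
    datetime(2019-01-01 12:00), "Visual Studio", "alice@contoso.com", "US/Washington/Redmond;",
    datetime(2019-01-03 09:00), "Visual Studio", "alice@contoso.com", "US/Washington/Redmond;",
    datetime(2019-01-03 10:00), "Visual Studio", "alice@contoso.com", "GB/England/London;"
]
// One array element per day; days without sign-ins are filled with the default value
| make-series dLocationCount = dcount(locationString) default = 0
    on TimeGenerated from datetime(2019-01-01) to datetime(2019-01-04) step 1d
    by UserPrincipalName, AppDisplayName
```

With this input the single output row should carry dLocationCount = [1, 0, 2]: one location on 1 January, none on 2 January, two on 3 January. The upper bound given to make-series is exclusive, which is why a range ending on 4 January produces three daily values.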
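Each series vector in the result set represents the daily number of distinct locations for a given account/application pair. A best-fit line is then computed for each series, and the three series with the largest slopes are charted: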
<previous query text>
| extend (RSquare,Slope,Variance,RVariance,Interception,LineFit)=series_fit_line(dLocationCount)
// Chart the 3 most interesting lines
// 0 slope corresponds to completely stable over time
| top 3 by Slope desc
| render timechart
A completely stable profile over time – constant number of locations – will lead to a horizontal line – i.e. a slope of zero.
A rise in the number of sign-in locations over the period translates to a positive slope value. Of all the best-fit lines, each corresponding to a particular user/application combination, we can therefore pick those with the largest slope values.
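The relationship between the shape of a series and its slope can be checked on toy data. The sketch below feeds two made-up series directly to series_fit_line; the labels and values are purely illustrative and do not come from the sample test set.

```kusto
// Two made-up daily location-count series (illustrative values only)
datatable(label:string, dLocationCount:dynamic)
[
    "stable", dynamic([1, 1, 1, 1, 1, 1]),
    "rising", dynamic([0, 1, 0, 1, 6, 8])
]
// Fit a least-squares line to each series; Slope is the change per step of the series
| extend (RSquare, Slope, Variance, RVariance, Interception, LineFit) = series_fit_line(dLocationCount)
| project label, Slope
| order by Slope desc
```

The stable series should come out with a slope of 0 and the rising one with a clearly positive slope (a hand calculation of the least-squares fit over six equally spaced points gives 168 / 105 = 1.6), so it sorts first.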
The top slope values across all the best-fit lines in a sample test set were around 0.2 to 0.3.
The chart of location count over time for these users showed the typical pattern of 0 or 1 sign-in locations daily rising to 6-8 sign-in locations daily. Whether these locations are legitimate is the starting point for investigation.
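One possible way to begin that investigation, offered here as a sketch rather than as part of the analysis above, is to list the distinct locations seen each day for a single flagged account. The account and application names below are placeholders, and the date range simply mirrors the one used in the main query.

```kusto
// Placeholder values: substitute an account/application pair surfaced by the slope ranking
let suspectUser = "alice@contoso.com";
let suspectApp = "Visual Studio";
SigninLogs
| where TimeGenerated >= datetime(2019-01-01) and TimeGenerated < datetime(2019-01-31)
| where UserPrincipalName == suspectUser and AppDisplayName == suspectApp
| extend locationString = strcat(tostring(LocationDetails["countryOrRegion"]), "/", tostring(LocationDetails["state"]), "/", tostring(LocationDetails["city"]), ";")
// One row per day: number of sign-ins and the set of distinct locations seen
| summarize signIns = count(), locations = make_set(locationString) by bin(TimeGenerated, 1d)
| order by TimeGenerated asc
```

Reading the locations column day by day shows which places account for the jump in the distinct count.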
Tim Burrell, Microsoft Threat Intelligence Center
April 2019
Appendix
Final consolidated query described in the main text
SigninLogs
| extend locationString= strcat(tostring(LocationDetails["countryOrRegion"]), "/", tostring(LocationDetails["state"]), "/", tostring(LocationDetails["city"]), ";")
| project TimeGenerated, AppDisplayName, UserPrincipalName, locationString
// create time series
| make-series dLocationCount = dcount(locationString) on TimeGenerated from datetime(01-01-2019) to datetime(01-31-2019) step 1d
by UserPrincipalName, AppDisplayName
// Compute best fit line for each entry
| extend (RSquare,Slope,Variance,RVariance,Interception,LineFit)=series_fit_line(dLocationCount)
// Chart the 3 most interesting lines
// 0 slope corresponds to completely stable over time
| top 3 by Slope desc
| render timechart