Anomaly detection - how to


Hi - I would like to detect anomalies across multiple fields that are not numeric (e.g. looking for unusual Azure AD sign-in events using source IP, app name, account name, client name). As far as I can tell from my reading, Sentinel/Kusto has time series analysis capabilities and can easily detect anomalies, but only on a single continuous numeric field.


What I'm looking for is a way to perform anomaly detection when the event data is categorical (IP addresses, account names), rather than numeric. Splunk has a really convenient "anomalydetection" function that takes a list of fields, then computes the probability of each combination of fields in the source data, and filters to only the most unlikely events. This is exactly what I am after, but can't figure out how to do it in Sentinel. Any pointers / guides?


3 Replies
Could you be more specific? What kind of anomaly do you want to detect exactly? An example would help. If you are trying to detect anomalies based on numbers, you can count by IP address and other fields, then run the anomaly detection on those counts. There are also some ML plugins you can use: evaluate basket() and evaluate autocluster() can be used to detect anomalies.
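
As a rough illustration of the plugin route, a minimal sketch might look like the one below. It assumes the standard SigninLogs table, and the 7-day window is picked only for the example. Keep in mind that autocluster() summarises the common patterns across categorical columns, so the sign-ins that match none of the returned segments are the candidates worth a closer look.

```kusto
// Sketch only: summarise the dominant sign-in patterns over categorical fields.
// The 7d window is an assumption chosen for illustration.
SigninLogs
| where TimeGenerated > ago(7d)
| extend countryOrRegion_ = tostring(LocationDetails.countryOrRegion)
| project UserPrincipalName, AppDisplayName, countryOrRegion_
| evaluate autocluster()
```

evaluate basket() accepts the same kind of input and returns frequent attribute combinations instead of a small set of segments.
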
Hi Cyb3rMonk,

I want to identify unusual sign-in activity in Azure AD logs so that it can be investigated as a potential account compromise.

As a really simple example, consider two event fields: (i) the UPN and (ii) the country from the location field. I consider an event unusual when a sign-in comes from a country that is not typical for that user. For example, I rarely travel and live in one isolated country, so my sign-ins each day always come from that one country. If a sign-in happens from a different country, that's an anomaly that needs to be investigated.

In practice, by considering the event fields UPN, AppDisplayName and the location (or even better, the IP ASN), a small number of unusual events can be identified. I typically use the same set of apps at work (corp network), on the bus (cell phone carrier) and at home (residential xDSL).

All of the examples that I've seen using Sentinel summarise events to a numeric series (e.g. number of locations that a user signed in from per day) and then look for outliers in the count. In practice this approach is fallible, because one of the locations in the count could be highly unusual while the count is still numerically normal. Our most important users do travel regularly, so their normal pattern of use is more complex than most people's, making count-based approaches less effective and more likely to miss something significant.

@mrboxx you can create a baseline and compare the last 1d of data against it using a join. There are several ways to accomplish this. Below is an example:

// Logic: build a baseline from the data between 15 days ago and 1 day ago,
//        then compare the last 1d of data with that baseline.
let startdate = 15d;
let enddate = 1d;
let baseline = materialize (
    SigninLogs
    | where TimeGenerated between (ago(startdate) .. ago(enddate))
    | where OperationName == "Sign-in activity"
    | extend countryOrRegion_ = tostring(LocationDetails.countryOrRegion)
    //| summarize Country_=make_set(countryOrRegion_) by Identity, bin(TimeGenerated, 1d)
    // One row per identity, country and day seen in the baseline window.
    | summarize max(TimeGenerated) by Identity, countryOrRegion_, bin(TimeGenerated, 1d)
);
// Every country each identity signed in from during the baseline window.
let countries_by_identity = baseline
| summarize previous_countries = make_set(countryOrRegion_) by Identity;
// Identities that already appear in the baseline.
let existing_users = baseline
| distinct Identity;
SigninLogs
| where TimeGenerated > ago(1d)
| where OperationName == "Sign-in activity"
| where Identity in~ (existing_users) // to remove the false positive where an identity is first seen.
| extend countryOrRegion_ = tostring(LocationDetails.countryOrRegion)
| summarize LastSigninActivity = max(TimeGenerated) by Identity, countryOrRegion_
// Keep only identity/country pairs that never occur in the baseline.
| join kind=leftanti baseline on Identity, countryOrRegion_
// Attach the countries previously seen for the identity, for context.
| join kind=inner countries_by_identity on Identity
| project-away Identity1
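
For the more general case from the original question (scoring combinations of categorical fields by how unlikely they are), a rough sketch along the same lines could look like the following. It is not a port of Splunk's anomalydetection command, just a simple per-identity frequency ratio. The 15d window and the rarity_threshold value are assumptions picked for illustration and would need tuning.

```kusto
// Sketch only: score each (Identity, AppDisplayName, country) combination by how
// often it occurs for that identity, then keep the rarest combinations.
let lookback = 15d;            // assumed window
let rarity_threshold = 0.02;   // hypothetical cut-off: under 2% of the identity's sign-ins
let signins = materialize (
    SigninLogs
    | where TimeGenerated > ago(lookback)
    | where OperationName == "Sign-in activity"
    | extend countryOrRegion_ = tostring(LocationDetails.countryOrRegion)
    | project TimeGenerated, Identity, AppDisplayName, countryOrRegion_
);
// Total sign-ins per identity, used as the denominator.
let totals = signins
| summarize TotalSignins = count() by Identity;
signins
| summarize ComboCount = count(), LastSeen = max(TimeGenerated)
    by Identity, AppDisplayName, countryOrRegion_
| join kind=inner totals on Identity
| extend ComboProbability = todouble(ComboCount) / TotalSignins
| where ComboProbability < rarity_threshold
| project-away Identity1
| order by ComboProbability asc
```

Because the ratio is computed per identity, a frequent traveller's usual countries stay above the threshold while a one-off combination still stands out. Identities with very few sign-ins in the window can never fall below the cut-off, so they need a separate check.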