SOLVED

Remove duplicates from query

Occasional Contributor

Hi, hope somebody can help me as I'm a bit stuck in my understanding of the query language.

So I'm trying to get some creation events for App Services, though there seems to be multiple entries for the same App. Therefore I'm trying to find a way to remove duplicates on a column but retain the rest of the columns in the output / or a defined set of columns. Though after dodging distinct on a specific column only this is retained in the output.

This is my query:

 

AzureActivity 
| where OperationName == 'Delete website' and ActivityStatus == 'Succeeded' and ResourceProvider == 'Azure Web Sites' 

Though this produces two entires for the same deletion, so I tired this:

 

AzureActivity 
| where OperationName == 'Delete website' and ActivityStatus == 'Succeeded' and ResourceProvider == 'Azure Web Sites' 
| distinct CorrelationId

Though this only leaves the CorrelationId in the output but I need the Resource, ResourceID,OperationName also to be shown in the output. Any tips on how to get the syntax correct?

 

Thanks

 

 

 

9 Replies

Hi There is some basics that you need to understand about Log Analytics. Data in Log Analytics is stored with different time stamp (TimeGenerated column). So basically when a solution sends new data it does not delete the old one it just writes a new record with new date. That way you can view historically the data. With that said for example if you want to get the latest record for something there is function for that arg_max(). You can read about it here:

 

https://docs.loganalytics.io/docs/Language-Reference/Aggregation-functions/arg_max()

 

I am not sure what you want to achieve but may be it is something like this:

 

AzureActivity 
| where OperationName == 'Delete website' and ActivityStatus == 'Succeeded' and ResourceProvider == 'Azure Web Sites' 
| summarize arg_max(TimeGenerated, *) by CorrelationId

 You might need to aggregated on another column or more than one column depending on your scenario. 

Thanks Stan, sorry for the bad description.

 

So my original query to get information about deleted App Services returns multiple events for the deletion of the same App. Though these have the same CorrelationId. Therefore I was trying with my limited knowledge of the query language to only return 1 record that has the same CorrelationId. Though still get all the columns the event contains. 

 

I'm trying to use this to drive a webhook and trigger a runbook to do some house cleaning on both creation and deletion of App Services. 

 

My thinking is that the best way to drive the runbook was to use the query language to filter out all the data that the runbook does not need to make its logic work. 

 

Hopefully this makes more sense?

best response confirmed by Morten Lerudjordet (Occasional Contributor)
Solution
So does arg_max() does the job? From my understanding is the thing that will work for you.

If one just ignores time at the moment this almost gets me there but it will only retain CorrelationId on the result output, but I need to get all the columns (or use project to choose the ones I need)

 

AzureActivity 
| where OperationName == 'Delete website' and ActivityStatus == 'Succeeded' and ResourceProvider == 'Azure Web Sites' 
| summarize by CorrelationId

Hmm the query you've posted last is different from the one I've posted. It does not contain arg_max(). arg_max() retains all columns.  If you do not want to use arg_max() when you get two results what is the difference between them? Which one of the two you want to be displayed only?

Think I maybe figured it out?

 

Something like this seems to work

AzureActivity 
| where OperationName == 'Delete website' and ActivityStatus == 'Succeeded' and ResourceProvider == 'Azure Web Sites' 
| project ResourceId, CorrelationId, TimeGenerated 
| summarize arg_max(TimeGenerated, *) by CorrelationId
Yes. Forgot the '*' in arg_max() sorry. It is best practice to project at the end after summarize

Seems I can also do this, so I dont need to do project at the end:

AzureActivity 
| where OperationName == 'Delete website' and ActivityStatus == 'Succeeded' and ResourceProvider == 'Azure Web Sites' 
| summarize arg_max(TimeGenerated, ResourceId, Resource, ResourceGroup, OperationName) by CorrelationId

Thanks for steering me in the right direction.

The beauty of the new syntax. There are many roads to Rome :) . I would assume the last query you've posted takes less time to execute compared to the previous ones. That is probably the only difference. No problem! Glad to help!