Forum Discussion

majo01's avatar
majo01
Brass Contributor
Jul 20, 2020

KQL Performance Optimization

 

Hello folks,

I am building  query that basically does the following :

 

1-  Extend and Project fields from Table1, which contains syslogs

2- Summarize table fields mentioned in (1)

3- Join the summarized table with a static datatable (Table2)

 

The performance is poor, it frequently hits the 10 minutes limits. I did troubleshooting by isolation; decomposing the query into sub-queries to pinpoint the problematic part, and found it's the summarize line!

 

Acting on the same dataset: Extend/Project operations take few seconds, JOIN takes few seconds, Summarize hits the 10 mins limit

 

QUESTION1: What are the options to improve the performance of the Summarize operation? I tried hint.strategy=shuffle but it didn't change the situation.

 

QUESTION2: How to review/confirm the settings of Data Partitioning Policy and Shuffle strategy  in Sentinel ?

 

Tried both lines:

| summarize  count() by app;

| summarize hint.strategy=shuffle count() by app;
 

 

13 Replies

  • mergene's avatar
    mergene
    Brass Contributor
    Could you share the join part of the query?
    If your left table is bigger than the right table, it is recommended to use lookup operator for the performance. https://docs.microsoft.com/en-us/azure/data-explorer/kusto/query/lookupoperator

    Apart from that, what happens if you try the following?
    staticTable
    | join (
    syslogData | extend columns
    ) on common columns
    | summarize
  • majo01 

     

    I suspect you have read this: https://docs.microsoft.com/en-us/azure/data-explorer/kusto/query/best-practices

     

    Is the summarize that's having the issue within the Join or outside the join?  If the data is Syslog, I assume its the Syslog table and you have that on the right of the join - see link above  (the datatable being the smaller table on the left?).  Are you able to reduce the data from syslog, maybe use a bin if its across many hours or days?


    Can you offer a sample of the datatable and Syslog syntax?  

     

     

     

    • majo01's avatar
      majo01
      Brass Contributor

      CliveWatson 

      Yes, i read it.

      Summarize is outside the JOIN, and  i confirmed that it's the culprit, not the JOIN. Any ideas to improve the performance of JOIN, given that the dataset it's acting on it very huge (tens of millions of logs).

       

      //PART1: Logs preparation, note that syslog1 is a function created elsewhere:

      Syslog1

      | where TimeGenerated > startDatetime and TimeGenerated < endDatetime
      | where Activity == "TRAFFIC" or Activity == "THREAT"

      | project-rename signature = ThreatName
      | project-rename threat_category = ThreatCategory
      | project-rename category= ThreatType, severity = SeverityLevel, action = DeviceAction, content_version = ContentVersion

      | extend app = tostring(Application), day = dayofmonth(TimeGenerated), month = getmonth(TimeGenerated), year = getyear(TimeGenerated)
      | extend traffic = toint(Bytes)
      | extend parsed_src_IP = parse_ipv4(SourceIP)
      | extend src_zone = case(parsed_src_IP between ( parse_ipv4("192.168.0.0") .. parse_ipv4("192.168.255") ),"Guest",
      parsed_src_IP between ( parse_ipv4("127.0.0.0") .. parse_ipv4("127.0.0.1") ),"Corporate",
      parsed_src_IP between ( parse_ipv4("10.0.0.0") .. parse_ipv4("10.255.255.255") ),"Corporate",
      parsed_src_IP between ( parse_ipv4("172.16.0.0") .. parse_ipv4("172.31.255.255") ),"Corporate",
      //else
      "Public")
      | extend parsed_dst_IP = parse_ipv4(DestinationIP)
      | extend dst_zone = case(parsed_dst_IP between ( parse_ipv4("192.168.0.0") .. parse_ipv4("192.168.255") ),"Guest",
      parsed_dst_IP between ( parse_ipv4("127.0.0.0") .. parse_ipv4("127.0.0.1") ),"Corporate",
      parsed_dst_IP between ( parse_ipv4("10.0.0.0") .. parse_ipv4("10.255.255.255") ),"Corporate",
      parsed_dst_IP between ( parse_ipv4("172.16.0.0") .. parse_ipv4("172.31.255.255") ),"Corporate",
      //else
      "Public")

       

      // Problematic Part which is the following summarize statement

      | summarize sum(traffic), count(app) by app, src_zone, dst_zone

       

      // PART3:

      JOIN

      • Ofer_Shezaf's avatar
        Ofer_Shezaf
        Icon for Microsoft rankMicrosoft

        majo01 :

         

        I think there is little to do with "summarize" beyond shuffle. Is the performance prior to summarize reasonable? 

         

        One option is to filter before parsing. If it is easy to check Activity on raw syslog, it should accelerate things considerably. I woudl first check by including the parser in the query. If it works, the parser can be omdified to be a parameter function that can be applied after initial filtering.