ATP sensor consumes most server CPU (60%)


After installing the ATP sensor on a domain controller, we observed that more than 60% of the 16-core CPU is consumed by the microsoft.tri.sensor.exe component.

15 Replies

The sensor (if installed on a DC) makes sure that at least 15% of RAM and CPU are free at all times.

Otherwise, it will try to utilize any free resources to reduce data latency.
If the machine gets busier, the limits are adjusted and the sensor uses fewer resources.

(It auto-adjusts within ~10 seconds.)
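
To make the 15% rule above concrete, here is a minimal sketch of the arithmetic. It is an illustrative model only, not the sensor's actual resource manager: the function names and the assumption that the sensor may use everything not taken by other processes or by the reserve are hypothetical.

```python
RESERVED_FREE_PCT = 15.0  # free headroom the sensor is described as preserving on a DC


def sensor_cpu_budget(total_cpu_pct, sensor_cpu_pct, reserved_free_pct=RESERVED_FREE_PCT):
    """Return the CPU share (percent) the sensor could use while leaving the reserve free.

    Hypothetical model: whatever is not used by other processes and is not
    reserved as free headroom is available to the sensor.
    """
    if not 0.0 <= sensor_cpu_pct <= total_cpu_pct <= 100.0:
        raise ValueError("expected 0 <= sensor_cpu_pct <= total_cpu_pct <= 100")
    other_pct = total_cpu_pct - sensor_cpu_pct  # load from everything except the sensor
    return max(0.0, 100.0 - reserved_free_pct - other_pct)


def is_throttling_needed(total_cpu_pct, sensor_cpu_pct, reserved_free_pct=RESERVED_FREE_PCT):
    """True when the sensor's current usage exceeds its budget under this model."""
    budget = sensor_cpu_budget(total_cpu_pct, sensor_cpu_pct, reserved_free_pct)
    return sensor_cpu_pct > budget
```

Under this model, a sensor using 60% on a machine at 74% total CPU is still within its budget, while the same sensor on a machine at 95% total CPU would have to back off.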

 

Plus, if it's a new deployment, and this instance is a synchronizer candidate, it's expected to work harder during the first few hours until the initial AD sync completes.  

 

For a standalone sensor, it assumes it can use all the machine resources without limits.

What Eli described is documented here, under the "Resource limitations" bullet item. 

Dear All,

Thanks for your support. After updating the current version to 2.60.6070.18946, the CPU issue has been resolved.

@EliOfek We are facing very high CPU usage on one of our DCs. The sensor process seems to occupy most of the resources on this DC. Could you please share some insight on remediating this? Also, we are seeing too many 8.8.8.8 connections on this server, not sure if this is linked! Any leads would be appreciated as always!

@mesaqee, it seems that the total CPU on the machine is 74%, so technically speaking there is no issue, and the sensor is not even throttling at this point.

Such consumption might be expected for high traffic scenarios.

What did the sizing tool have to say about this machine?
What is the hardware spec? What are the busy packets/sec and max packets/sec?

 

The sensor itself won't initiate connections specifically to 8.8.8.8, but if you are running a DNS service on the machine that accepts connections from 8.8.8.8, then it is expected that the sensor will try to get back to this endpoint to resolve it. Most likely it's not related to the CPU usage.

Dear @EliOfek , 

 

The server is a VM; the complete hardware specs are attached. The busy packets/sec is 511 and the max packets/sec is 32,295.

 

Please see below the complete sizing tool output:

 

DC: XXXXX
Sensor Supported: Yes, but additional resources required: +1GB; +1 core
Failed Samples: 8
Max Packets/sec: 32,295
Avg Packets/sec: 105
Busy Packets/sec: 511
Busy Packets/sec Start Time: 19:51:52
Busy Packets/sec End Time: 20:21:50
Min Avail MB: 2,366
Avg Avail MB: 4,349
Busy Avail MB: 3,323
Busy RAM Start Time: 17:12:16
Busy RAM End Time: 17:42:14
Total MB: 8,191
Max % CPU Time: 100
Avg % CPU Time: 50
Busy % CPU Time: 98
Busy CPU Start Time: 2:17:12
Busy CPU End Time: 2:47:30
Logical processors: 2
Processor Groups: 1
Core Count: 2
VM Indicator: VMWare
AD Site: XXXXX
Time Zone Name: (UTC+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Is DST: (blank)
OS Caption: Microsoft Windows Server 2019 Standard
OS Build Number: 17763
OS Installation Type: Server
OS Server Levels: ServerCore; ServerCoreExtended; Server-Gui-Mgmt; Server-Gui-Shell
Was one core added as suggested?
While the busy packets are low, the max is pretty high...
Is the high CPU you noticed constant, or does it spike at certain hours?

No, the core hasn't been added, as this issue only started coming up last week. The spike has been there almost constantly. We are still monitoring it to evaluate whether this is an intermittent or a consistent issue. Do you have any other suggestions apart from adding the core?

Check the packets/sec on all the NICs, or re-run the sizing tool; maybe there was an increase in traffic load on this machine. But nothing seems wrong here, especially if the sizing tool asked for another core and it wasn't deployed.
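
As a rough illustration of the per-NIC check suggested above, the sketch below derives packets/sec from two snapshots of cumulative packet counters taken a known number of seconds apart. The function name and the NIC names are hypothetical, and how the counters are collected is left open (for example, Windows performance counters or a monitoring library); only the rate calculation is shown.

```python
from typing import Dict


def packets_per_sec(before: Dict[str, int], after: Dict[str, int],
                    interval_sec: float) -> Dict[str, float]:
    """Per-NIC packet rate from two snapshots of cumulative packet counters."""
    if interval_sec <= 0:
        raise ValueError("interval_sec must be positive")
    rates = {}
    for nic, end_count in after.items():
        start_count = before.get(nic)
        if start_count is None or end_count < start_count:
            # NIC appeared mid-interval or its counter was reset: skip rather than guess.
            continue
        rates[nic] = (end_count - start_count) / interval_sec
    return rates


# Example with made-up counter values sampled 10 seconds apart.
snapshot_a = {"Ethernet0": 1000, "Ethernet1": 500}
snapshot_b = {"Ethernet0": 6110, "Ethernet1": 500}
rates = packets_per_sec(snapshot_a, snapshot_b, 10.0)
```

Comparing these rates against the busy and max packets/sec reported by the sizing tool can show whether the traffic load has grown since the tool was last run.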

Dear @EliOfek,

 

The issue is still there even after increasing the server capacity.

We re-ran the sizing tool, and here are the results:

 

 

DC: XXXXX.local
Sensor Supported: Yes
Failed Samples: 0
Max Packets/sec: 2,264
Avg Packets/sec: 139
Busy Packets/sec: 181
Busy Packets/sec Start Time: 09:51:47
Busy Packets/sec End Time: 10:06:45
Min Avail MB: 4,623
Avg Avail MB: 5,193
Busy Avail MB: 4,725
Busy RAM Start Time: 04:03:52
Busy RAM End Time: 04:18:49
Total MB: 10,239
Max % CPU Time: 100
Avg % CPU Time: 11
Busy % CPU Time: 69
Busy CPU Start Time: 12:58:17
Busy CPU End Time: 13:13:20
Logical processors: 4
Processor Groups: 1
Core Count: 4
VM Indicator: VMWare
AD Site: HXXXX
Time Zone Name: (UTC+08:00) Beijing, Chongqing, Hong Kong, Urumqi
Is DST: (blank)
OS Caption: Microsoft Windows Server 2019 Standard
OS Build Number: 17763
OS Installation Type: Server
OS Server Levels: Remote Registry Query Failed

 

I would appreciate it if you could provide some more insight on this.

 

 

Thanks,

Saqib

What is the total CPU usage pattern after the additional core was added?

@EliOfek Well, after upgrading the capacity, the usage was pretty high and we were even getting the "Some network traffic could not be analyzed" alert. However, we are now unable to see the trend because the sensor service is not starting. When we try to start it manually, we get the error below:

 

[Attached screenshot of the error: mesaqee_0-1616578491368.png]

 

We have tried rebooting the server and re-installing the sensor, but the service is still not running. Shall we contact support to have a detailed look, or do you have any further suggestions?

 

Thanks,

Saqib

 

Check the local logs of the sensor and the updater to see if there is any clue as to why it fails to start.
Also make sure the WmiApSrv service is starting correctly.
If there are no clues, you will need a support ticket to go forward...
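
One way to script the WmiApSrv check mentioned above is to run `sc query` and read the STATE line. This is a minimal sketch that assumes an English-language Windows host, where `sc query` prints a line such as `STATE : 4  RUNNING`; the helper names are hypothetical, and this is not part of the sensor tooling.

```python
import re
import subprocess
from typing import Optional


def parse_sc_state(sc_output: str) -> Optional[str]:
    """Extract the state name (e.g. RUNNING, STOPPED) from `sc query` output."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def query_service_state(service_name: str = "WmiApSrv") -> Optional[str]:
    """Run `sc query <service>` (Windows only) and return the parsed state, if any."""
    result = subprocess.run(["sc", "query", service_name],
                            capture_output=True, text=True)
    return parse_sc_state(result.stdout)


# Sample text in the shape `sc query` prints, used here instead of a live call.
sample = (
    "SERVICE_NAME: WmiApSrv\n"
    "        TYPE               : 10  WIN32_OWN_PROCESS\n"
    "        STATE              : 4  RUNNING\n"
    "        WIN32_EXIT_CODE    : 0  (0x0)\n"
)
state = parse_sc_state(sample)
```

If the parsed state is missing or is not the expected one, the sensor and updater logs remain the next place to look.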
Has anyone else seen this issue? I've been experiencing high CPU usage/exhaustion due to tri.sensor and tri.sensor updater over the last week or so. I followed links in this thread but did not see anything that helps the issue.

@aafar Your description is too generic to spot the root cause.
This is something that would best be handled via a support ticket, as given the machine details we can track its telemetry remotely to better understand what is happening and why.

General note: assuming this is not a standalone sensor you are talking about (in which case it's normal for it to consume all available resources), the sensor is designed to throttle its CPU and memory consumption to make sure the server has at least 15% free CPU/memory at all times.
So while high CPU can be "by design" if traffic on the machine is high, exhaustion should not really happen unless something really bad happens and the resource manager fails to do its job, which is unlikely.