Senior Member

Calculating traveled distance from coordinates on works but aggregation causes abort

I have location data which is timestamped. I use the data to calculate the distance traveled.

First I have the function which is used for calculating the distance between coordinates:

```let getDistance = (lat1:double, lon1:double, lat2:double, lon2:double)
{
let R = 6371;
let a = sin(dLat/2) * sin(dLat/2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(dLon/2) * sin(dLon/2);
let c = 2 * atan2(sqrt(a), sqrt(1-a));
R * c
};```

Then the query:
```LocationData_CL
| where TimeGenerated > ago(60d)
| where AssetId_s == "test"
| order by TimeGenerated
| project TimeGenerated, distance = getDistance(Lat_d, Lon_d, prev(Lat_d), prev(Lon_d))```

This works fine and returns 390 results.

Now i try to aggregate daily distance:
`| summarize sum(distance) by bin(TimeGenerated, 1d)`
This fails:
`QUERY ABORTEDThe query was aborted by the server.Details: It looks like the accumulated data on the server exceeded the limit.Support id: e15414e1-8bab-47b9-aaab-6638d1e91c9f`
Is there a way I can make this query work?

2 Replies

Re: Calculating traveled distance from coordinates on works but aggregation causes abort

Hi,

This error is very rare and used only for queries that hog memory resources above very high threshold. There is a fraction of a percentage of those every day. There is usually a way to optimize them and make them run smoothly.
If you can contact me directly (meirm@microsoft), I would try to assist you.

Thanks,
Meir

Re: Calculating traveled distance from coordinates on works but aggregation causes abort

Looking at your query, there is one thing that is completely redundant and may cause it to blow out of memory: the "order" before the summarize. It doesn't make sense to order something before you summarize it.

Performance-wise, you kill the parallelism of the system as ordering means that the system have to unify the results from all nodes and then to summarize while regular summarize using bins and sum functions can be done almost completely in parallel. In highly distributed system like ours (we have some clusters of more than 100 nodes) this has huge effect. I'm not saying this is the case but it is an option.