Implementing and Optimizing Bounding Box Queries in Azure Data Explorer

Published May 25 2022 07:12 AM 939 Views
Microsoft


Overview

For our current project, we are capturing the location of vehicles over time into Azure Data Explorer (ADX). Our customer asked us to create a function that returns all vehicles within a given bounding box during a given time period. This is useful when they want to know when a vehicle returns to a building, a neighborhood, or a city.

In this article, I will show how this can be accomplished using built-in functions, the limitations of those functions, and ways to overcome those limitations.

The Data

Our application captures time series and geographic data. The AssetHistory table contains information about the location of vehicles over time. The structure of this table is:

| Column | Data Type | Description |
|---|---|---|
| timeStamp | datetime | The time of the event |
| vehicleId | string | Unique ID of the vehicle |
| lat | double | Latitude of the vehicle at that time |
| lon | double | Longitude of the vehicle at that time |

Here is a small sample of data in this table:

| timeStamp | vehicleId | lat | lon |
|---|---|---|---|
| 2022-01-01T00:00:00Z | VEH00001 | 39.5537192488917 | -94.68388959051 |
| 2022-01-01T00:00:00Z | VEH00002 | 39.4831286543648 | -94.99069681588 |
| 2022-01-01T00:01:00Z | VEH00001 | 39.5595318673797 | -94.695714322373 |
| 2022-01-01T00:01:00Z | VEH00002 | 39.4834279222404 | -94.9959276770247 |
| 2022-01-01T00:02:00Z | VEH00001 | 39.5452045858879 | -94.6948031060648 |
| 2022-01-01T00:02:00Z | VEH00002 | 39.4937282808629 | -94.9981501288595 |
| 2022-01-01T00:03:00Z | VEH00001 | 39.5500482516451 | -94.6822965012806 |
| 2022-01-01T00:03:00Z | VEH00002 | 39.5020828393819 | -94.9988039760402 |
| 2022-01-01T00:04:00Z | VEH00001 | 39.5583474964348 | -94.68176047996 |
| 2022-01-01T00:04:00Z | VEH00002 | 39.5021519675971 | -95.0002670662021 |
| 2022-01-01T00:05:00Z | VEH00001 | 39.5476946907459 | -94.6897078041742 |
| 2022-01-01T00:05:00Z | VEH00002 | 39.4830067290929 | -94.9939493859201 |
| 2022-01-01T00:06:00Z | VEH00001 | 39.5517730215806 | -94.6972633402465 |
| 2022-01-01T00:06:00Z | VEH00002 | 39.4892456462058 | -94.9906174555097 |

 

Imagine this dataset extending for many hours, days, or months as the vehicles move over time. For our project, we generated 200 million rows of sample data: dozens of vehicles reporting their positions every 60 seconds and drifting north and east over time.

Built-In Function

ADX provides the built-in geo_point_in_polygon function that does exactly what we need. This function accepts 3 arguments: the longitude and latitude of a given location, and a polygon object. It returns true if the location is within the bounds of the polygon; otherwise, it returns false.

However, this function does have limitations which I will describe after an example.

Using KQL, we can create a polygon with the following syntax:

let p1Lon = -93.36181640625;
let p1Lat = 39.838068180000015;
let p2Lon = -90.362548828125;
let p2Lat = 39.838068180000015;
let p3Lon = -90.362548828125;
let p3Lat = 39.972910551318989;
let p4Lon = -93.36181640625;
let p4Lat = 39.972910551318989;
let p1 = pack_array(p1Lon, p1Lat);
let p2 = pack_array(p2Lon, p2Lat);
let p3 = pack_array(p3Lon, p3Lat);
let p4 = pack_array(p4Lon, p4Lat);
let polygon = pack("type","Polygon","coordinates", pack_array(pack_array(p1, p2, p3, p4, p1)));
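Alternatively, the same polygon can be written directly as a GeoJSON dynamic literal, which is the form used later in this article when storing well-known polygons. This is equivalent to the pack-based construction above:

```kql
let polygon = dynamic({
    "type": "Polygon",
    "coordinates": [[
        [-93.36181640625, 39.838068180000015],
        [-90.362548828125, 39.838068180000015],
        [-90.362548828125, 39.972910551318989],
        [-93.36181640625, 39.972910551318989],
        [-93.36181640625, 39.838068180000015]
    ]]
});
```

Note that GeoJSON coordinates are ordered [longitude, latitude], and the first and last points of the ring must match to close the polygon.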

Then, keeping the let statements above in the same query, we can use geo_point_in_polygon to determine which rows are within this polygon, as shown below:

let startTime = datetime('2022-01-01T21:25:03Z');
let endTime = datetime('2022-01-01T21:33:06Z');
AssetHistory
| where timeStamp between (startTime .. endTime)
    and geo_point_in_polygon(lon, lat, polygon)

This query will return only those rows within the time range that are inside the polygon, as shown in the results below:

| timeStamp | vehicleId | lat | lon |
|---|---|---|---|
| 2022-01-01T21:26:00Z | VEH00001 | -93.1514131769570 | -93.151413176957 |
| 2022-01-01T21:26:00Z | VEH00002 | -92.9590414313576 | -92.9590414313576 |
| 2022-01-01T21:27:00Z | VEH00001 | -93.1505516077601 | -93.1505516077601 |
| 2022-01-01T21:27:00Z | VEH00002 | -92.9584569148338 | -92.9584569148338 |
| 2022-01-01T21:28:00Z | VEH00001 | -93.1503058581473 | -93.1503058581473 |
| 2022-01-01T21:28:00Z | VEH00002 | -92.9575508394116 | -92.9575508394116 |
| 2022-01-01T21:29:00Z | VEH00001 | -93.1495162302751 | -93.1495162302751 |
| 2022-01-01T21:29:00Z | VEH00002 | -92.9570576871067 | -92.9570576871067 |
| 2022-01-01T21:30:00Z | VEH00001 | -93.1493412918965 | -93.1493412918965 |
| 2022-01-01T21:30:00Z | VEH00002 | -92.9568601706845 | -92.9568601706845 |
| 2022-01-01T21:31:00Z | VEH00001 | -93.1491006937533 | -93.1491006937533 |
| 2022-01-01T21:31:00Z | VEH00002 | -92.9559658727279 | -92.9559658727279 |
| 2022-01-01T21:32:00Z | VEH00001 | -93.1489732277653 | -93.1489732277653 |
| 2022-01-01T21:32:00Z | VEH00002 | -92.9559072328652 | -92.9559072328652 |
| 2022-01-01T21:33:00Z | VEH00001 | -93.1480159553770 | -93.148015955377 |
| 2022-01-01T21:33:00Z | VEH00002 | -92.9551552238215 | -92.9551552238215 |

 

Issues with this approach

The simplicity of the geo_point_in_polygon function makes it appealing, but there is a drawback. If we filter on a wide date/time range, it is possible that only a small percentage of the rows in this range will fall within the polygon. This is problematic because, although the query will take advantage of indexes when filtering by time, it will perform a table scan after this time filter. This can be inefficient if nearly all the rows are outside the polygon.

We can address this by doing some calculations in advance. Two possible approaches are:

  • Calculate S2 cell values in advance
  • Calculate in advance into which polygon(s) each row falls

Optimize Using S2 Cells

S2 cells divide the surface of the Earth into cells of roughly equal size at a given level. Each S2 cell is a quadrilateral projected onto the spherical surface of the Earth. The size of each S2 cell depends on its level, which can range from 0 to 30. Cells at lower-numbered levels are much larger and divide the Earth into only a few cells, while cells at higher-numbered levels are tiny and divide the Earth into many cells.

Fig. 1 illustrates the Earth divided into S2 cells at levels 0-4.

 


Fig. 1

 

You can read more about S2 cells at s2geometry.io.

Kusto Query Language ("KQL") contains a number of functions for dealing with S2 cells. We can use these functions to optimize our queries. Two useful functions are geo_point_to_s2cell and geo_polygon_to_s2cells. The geo_point_to_s2cell function returns the unique token of the S2 cell containing a given longitude and latitude at a given S2 level. The geo_polygon_to_s2cells function returns an array of the minimal set of S2 cells that fully cover a polygon.
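As a quick illustration, the following query shows both functions side by side for the bounding-box polygon defined earlier. The level-8 covering and the sample point here are my own choices; the exact tokens returned depend on the coordinates and level:

```kql
let polygon = dynamic({"type": "Polygon","coordinates": [[[-93.36181640625,39.838068180000015],[-90.362548828125,39.838068180000015],[-90.362548828125,39.972910551318989],[-93.36181640625,39.972910551318989],[-93.36181640625,39.838068180000015]]]});
// cellToken: token of the level-15 S2 cell containing the point
// covering:  minimal array of level-8 S2 cells that covers the polygon
print cellToken = geo_point_to_s2cell(-93.0, 39.9, 15),
      covering  = geo_polygon_to_s2cells(polygon, 8)
```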

We can use these functions to optimize our query by pre-calculating the S2 cell for each row. Start by creating a materialized view that adds one or more S2 cell columns, as in the following example:

.create async materialized-view
with (backfill=true, docString='Telemetry data with S2 cell pre-calculated')
AssetHistoryView
on table AssetHistory {
    AssetHistory
    | extend s2_15 = geo_point_to_s2cell(lon, lat, 15)
    | summarize take_any(lon, lat, s2_15)
        by vehicleId, timeStamp
}

 

The materialized view is updated with every row inserted into AssetHistory, so it calculates the S2 value for each row. We can then use this view in place of the table and take advantage of the calculated S2, as in the following query.

let startTime = datetime('2022-01-01T21:25:03Z');
let endTime = datetime('2022-01-01T21:33:06Z');
let s2_15_Array = geo_polygon_to_s2cells(polygon, 15);
let devicesInS2Boxes =
    AssetHistoryView
    | where timeStamp between (startTime .. endTime)
        and s2_15 has_any (s2_15_Array);
devicesInS2Boxes
| where geo_point_in_polygon(lon, lat, polygon)
| order by vehicleId asc, timeStamp asc
| project vehicleId, timeStamp, lon, lat, s2_15

This query yields the same results as the earlier query that does not use S2 cells, but it is potentially faster. The inexpensive filter on the pre-calculated s2_15 column, matched against an array of S2 cells covering an area only slightly larger than the polygon, eliminates most rows before the more expensive geo_point_in_polygon filter runs.

Figures 2 - 5 illustrate this filtering.

Fig. 2 shows a sample polygon.

Fig. 3 shows a sample of vehicle data points over time.

In Fig. 4, we overlay the polygon with an array of S2 cells. These cells completely cover the polygon and span an area slightly larger than the polygon.

Fig. 5 shows the filtering of vehicle data points to only those within the S2 cells. When the query does a scan to filter on geo_point_in_polygon, it now has fewer rows to scan.


Fig. 2:  a sample polygon

 


Fig. 3: a sample of vehicle data points over time

 


Fig. 4: the polygon overlapped by S2 cells

 


Fig. 5: filter vehicle data points to only those within the S2 cells

 

We can get similar results with an update policy: whenever rows are ingested into AssetHistory, the policy populates the AssetHistoryWithS2 table with all the columns of AssetHistory, along with the calculated S2 column. The following code creates a function that calculates an S2 value for each row, and an update policy on AssetHistoryWithS2 that runs it:

.create-or-alter function with (docstring = "Calculate S2 Value for AssetHistory", skipvalidation = "false" )
calculateS2ForAssetHistory() {
    AssetHistory
        | project
            timeStamp, 
            vehicleId, 
            lat, 
            lon, 
            s2_15 = geo_point_to_s2cell(lon, lat, 15)
}

.alter table AssetHistoryWithS2
    policy update
    @'[{"IsEnabled": true, "Source": "AssetHistory", "Query": "calculateS2ForAssetHistory()", "IsTransactional": true}]'
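Note that the AssetHistoryWithS2 target table must exist before the update policy is applied, and its schema must match the output of calculateS2ForAssetHistory(). A sketch of the creation command (the column types follow the AssetHistory schema above; geo_point_to_s2cell returns a string token):

```kql
.create-merge table AssetHistoryWithS2 (
    timeStamp: datetime,
    vehicleId: string,
    lat: double,
    lon: double,
    s2_15: string)
```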

We can then query the AssetHistoryWithS2 table in the same way that we queried the AssetHistoryView materialized view above to speed up our geo queries.

This approach works well for optimizing the bounding box query, but it has some disadvantages. You need to decide in advance which S2 level to use. This decision depends on the size of your polygons, as you want the array of S2 cells to be only slightly larger than the polygon. You can find a list of the approximate size of each S2 level cell here. ADX also limits the size of dynamic values, and you may run into memory issues if your array of S2 cells becomes too large.
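One way to sanity-check a candidate level before committing to it is to count how many cells geo_polygon_to_s2cells produces for a representative polygon; if the count is very large, choose a lower level. A sketch, using the bounding-box polygon from earlier and levels I picked for illustration:

```kql
let polygon = dynamic({"type": "Polygon","coordinates": [[[-93.36181640625,39.838068180000015],[-90.362548828125,39.838068180000015],[-90.362548828125,39.972910551318989],[-93.36181640625,39.972910551318989],[-93.36181640625,39.838068180000015]]]});
// Count the covering cells at a few candidate levels
print cells_8  = array_length(geo_polygon_to_s2cells(polygon, 8)),
      cells_10 = array_length(geo_polygon_to_s2cells(polygon, 10)),
      cells_12 = array_length(geo_polygon_to_s2cells(polygon, 12))
```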

KQL also supports H3 geographic cell types and includes functions similar to those supported for S2. We chose to use S2 because the 4-sided cells are simpler and more intuitive to work with than H3's hexagons.
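For comparison, the H3 counterparts mirror the S2 functions, with H3 resolutions ranging from 0 to 15. A minimal sketch (the resolution here is my own choice):

```kql
// H3 analog of the S2 pre-calculation: token of the
// resolution-9 H3 cell containing a given point
print h3Token = geo_point_to_h3cell(-93.0, 39.9, 9)
```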

Optimize By Pre-Calculating for Well-Known Polygons

Another approach is to determine in advance into which polygon each row's location falls. This works well if we know the polygons we wish to use. For example, the client may repeatedly ask which vehicles are inside the same set of buildings or neighborhoods or cities.

To do this, we create a Polygons table that contains the coordinates of the polygons. This table consists of 2 columns:

| Column Name | Data Type | Description |
|---|---|---|
| description | string | A descriptive name for this polygon |
| polygon | dynamic | GeoJSON representation of the polygon |

 


The script to create this table is:

.create-merge table Polygons (
    description: string,
    polygon: dynamic)

Below is some sample code to populate the Polygons table with 3 polygons:

.set-or-append Polygons <|
print description = "Polygon1", polygon=dynamic({"type": "Polygon","coordinates": [[[-93.44558715820312,39.573939343591896],[-92.43759155273436,39.573939343591896],[-92.43759155273436,39.67019926771586],[-93.44558715820312,39.67019926771586],[-93.44558715820312,39.573939343591896]]]})

.set-or-append Polygons <|
print description = "Polygon2", polygon=dynamic({"type": "Polygon","coordinates": [[[-93.36181640625,39.838068180000015],[-90.362548828125,39.838068180000015],[-90.362548828125,39.97291055131899],[-93.36181640625,39.97291055131899],[-93.36181640625,39.838068180000015]]]})

.set-or-append Polygons <|
print description = "Polygon3", polygon=dynamic({"type": "Polygon","coordinates": [[[-93.2684326171875,40.195659093364654],[-90.06591796875,40.195659093364654],[-90.06591796875,40.3549167507906],[-93.2684326171875,40.3549167507906],[-93.2684326171875,40.195659093364654]]]})

The materialized view performs a Cartesian join between the AssetHistory table and the Polygons table, then calculates into which polygon(s) each row falls. The KQL command to create this view is:

.create async materialized-view
with (backfill=true, docString="Populate polygons", dimensionTables=dynamic(["Polygons"]))
AssetHistoryView on table AssetHistory
{
    AssetHistory
    | extend joinColumn = 1
    | join kind=leftouter (Polygons | extend joinColumn = 1) on joinColumn
    | extend name = iff(geo_point_in_polygon(lon, lat, polygon), todynamic(description), dynamic(null))
    | summarize polygonNames = make_list(name)
        by timeStamp, vehicleId, lat, lon
}

The tables are joined on the joinColumn column, which is set to the constant 1 for every row in both tables. The result is that every row in Polygons is joined to every row in AssetHistory. After this Cartesian join, the KQL make_list function collapses the matching polygon names for each reading into a single row, whose polygonNames column contains an array of the polygons that contain that point.

All other columns in AssetHistory are projected into this view.

In queries and functions, you can replace the AssetHistory table with the AssetHistoryView materialized view. This view provides all the data of AssetHistory, along with the extra polygonNames column, allowing you to filter on this without the overhead of calling the geo_point_in_polygon function at query time.

This will allow you to quickly determine which records are within one of the polygons listed in the Polygons table. We can now determine whether each point is in the polygon by simply filtering on the polygonNames column, as in the following query:

let startTime = datetime('2022-01-01T03:00:00Z');
let endTime = datetime('2022-01-01T03:15:00Z');
AssetHistoryView
| where timeStamp between (startTime .. endTime)
    and polygonNames contains "Polygon1"

Below is a sample result set from the view.

| timeStamp | vehicleId | lat | lon | polygonNames |
|---|---|---|---|---|
| 2022-01-01T03:05:00Z | VEH00002 | 39.5748825220936 | -93.1714339269039 | ["Polygon1"] |
| 2022-01-01T03:06:00Z | VEH00002 | 39.5749022197192 | -93.1710063886960 | ["Polygon1"] |
| 2022-01-01T03:07:00Z | VEH00002 | 39.5753261310915 | -93.1703148535774 | ["Polygon1"] |
| 2022-01-01T03:08:00Z | VEH00002 | 39.5762358712237 | -93.1697172500363 | ["Polygon1"] |
| 2022-01-01T03:09:00Z | VEH00002 | 39.5762615376871 | -93.1695690820771 | ["Polygon1"] |
| 2022-01-01T03:10:00Z | VEH00002 | 39.5763826051913 | -93.1692058522320 | ["Polygon1"] |
| 2022-01-01T03:11:00Z | VEH00002 | 39.5771514635425 | -93.1690354942550 | ["Polygon1"] |
| 2022-01-01T03:12:00Z | VEH00002 | 39.5773253635722 | -93.1686505543725 | ["Polygon1"] |
| 2022-01-01T03:13:00Z | VEH00002 | 39.5774519453794 | -93.1681419560536 | ["Polygon1"] |
| 2022-01-01T03:14:00Z | VEH00002 | 39.5777309013798 | -93.1678841098861 | ["Polygon1"] |
| 2022-01-01T03:15:00Z | VEH00002 | 39.5778668180356 | -93.1676078073466 | ["Polygon1"] |

 

This approach works well if we know in advance the polygons in which we are interested. It will not work if you want to allow your users to query on ad hoc polygon boundaries.
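Putting the pieces together, the lookup our customer asked for can be wrapped in a stored function over the materialized view. This is a sketch; the function name and parameters are my own:

```kql
.create-or-alter function with (docstring = "Vehicles inside a named polygon during a time window")
VehiclesInPolygon(polygonName: string, startTime: datetime, endTime: datetime) {
    AssetHistoryView
    | where timeStamp between (startTime .. endTime)
        and polygonNames contains polygonName
    | project timeStamp, vehicleId, lat, lon
}
```

A caller can then run, for example, VehiclesInPolygon("Polygon1", datetime('2022-01-01T03:00:00Z'), datetime('2022-01-01T03:15:00Z')) without knowing anything about the underlying pre-calculation.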

Conclusions

The Geo functions in KQL are very powerful, and you can use them to perform bounding box queries. If you find that the queries are running slowly, this article describes some pre-calculations you can do to increase performance.

If you want to learn more about this topic, here are some good resources:

To learn more about S2 geometry, visit s2geometry.io.

If you are new to ADX, I recommend starting here to learn about ADX and here to learn about KQL.

Detailed documentation is available for each of the KQL functions mentioned in this article, including geo_point_in_polygon, geo_point_to_s2cell, and geo_polygon_to_s2cells.

Also, I published a collection of articles and videos to help you get up and running quickly using ADX and KQL.

 

 

Last update: May 25 2022 08:30 AM