Implementing and Optimizing Bounding Box Queries in Azure Data Explorer
For our current project, we are capturing into ADX the location of vehicles over time. Our customer asked us to create a function that would return all vehicles that are within a given bounding box in a given time period. This is useful information when they want to know when a vehicle returns to a building, a neighborhood, or a city.
In this article, I will show how this can be accomplished using built-in functions, the limitations of those functions, and ways to overcome those limitations.
Our application captures time series and geographic data. The AssetHistory table contains information about the location of vehicles over time. The structure of this table is:
| Column | Data Type | Description |
|---|---|---|
| timeStamp | datetime | The time of the event |
| vehicleId | string | Unique ID of the vehicle |
| lat | double | Location latitude of the vehicle at that time |
| lon | double | Location longitude of the vehicle at that time |
Here is a small sample of data in this table:
| timeStamp | vehicleId | lat | lon |
|---|---|---|---|
| 2022-01-01T00:00:00Z | VEH00001 | 39.5537192488917 | -94.68388959051 |
| 2022-01-01T00:00:00Z | VEH00002 | 39.4831286543648 | -94.99069681588 |
| 2022-01-01T00:01:00Z | VEH00001 | 39.5595318673797 | -94.695714322373 |
| 2022-01-01T00:01:00Z | VEH00002 | 39.4834279222404 | -94.9959276770247 |
| 2022-01-01T00:02:00Z | VEH00001 | 39.5452045858879 | -94.6948031060648 |
| 2022-01-01T00:02:00Z | VEH00002 | 39.4937282808629 | -94.9981501288595 |
| 2022-01-01T00:03:00Z | VEH00001 | 39.5500482516451 | -94.6822965012806 |
| 2022-01-01T00:03:00Z | VEH00002 | 39.5020828393819 | -94.9988039760402 |
| 2022-01-01T00:04:00Z | VEH00001 | 39.5583474964348 | -94.68176047996 |
| 2022-01-01T00:04:00Z | VEH00002 | 39.5021519675971 | -95.0002670662021 |
| 2022-01-01T00:05:00Z | VEH00001 | 39.5476946907459 | -94.6897078041742 |
| 2022-01-01T00:05:00Z | VEH00002 | 39.4830067290929 | -94.9939493859201 |
| 2022-01-01T00:06:00Z | VEH00001 | 39.5517730215806 | -94.6972633402465 |
| 2022-01-01T00:06:00Z | VEH00002 | 39.4892456462058 | -94.9906174555097 |
Imagine this dataset extending for many hours, days, or months as the vehicles move over time. For our project, we generated 200 million rows of sample data: dozens of vehicles reporting their location every 60 seconds as they moved north and east over time.
ADX provides the built-in geo_point_in_polygon function that does exactly what we need. This function accepts 3 arguments: the longitude and latitude of a given location and a polygon object. It returns true if that location is within the bounds of the polygon; otherwise, it returns false.
However, this function does have limitations which I will describe after an example.
Using KQL, we can create a polygon with the following syntax:
let p1Lon = -93.36181640625;
let p1Lat = 39.838068180000015;
let p2Lon = -90.362548828125;
let p2Lat = 39.838068180000015;
let p3Lon = -90.362548828125;
let p3Lat = 39.972910551318989;
let p4Lon = -93.36181640625;
let p4Lat = 39.972910551318989;
let p1 = pack_array(p1Lon, p1Lat);
let p2 = pack_array(p2Lon, p2Lat);
let p3 = pack_array(p3Lon, p3Lat);
let p4 = pack_array(p4Lon, p4Lat);
let polygon = pack("type", "Polygon", "coordinates", pack_array(pack_array(p1, p2, p3, p4, p1)));
Then, we can write a query using geo_point_in_polygon to determine which rows are within this polygon, as shown below:
let startTime = datetime('2022-01-01T21:25:03Z');
let endTime = datetime('2022-01-01T21:33:06Z');
AssetHistory
| where timeStamp between (startTime .. endTime)
    and geo_point_in_polygon(lon, lat, polygon)
This query will return only those rows within the time range that are inside the polygon, as shown in the results below:
| timeStamp | vehicleId | lat | lon |
|---|---|---|---|
| 2022-01-01T21:26:00Z | VEH00001 | -93.1514131769570 | -93.151413176957 |
| 2022-01-01T21:26:00Z | VEH00002 | -92.9590414313576 | -92.9590414313576 |
| 2022-01-01T21:27:00Z | VEH00001 | -93.1505516077601 | -93.1505516077601 |
| 2022-01-01T21:27:00Z | VEH00002 | -92.9584569148338 | -92.9584569148338 |
| 2022-01-01T21:28:00Z | VEH00001 | -93.1503058581473 | -93.1503058581473 |
| 2022-01-01T21:28:00Z | VEH00002 | -92.9575508394116 | -92.9575508394116 |
| 2022-01-01T21:29:00Z | VEH00001 | -93.1495162302751 | -93.1495162302751 |
| 2022-01-01T21:29:00Z | VEH00002 | -92.9570576871067 | -92.9570576871067 |
| 2022-01-01T21:30:00Z | VEH00001 | -93.1493412918965 | -93.1493412918965 |
| 2022-01-01T21:30:00Z | VEH00002 | -92.9568601706845 | -92.9568601706845 |
| 2022-01-01T21:31:00Z | VEH00001 | -93.1491006937533 | -93.1491006937533 |
| 2022-01-01T21:31:00Z | VEH00002 | -92.9559658727279 | -92.9559658727279 |
| 2022-01-01T21:32:00Z | VEH00001 | -93.1489732277653 | -93.1489732277653 |
| 2022-01-01T21:32:00Z | VEH00002 | -92.9559072328652 | -92.9559072328652 |
| 2022-01-01T21:33:00Z | VEH00001 | -93.1480159553770 | -93.148015955377 |
| 2022-01-01T21:33:00Z | VEH00002 | -92.9551552238215 | -92.9551552238215 |
The simplicity of the geo_point_in_polygon function makes it appealing, but there is a drawback. If we filter on a wide date/time range, it is possible that only a small percentage of the rows in that range fall within the polygon. This is problematic because, although the query takes advantage of indexes when filtering by time, it must then scan every row that passes the time filter in order to evaluate the polygon function. This is inefficient if nearly all of those rows are outside the polygon.
We can address this by doing some calculations in advance. Two possible approaches, both described below, are pre-calculating an S2 cell for each row and pre-calculating into which polygon(s) each row falls.
S2 cells divide the Earth into areas of roughly equal size. Each S2 cell is a quadrilateral projected onto the spherical surface of the Earth. The size of each S2 cell depends on its level, which can range from 0 to 30. Cells at lower-numbered levels are much larger, dividing the Earth into only a few cells, while cells at higher-numbered levels are tiny, dividing the Earth into many cells.
Fig. 1 illustrates the Earth divided into S2 cells at levels 0-4.
Fig. 1: the Earth divided into S2 cells at levels 0-4
You can read more about S2 cells at s2geometry.io.
Kusto Query Language ("KQL") contains a number of functions for working with S2 cells, and we can use these functions to optimize our queries. Two useful functions are geo_point_to_s2cell and geo_polygon_to_s2cells. The geo_point_to_s2cell function returns the unique ID (token) of the S2 cell containing a given longitude and latitude at a given S2 level. The geo_polygon_to_s2cells function returns an array of S2 cell tokens that fully cover a polygon.
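To see these two functions in action, here is a small, self-contained sketch; the point and polygon coordinates are illustrative, not taken from our dataset:

// Illustrative only: the S2 cell token containing a point at two different
// levels, and the covering cells for a small rectangle.
let samplePolygon = dynamic({"type": "Polygon", "coordinates": [[[-93.36, 39.83], [-90.36, 39.83], [-90.36, 39.97], [-93.36, 39.97], [-93.36, 39.83]]]});
print s2_level10 = geo_point_to_s2cell(-93.15, 39.57, 10),
      s2_level15 = geo_point_to_s2cell(-93.15, 39.57, 15),
      covering = geo_polygon_to_s2cells(samplePolygon, 10)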
We can use these functions to optimize our query by pre-calculating an S2 cell for each row. Start by creating a Materialized View that adds one or more S2 columns, as in the following example:
.create async materialized-view with (backfill=true, docString='AssetHistory with S2 cell calculated')
AssetHistoryView on table AssetHistory
{
    AssetHistory
    | extend s2_15 = geo_point_to_s2cell(lon, lat, 15)
    | summarize take_any(lon, lat, s2_15) by vehicleId, timeStamp
}
The materialized view is updated as rows are inserted into AssetHistory, so the S2 value is calculated for each row in advance rather than at query time. We can then use this view in place of the table and take advantage of the pre-calculated S2 column, as in the following query:
let s2_15_Array = geo_polygon_to_s2cells(polygon, 15);
let vehiclesInS2Cells = AssetHistoryView
    | where timeStamp between (startTime .. endTime)
        and s2_15 has_any (s2_15_Array);
vehiclesInS2Cells
| where geo_point_in_polygon(lon, lat, polygon)
| order by vehicleId asc, timeStamp asc
| project vehicleId, timeStamp, lon, lat, s2_15
This query yields the same results as the earlier one that does not use S2 cells, but it is potentially much faster. This is because we first filter our results on an array of S2 cells that covers an area only slightly larger than the polygon, and only then apply the more expensive polygon function.
Figures 2 - 5 illustrate this filtering.
Fig. 2 shows a sample polygon.
Fig. 3 shows a sample of vehicle data points over time.
In Fig. 4, we overlap the polygon with an array of S2 cells. These cells completely cover the polygon and cover an area slightly larger than the polygon.
Fig. 5 shows the filtering of vehicle data points to only those within the S2 cells. When the query does a scan to filter on geo_point_in_polygon, it now has fewer rows to scan.
Fig. 2: a sample polygon
Fig. 3: a sample of vehicle data points over time
Fig. 4: the polygon overlapped by S2 cells
Fig. 5: filter vehicle data points to only those within the S2 cells
We can get similar results by applying an update policy to our AssetHistory table that populates another table with the fields in AssetHistory, along with the calculated S2 column. The following code creates a function to calculate an S2 value for each row, creates the destination table, and applies an update policy that populates that table with this value along with all the columns in the AssetHistory table:

.create-or-alter function with (docstring = "Calculate S2 value for AssetHistory", skipvalidation = "false")
calculateS2ForAssetHistory()
{
    AssetHistory
    | project timeStamp, vehicleId, lat, lon, s2_15 = geo_point_to_s2cell(lon, lat, 15)
}

// The destination table must exist before the update policy is applied.
.create-merge table AssetHistoryWithS2 (timeStamp: datetime, vehicleId: string, lat: double, lon: double, s2_15: string)

.alter table AssetHistoryWithS2 policy update
@'[{"IsEnabled": true, "Source": "AssetHistory", "Query": "calculateS2ForAssetHistory()", "IsTransactional": true}]'
We can then query the AssetHistoryWithS2 table in the same way that we queried the AssetHistoryView materialized view above to speed up our geo queries.
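For example, here is a minimal sketch of the same query pattern against this table (assuming polygon, startTime, and endTime are defined as in the earlier examples):

// Filter on the pre-calculated S2 column first, then apply the precise
// (and more expensive) polygon test to the remaining rows.
let s2_15_Array = geo_polygon_to_s2cells(polygon, 15);
AssetHistoryWithS2
| where timeStamp between (startTime .. endTime)
    and s2_15 has_any (s2_15_Array)
| where geo_point_in_polygon(lon, lat, polygon)
| project vehicleId, timeStamp, lon, lat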
This approach works well in optimizing the bounding box query, but it has some disadvantages. You need to decide in advance which S2 level to use, and this decision depends on the size of your polygons: you want the array of covering S2 cells to be only slightly larger than the polygon. You can find the approximate size of the cells at each S2 level on the S2 geometry site. ADX dynamic arrays also have size limitations, and you may run into memory issues if your array of S2 cells becomes too large.
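To get a feel for how quickly the covering array grows as the level increases, you can compare the covering size at two levels. Here is a quick sketch using the polygon defined earlier in this article:

// The same polygon covered at two S2 levels: the higher level produces
// far more (and far smaller) cells.
let polygon = pack("type", "Polygon", "coordinates",
    pack_array(pack_array(
        pack_array(-93.36181640625, 39.838068180000015),
        pack_array(-90.362548828125, 39.838068180000015),
        pack_array(-90.362548828125, 39.972910551318989),
        pack_array(-93.36181640625, 39.972910551318989),
        pack_array(-93.36181640625, 39.838068180000015))));
print cellsAtLevel10 = array_length(geo_polygon_to_s2cells(polygon, 10)),
      cellsAtLevel12 = array_length(geo_polygon_to_s2cells(polygon, 12))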
KQL also supports H3 geographic cell types and includes functions similar to those supported for S2. We chose to use S2 because the 4-sided cells are simpler and more intuitive to work with than H3's hexagons.
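For example, the H3 counterpart of geo_point_to_s2cell is geo_point_to_h3cell, which takes a longitude, a latitude, and an H3 resolution (0-15). Below is a minimal sketch with illustrative coordinates:

// Illustrative only: the H3 cell token containing a point at two resolutions.
print h3_res5 = geo_point_to_h3cell(-93.15, 39.57, 5),
      h3_res9 = geo_point_to_h3cell(-93.15, 39.57, 9)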
Another approach is to determine in advance into which polygon each row's location falls. This works well if we know the polygons we wish to use. For example, the client may repeatedly ask which vehicles are inside the same set of buildings or neighborhoods or cities.
To do this, we create a Polygons table that contains the coordinates of the polygons. This table consists of 2 columns:
| Column Name | Data Type | Description |
|---|---|---|
| description | string | A descriptive name for this polygon |
| polygon | dynamic | A GeoJSON representation of the polygon |
The script to create this table is:
.create-merge table Polygons (description: string, polygon: dynamic)
Below is some sample code to populate the Polygons table with 3 polygons (run each command separately):

.set-or-append Polygons <| print description = "Polygon1", polygon = dynamic({"type": "Polygon","coordinates": [[[-93.44558715820312,39.573939343591896],[-92.43759155273436,39.573939343591896],[-92.43759155273436,39.67019926771586],[-93.44558715820312,39.67019926771586],[-93.44558715820312,39.573939343591896]]]})

.set-or-append Polygons <| print description = "Polygon2", polygon = dynamic({"type": "Polygon","coordinates": [[[-93.36181640625,39.838068180000015],[-90.362548828125,39.838068180000015],[-90.362548828125,39.97291055131899],[-93.36181640625,39.97291055131899],[-93.36181640625,39.838068180000015]]]})

.set-or-append Polygons <| print description = "Polygon3", polygon = dynamic({"type": "Polygon","coordinates": [[[-93.2684326171875,40.195659093364654],[-90.06591796875,40.195659093364654],[-90.06591796875,40.3549167507906],[-93.2684326171875,40.3549167507906],[-93.2684326171875,40.195659093364654]]]})
The materialized view performs a Cartesian join between the AssetHistory table and the Polygons table, and then calculates into which polygon(s) each row falls. The KQL command to create this view is:
.create async materialized-view with (backfill=true, docString="Populate polygons", dimensionTables=dynamic(["Polygons"]))
AssetHistoryView on table AssetHistory
{
    AssetHistory
    | extend joinColumn = 1
    | join kind=leftouter (Polygons | extend joinColumn = 1) on joinColumn
    | extend name = iif(geo_point_in_polygon(lon, lat, polygon), todynamic(description), dynamic(null))
    | summarize polygonNames = make_list(name) by timeStamp, vehicleId, lat, lon
}
The tables are joined on the joinColumn column, which is set to the constant 1 for each row in each table, so every row in Polygons is joined to every row in AssetHistory. After this Cartesian join, the iif function keeps a polygon's description only when the row's location falls inside that polygon, and the KQL make_list function (which skips nulls) collapses those matches into the polygonNames column, so each row in the result set contains an array of the polygons that contain its location. All other columns in AssetHistory are projected into this view.
In queries and functions, you can replace the AssetHistory table with the AssetHistoryView materialized view. This view provides all the data of AssetHistory, along with the extra polygonNames column, allowing you to filter on polygon membership without the overhead of calling the geo_point_in_polygon function at query time. We can now quickly determine whether each point is within one of the polygons listed in the Polygons table by simply filtering on the polygonNames column, as in the following query:
let startTime = datetime('2022-01-01T03:00:00Z');
let endTime = datetime('2022-01-01T03:15:00Z');
AssetHistoryView
| where timeStamp between (startTime .. endTime)
    and polygonNames has "Polygon1"
Below is a sample result set from the view.
| timeStamp | vehicleId | lat | lon | polygonNames |
|---|---|---|---|---|
| 2022-01-01T03:05:00Z | VEH00002 | 39.5748825220936 | -93.1714339269039 | ["Polygon1"] |
| 2022-01-01T03:06:00Z | VEH00002 | 39.5749022197192 | -93.1710063886960 | ["Polygon1"] |
| 2022-01-01T03:07:00Z | VEH00002 | 39.5753261310915 | -93.1703148535774 | ["Polygon1"] |
| 2022-01-01T03:08:00Z | VEH00002 | 39.5762358712237 | -93.1697172500363 | ["Polygon1"] |
| 2022-01-01T03:09:00Z | VEH00002 | 39.5762615376871 | -93.1695690820771 | ["Polygon1"] |
| 2022-01-01T03:10:00Z | VEH00002 | 39.5763826051913 | -93.1692058522320 | ["Polygon1"] |
| 2022-01-01T03:11:00Z | VEH00002 | 39.5771514635425 | -93.1690354942550 | ["Polygon1"] |
| 2022-01-01T03:12:00Z | VEH00002 | 39.5773253635722 | -93.1686505543725 | ["Polygon1"] |
| 2022-01-01T03:13:00Z | VEH00002 | 39.5774519453794 | -93.1681419560536 | ["Polygon1"] |
| 2022-01-01T03:14:00Z | VEH00002 | 39.5777309013798 | -93.1678841098861 | ["Polygon1"] |
| 2022-01-01T03:15:00Z | VEH00002 | 39.5778668180356 | -93.1676078073466 | ["Polygon1"] |
This approach works well if we know in advance the polygons in which we are interested. It will not work if you want to allow your users to query on ad hoc polygon boundaries.
The Geo functions in KQL are very powerful, and you can use them to perform bounding box queries. If you find that the queries are running slowly, this article describes some pre-calculations you can do to increase performance.
If you want to learn more about this topic, here are some good resources:
To learn more about S2 geometry, visit s2geometry.io.
If you are new to ADX, I recommend starting here to learn about ADX and here to learn about KQL.
Here are links to detailed documentation on the KQL functions mentioned in this article:
Also, I published a collection of articles and videos to help you get up and running quickly using ADX and KQL.
This article covers the Geo functions in KQL.