Implementing and Optimizing Bounding Box Queries in Azure Data Explorer
For our current project, we are capturing into ADX the location of vehicles over time. Our customer asked us to create a function that would return all vehicles that are within a given bounding box in a given time period. This is useful information when they want to know when a vehicle returns to a building, a neighborhood, or a city.
In this article, I will show how this can be accomplished using built-in functions, the limitations of those functions, and ways to overcome those limitations.
Our application captures time series and geographic data. The AssetHistory table contains information about the location of vehicles over time. The structure of this table is:
| Column | Data Type | Description |
|---|---|---|
| timeStamp | datetime | The time of the event |
| vehicleId | string | Unique ID of the vehicle |
| lat | double | Location latitude of the vehicle at that time |
| lon | double | Location longitude of the vehicle at that time |
Here is a small sample of data in this table:
| timeStamp | vehicleId | lat | lon |
|---|---|---|---|
| 2022-01-01T00:00:00Z | VEH00001 | 39.5537192488917 | -94.68388959051 |
| 2022-01-01T00:00:00Z | VEH00002 | 39.4831286543648 | -94.99069681588 |
| 2022-01-01T00:01:00Z | VEH00001 | 39.5595318673797 | -94.695714322373 |
| 2022-01-01T00:01:00Z | VEH00002 | 39.4834279222404 | -94.9959276770247 |
| 2022-01-01T00:02:00Z | VEH00001 | 39.5452045858879 | -94.6948031060648 |
| 2022-01-01T00:02:00Z | VEH00002 | 39.4937282808629 | -94.9981501288595 |
| 2022-01-01T00:03:00Z | VEH00001 | 39.5500482516451 | -94.6822965012806 |
| 2022-01-01T00:03:00Z | VEH00002 | 39.5020828393819 | -94.9988039760402 |
| 2022-01-01T00:04:00Z | VEH00001 | 39.5583474964348 | -94.68176047996 |
| 2022-01-01T00:04:00Z | VEH00002 | 39.5021519675971 | -95.0002670662021 |
| 2022-01-01T00:05:00Z | VEH00001 | 39.5476946907459 | -94.6897078041742 |
| 2022-01-01T00:05:00Z | VEH00002 | 39.4830067290929 | -94.9939493859201 |
| 2022-01-01T00:06:00Z | VEH00001 | 39.5517730215806 | -94.6972633402465 |
| 2022-01-01T00:06:00Z | VEH00002 | 39.4892456462058 | -94.9906174555097 |
Imagine this dataset extending for many hours, days, or months as the vehicles move over time. For our project, we generated 200 million rows of sample data: dozens of vehicles reporting their location every 60 seconds as they moved north and east over time.
ADX provides the built-in geo_point_in_polygon function that does exactly what we need. This function accepts 3 arguments: the longitude and latitude of a given location and a polygon object. It returns true if that location is within the bounds of the polygon; otherwise, it returns false.
However, this function does have limitations which I will describe after an example.
Using KQL, we can create a polygon with the following syntax:
let p1Lon = -93.36181640625;
let p1Lat = 39.838068180000015;
let p2Lon = -90.362548828125;
let p2Lat = 39.838068180000015;
let p3Lon = -90.362548828125;
let p3Lat = 39.972910551318989;
let p4Lon = -93.36181640625;
let p4Lat = 39.972910551318989;
let p1 = pack_array(p1Lon, p1Lat);
let p2 = pack_array(p2Lon, p2Lat);
let p3 = pack_array(p3Lon, p3Lat);
let p4 = pack_array(p4Lon, p4Lat);
let polygon = pack("type", "Polygon", "coordinates", pack_array(pack_array(p1, p2, p3, p4, p1)));
Then, we can write a query using geo_point_in_polygon to determine which rows are within this polygon, as shown below:
let startTime = datetime('2022-01-01T21:25:03Z');
let endTime = datetime('2022-01-01T21:33:06Z');
AssetHistory
| where timeStamp between (startTime .. endTime)
    and geo_point_in_polygon(lon, lat, polygon)
This query will return only those rows within the time range that are inside the polygon, as shown in the results below:
| timeStamp | vehicleId | lat | lon |
|---|---|---|---|
| 2022-01-01T21:26:00Z | VEH00001 | -93.1514131769570 | -93.151413176957 |
| 2022-01-01T21:26:00Z | VEH00002 | -92.9590414313576 | -92.9590414313576 |
| 2022-01-01T21:27:00Z | VEH00001 | -93.1505516077601 | -93.1505516077601 |
| 2022-01-01T21:27:00Z | VEH00002 | -92.9584569148338 | -92.9584569148338 |
| 2022-01-01T21:28:00Z | VEH00001 | -93.1503058581473 | -93.1503058581473 |
| 2022-01-01T21:28:00Z | VEH00002 | -92.9575508394116 | -92.9575508394116 |
| 2022-01-01T21:29:00Z | VEH00001 | -93.1495162302751 | -93.1495162302751 |
| 2022-01-01T21:29:00Z | VEH00002 | -92.9570576871067 | -92.9570576871067 |
| 2022-01-01T21:30:00Z | VEH00001 | -93.1493412918965 | -93.1493412918965 |
| 2022-01-01T21:30:00Z | VEH00002 | -92.9568601706845 | -92.9568601706845 |
| 2022-01-01T21:31:00Z | VEH00001 | -93.1491006937533 | -93.1491006937533 |
| 2022-01-01T21:31:00Z | VEH00002 | -92.9559658727279 | -92.9559658727279 |
| 2022-01-01T21:32:00Z | VEH00001 | -93.1489732277653 | -93.1489732277653 |
| 2022-01-01T21:32:00Z | VEH00002 | -92.9559072328652 | -92.9559072328652 |
| 2022-01-01T21:33:00Z | VEH00001 | -93.1480159553770 | -93.148015955377 |
| 2022-01-01T21:33:00Z | VEH00002 | -92.9551552238215 | -92.9551552238215 |
The simplicity of the geo_point_in_polygon function makes it appealing, but there is a drawback. If we filter on a wide date/time range, it is possible that only a small percentage of the rows in that range fall within the polygon. This is problematic because, although the query takes advantage of indexes when filtering by time, it must then scan every row that passes the time filter in order to evaluate the polygon function. This is inefficient if nearly all of those rows are outside the polygon.
We can address this by doing some calculations in advance. Two possible approaches, both described below, are pre-calculating an S2 cell for each row and pre-calculating into which polygon(s) each row falls.
S2 cells divide the Earth into areas of roughly equal size. Each S2 cell is a quadrilateral projected onto the spherical surface of the Earth. The size of each S2 cell depends on its level, which can range from 0 to 30. Cells at lower-numbered levels are much larger, dividing the Earth into only a few cells, while cells at higher-numbered levels are tiny, dividing the Earth into many cells.
Fig. 1 illustrates the Earth divided into S2 cells at levels 0-4.
Fig. 1: the Earth divided into S2 cells at levels 0-4
You can read more about S2 cells at s2geometry.io.
Kusto Query Language ("KQL") contains a number of functions for working with S2 cells, and we can use these functions to optimize our queries. Two useful functions are geo_point_to_s2cell and geo_polygon_to_s2cells. The geo_point_to_s2cell function returns the unique ID (token) of the S2 cell containing a given longitude and latitude at a given S2 level. The geo_polygon_to_s2cells function returns an array of S2 cell tokens that fully cover a polygon.
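To see these two functions in action, here is a small, self-contained sketch; the point and polygon coordinates are illustrative, not taken from our dataset:

// Illustrative only: the S2 cell token containing a point at two different
// levels, and the covering cells for a small rectangle.
let samplePolygon = dynamic({"type": "Polygon", "coordinates": [[[-93.36, 39.83], [-90.36, 39.83], [-90.36, 39.97], [-93.36, 39.97], [-93.36, 39.83]]]});
print s2_level10 = geo_point_to_s2cell(-93.15, 39.57, 10),
      s2_level15 = geo_point_to_s2cell(-93.15, 39.57, 15),
      covering = geo_polygon_to_s2cells(samplePolygon, 10)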
We can use these functions to optimize our query by pre-calculating an S2 cell for each row. Start by creating a Materialized View that adds one or more S2 columns, as in the following example:
.create async materialized-view with (backfill=true, docString='AssetHistory with S2 cell calculated')
AssetHistoryView on table AssetHistory
{
    AssetHistory
    | extend s2_15 = geo_point_to_s2cell(lon, lat, 15)
    | summarize take_any(lon, lat, s2_15) by vehicleId, timeStamp
}
The materialized view is updated as rows are inserted into AssetHistory, so the S2 value is calculated for each row in advance rather than at query time. We can then use this view in place of the table and take advantage of the pre-calculated S2 column, as in the following query:
let s2_15_Array = geo_polygon_to_s2cells(polygon, 15);
let vehiclesInS2Cells = AssetHistoryView
    | where timeStamp between (startTime .. endTime)
        and s2_15 has_any (s2_15_Array);
vehiclesInS2Cells
| where geo_point_in_polygon(lon, lat, polygon)
| order by vehicleId asc, timeStamp asc
| project vehicleId, timeStamp, lon, lat, s2_15
This query yields the same results as the earlier one that does not use S2 cells, but it is potentially much faster. This is because we first filter our results on an array of S2 cells that covers an area only slightly larger than the polygon, and only then apply the more expensive polygon function.
Figures 2 - 5 illustrate this filtering.
Fig. 2 shows a sample polygon.
Fig. 3 shows a sample of vehicle data points over time.
In Fig. 4, we overlap the polygon with an array of S2 cells. These cells completely cover the polygon and cover an area slightly larger than the polygon.
Fig. 5 shows the filtering of vehicle data points to only those within the S2 cells. When the query does a scan to filter on geo_point_in_polygon, it now has fewer rows to scan.
Fig. 2: a sample polygon
Fig. 3: a sample of vehicle data points over time
Fig. 4: the polygon overlapped by S2 cells
Fig. 5: filter vehicle data points to only those within the S2 cells
We can get similar results by applying an update policy to our AssetHistory table that populates another table with the fields in AssetHistory, along with the calculated S2 column. The following code creates a function to calculate an S2 value for each row, creates the destination table, and applies an update policy that populates that table with this value along with all the columns in the AssetHistory table:

.create-or-alter function with (docstring = "Calculate S2 value for AssetHistory", skipvalidation = "false")
calculateS2ForAssetHistory()
{
    AssetHistory
    | project timeStamp, vehicleId, lat, lon, s2_15 = geo_point_to_s2cell(lon, lat, 15)
}

// The destination table must exist before the update policy is applied.
.create-merge table AssetHistoryWithS2 (timeStamp: datetime, vehicleId: string, lat: double, lon: double, s2_15: string)

.alter table AssetHistoryWithS2 policy update
@'[{"IsEnabled": true, "Source": "AssetHistory", "Query": "calculateS2ForAssetHistory()", "IsTransactional": true}]'
We can then query the AssetHistoryWithS2 table in the same way that we queried the AssetHistoryView materialized view above to speed up our geo queries.
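For example, here is a minimal sketch of the same query pattern against this table (assuming polygon, startTime, and endTime are defined as in the earlier examples):

// Filter on the pre-calculated S2 column first, then apply the precise
// (and more expensive) polygon test to the remaining rows.
let s2_15_Array = geo_polygon_to_s2cells(polygon, 15);
AssetHistoryWithS2
| where timeStamp between (startTime .. endTime)
    and s2_15 has_any (s2_15_Array)
| where geo_point_in_polygon(lon, lat, polygon)
| project vehicleId, timeStamp, lon, lat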
This approach works well in optimizing the bounding box query, but it has some disadvantages. You need to decide in advance which S2 level to use, and this decision depends on the size of your polygons: you want the array of covering S2 cells to be only slightly larger than the polygon. You can find the approximate size of the cells at each S2 level on the S2 geometry site. ADX dynamic arrays also have size limitations, and you may run into memory issues if your array of S2 cells becomes too large.
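To get a feel for how quickly the covering array grows as the level increases, you can compare the covering size at two levels. Here is a quick sketch using the polygon defined earlier in this article:

// The same polygon covered at two S2 levels: the higher level produces
// far more (and far smaller) cells.
let polygon = pack("type", "Polygon", "coordinates",
    pack_array(pack_array(
        pack_array(-93.36181640625, 39.838068180000015),
        pack_array(-90.362548828125, 39.838068180000015),
        pack_array(-90.362548828125, 39.972910551318989),
        pack_array(-93.36181640625, 39.972910551318989),
        pack_array(-93.36181640625, 39.838068180000015))));
print cellsAtLevel10 = array_length(geo_polygon_to_s2cells(polygon, 10)),
      cellsAtLevel12 = array_length(geo_polygon_to_s2cells(polygon, 12))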
KQL also supports H3 geographic cell types and includes functions similar to those supported for S2. We chose to use S2 because the 4-sided cells are simpler and more intuitive to work with than H3's hexagons.
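For example, the H3 counterpart of geo_point_to_s2cell is geo_point_to_h3cell, which takes a longitude, a latitude, and an H3 resolution (0-15). Below is a minimal sketch with illustrative coordinates:

// Illustrative only: the H3 cell token containing a point at two resolutions.
print h3_res5 = geo_point_to_h3cell(-93.15, 39.57, 5),
      h3_res9 = geo_point_to_h3cell(-93.15, 39.57, 9)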
Another approach is to determine in advance into which polygon each row's location falls. This works well if we know the polygons we wish to use. For example, the client may repeatedly ask which vehicles are inside the same set of buildings or neighborhoods or cities.
To do this, we create a Polygons table that contains the coordinates of the polygons. This table consists of 2 columns:
| Column Name | Data Type | Description |
|---|---|---|
| description | string | A descriptive name for this polygon |
| polygon | dynamic | A GeoJSON representation of the polygon |
The script to create this table is:
.create-merge table Polygons (description: string, polygon: dynamic)
Below is some sample code to populate the Polygons table with 3 polygons (run each command separately):

.set-or-append Polygons <| print description = "Polygon1", polygon = dynamic({"type": "Polygon","coordinates": [[[-93.44558715820312,39.573939343591896],[-92.43759155273436,39.573939343591896],[-92.43759155273436,39.67019926771586],[-93.44558715820312,39.67019926771586],[-93.44558715820312,39.573939343591896]]]})

.set-or-append Polygons <| print description = "Polygon2", polygon = dynamic({"type": "Polygon","coordinates": [[[-93.36181640625,39.838068180000015],[-90.362548828125,39.838068180000015],[-90.362548828125,39.97291055131899],[-93.36181640625,39.97291055131899],[-93.36181640625,39.838068180000015]]]})

.set-or-append Polygons <| print description = "Polygon3", polygon = dynamic({"type": "Polygon","coordinates": [[[-93.2684326171875,40.195659093364654],[-90.06591796875,40.195659093364654],[-90.06591796875,40.3549167507906],[-93.2684326171875,40.3549167507906],[-93.2684326171875,40.195659093364654]]]})
The materialized view performs a Cartesian join between the AssetHistory table and the Polygons table, and then calculates into which polygon(s) each row falls. The KQL command to create this view is:
.create async materialized-view with (backfill=true, docString="Populate polygons", dimensionTables=dynamic(["Polygons"]))
AssetHistoryView on table AssetHistory
{
    AssetHistory
    | extend joinColumn = 1
    | join kind=leftouter (Polygons | extend joinColumn = 1) on joinColumn
    | extend name = iif(geo_point_in_polygon(lon, lat, polygon), todynamic(description), dynamic(null))
    | summarize polygonNames = make_list(name) by timeStamp, vehicleId, lat, lon
}
The tables are joined on the joinColumn column, which is set to the constant 1 for each row in each table, so every row in Polygons is joined to every row in AssetHistory. After this Cartesian join, the iif function keeps a polygon's description only when the row's location falls inside that polygon, and the KQL make_list function (which skips nulls) collapses those matches into the polygonNames column, so each row in the result set contains an array of the polygons that contain its location. All other columns in AssetHistory are projected into this view.
In queries and functions, you can replace the AssetHistory table with the AssetHistoryView materialized view. This view provides all the data of AssetHistory, along with the extra polygonNames column, allowing you to filter on polygon membership without the overhead of calling the geo_point_in_polygon function at query time. We can now quickly determine whether each point is within one of the polygons listed in the Polygons table by simply filtering on the polygonNames column, as in the following query:
let startTime = datetime('2022-01-01T03:00:00Z');
let endTime = datetime('2022-01-01T03:15:00Z');
AssetHistoryView
| where timeStamp between (startTime .. endTime)
    and polygonNames has "Polygon1"
Below is a sample result set from the view.
| timeStamp | vehicleId | lat | lon | polygonNames |
|---|---|---|---|---|
| 2022-01-01T03:05:00Z | VEH00002 | 39.5748825220936 | -93.1714339269039 | ["Polygon1"] |
| 2022-01-01T03:06:00Z | VEH00002 | 39.5749022197192 | -93.1710063886960 | ["Polygon1"] |
| 2022-01-01T03:07:00Z | VEH00002 | 39.5753261310915 | -93.1703148535774 | ["Polygon1"] |
| 2022-01-01T03:08:00Z | VEH00002 | 39.5762358712237 | -93.1697172500363 | ["Polygon1"] |
| 2022-01-01T03:09:00Z | VEH00002 | 39.5762615376871 | -93.1695690820771 | ["Polygon1"] |
| 2022-01-01T03:10:00Z | VEH00002 | 39.5763826051913 | -93.1692058522320 | ["Polygon1"] |
| 2022-01-01T03:11:00Z | VEH00002 | 39.5771514635425 | -93.1690354942550 | ["Polygon1"] |
| 2022-01-01T03:12:00Z | VEH00002 | 39.5773253635722 | -93.1686505543725 | ["Polygon1"] |
| 2022-01-01T03:13:00Z | VEH00002 | 39.5774519453794 | -93.1681419560536 | ["Polygon1"] |
| 2022-01-01T03:14:00Z | VEH00002 | 39.5777309013798 | -93.1678841098861 | ["Polygon1"] |
| 2022-01-01T03:15:00Z | VEH00002 | 39.5778668180356 | -93.1676078073466 | ["Polygon1"] |
This approach works well if we know in advance the polygons in which we are interested. It will not work if you want to allow your users to query on ad hoc polygon boundaries.
The Geo functions in KQL are very powerful, and you can use them to perform bounding box queries. If you find that the queries are running slowly, this article describes some pre-calculations you can do to increase performance.
If you want to learn more about this topic, here are some good resources:
To learn more about S2 geometry, visit s2geometry.io.
If you are new to ADX, I recommend starting here to learn about ADX and here to learn about KQL.
Here are links to detailed documentation on the KQL functions mentioned in this article:
Also, I published a collection of articles and videos to help you get up and running quickly using ADX and KQL.
This article covers the Geo functions in KQL.