Spatial analysis in Azure Synapse Analytics with ArcGIS GeoAnalytics Engine
Published Feb 14 2023 08:00 AM

Data is only as valuable as what can be done with it. As data collection from multiple, disparate sources accelerates, companies across all industries face the challenge of finding value in diverse datasets. Fortunately, an estimated 80% of all data has location attributes, which allows geospatial analysis techniques to uncover insights in these seemingly unrelated and possibly dissimilar datasets.

 

The demand for advanced and performant geospatial analysis (or “geoanalytics”) is increasing due to the rapid growth of location-based data collected via smart devices and Internet of Things (IoT) endpoints across industries. Data scientists and big data analysts who have already invested heavily in big data analytics infrastructure, such as Azure Synapse Analytics, have been asking for geoanalytics capabilities that integrate seamlessly with open-source packages in their existing big data workflows, so they can perform anything from simple to complex spatial analysis on demand.

 

To meet that demand, Esri, the global leader in GIS, has recently introduced its latest innovation, ArcGIS GeoAnalytics Engine, a comprehensive library for advanced and performant spatial analytics. The library is delivered into the customer’s existing big data infrastructure – a cloud environment such as Azure Synapse Analytics – to enable parallelized and distributed geoanalytics workflows by extending Apache Spark™ with ready-to-use SQL functions and analysis tools.

 

What is Azure Synapse?

Azure Synapse is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse brings together the best of SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time series analytics, Pipelines for data integration and ETL/ELT, and deep integration with other Azure services such as Power BI, Cosmos DB, and Azure ML.

 

[Image: Azure Synapse Analytics overview diagram]

 

What is Esri’s GeoAnalytics Engine?

GeoAnalytics Engine provides a cloud-native and Spark-native solution for spatial big data analytics. Esri has decoupled GeoAnalytics tools and functions from ArcGIS, allowing these tools to be deployed within Spark environments outside of ArcGIS. This enables data scientists and big data analysts to perform fast, large-scale batch processing of big datasets using a cluster of machines, whether on premises or in the cloud. The key features of GeoAnalytics Engine include:

  • 120+ spatial SQL functions - Create geometries, test spatial relationships, and more using Python or SQL syntax (a short example follows this list).
  • 15+ powerful analysis tools - Run common spatiotemporal and statistical analysis workflows with only a few lines of code.
  • Automatic spatial indexing - Perform optimized spatial joins and other SQL operations immediately.
  • Read from and write to common data sources - Load and save data from shapefiles, feature services, and vector tiles.
  • Easy to use - Build spatially enabled big data pipelines with an intuitive Python API that extends PySpark.
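As a minimal sketch of the first bullet (assuming a Spark session with GeoAnalytics Engine installed and authorized), the same point-creation function can be called through the Python API or through Spark SQL:

from geoanalytics.sql import functions as ST

df = spark.createDataFrame([(-104.99, 39.74)], ["lon", "lat"])

# Python syntax: call the spatial function directly on the DataFrame.
pts = df.withColumn("SHAPE", ST.point("lon", "lat", 4326))

# SQL syntax: the same function is registered for Spark SQL queries.
df.createOrReplaceTempView("coords")
pts_sql = spark.sql("SELECT *, ST_Point(lon, lat, 4326) AS SHAPE FROM coords")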

 

Deployment Architecture

GeoAnalytics Engine can be installed on a personal computer, a standalone Spark cluster, or a managed Spark service in the cloud. Deployment in Azure Synapse Analytics starts with configuring your Synapse workspace and installing the geoanalytics.jar and geoanalytics.whl files as workspace packages in Azure Synapse Studio. Then a new Apache Spark pool is created to configure the Spark environment. Once the pool is ready, users can perform analysis in a PySpark notebook, with data accessed directly from their cloud data repositories.
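Before running any analysis, it is worth verifying the install from a notebook cell. A minimal sketch, assuming the workspace packages are attached to the pool (the authorization call follows the username/password pattern from Esri's documentation, but treat the exact arguments as license-dependent; the credentials are placeholders):

import geoanalytics

# Authorize GeoAnalytics Engine for this Spark session (placeholder credentials).
geoanalytics.auth(username="xxxx", password="yyyy")

# Confirm the spatial SQL functions are registered on the pool.
spark.sql("SELECT ST_Point(-104.99, 39.74, 4326) AS pt").show()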

 

Following the analysis, users can visualize results directly in the notebook, write them back to their cloud data stores, or publish them to ArcGIS Online as feature layers, as the examples later in this post demonstrate.


Figure 1. GeoAnalytics Engine architecture for the Azure Synapse Analytics Spark environment.

 

Spatial capabilities: SQL functions and analysis tools

ArcGIS GeoAnalytics Engine is a powerful analytical library that enables data scientists and analysts to easily work with common GIS formats. The ability to process geospatial data in GIS formats provides strong interoperability between the Azure Synapse Analytics service and Esri products. Check out this documentation, which includes multiple tutorials demonstrating common workflows and usage patterns, including how to analyze and visualize data and how to read and write shapefiles and feature services.
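That interoperability means common GIS sources load like any other Spark data source. A hedged sketch (the county feature service URL reappears later in this post; the shapefile path is hypothetical, and the "shapefile" format name is our assumption from the GeoAnalytics Engine data-source documentation):

# Read a hosted feature service into a Spark DataFrame.
df_counties = spark.read.format("feature-service") \
    .load("https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/USA_Counties_Generalized/FeatureServer/0")

# Read a shapefile from cloud storage.
df_parcels = spark.read.format("shapefile").load("abfss://---/parcels/")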

 

Currently, GeoAnalytics Engine provides 120+ SQL functions and 15+ spatial analysis tools that support advanced spatial and spatiotemporal analysis. The functions extend the Spark SQL API by enabling spatial queries on DataFrame columns; they can be called as Python functions or in a PySpark SQL query statement to create geometries, operate on geometries, evaluate spatial relationships, summarize geometries, and more. In contrast to the SQL functions, which operate row by row on one or two columns, GeoAnalytics tools are aware of all columns in a DataFrame and use all rows to compute a result when required. This wide array of analysis tools enables you to manage, enrich, summarize, or analyze entire datasets. The available tools include the following (a short sketch contrasting the two styles follows the list):

 

  • Aggregate Points
  • Calculate Density
  • Find Hotspots
  • Find Point Clusters
  • Geographically Weighted Regression (GWR)
  • Detect Incidents
  • Find Dwell Locations
  • Find Similar Locations
  • Calculate Field
  • Clip
  • Overlay
  • Spatiotemporal Join
  • Reconstruct Tracks
  • Summarize Within
  • Trace Proximity Events
  • Calculate Motion Statistics
  • Group By Proximity
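To make the row-wise versus whole-DataFrame distinction concrete, here is a hedged sketch (df_points and df_polygons are hypothetical DataFrames with SHAPE geometry columns; the Clip parameter names are an assumption):

from geoanalytics.sql import functions as ST
from geoanalytics.tools import Clip

# SQL function: evaluated row by row, producing one value per input row.
df_with_area = df_polygons.withColumn("area", ST.area("SHAPE"))

# Analysis tool: consumes entire DataFrames to compute its result.
df_clipped = Clip().run(input_dataframe=df_points, clip_dataframe=df_polygons)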

 

Computational performance benchmarks

Compute time is often a deciding factor for customers in their analytics solutions and time-critical decision-making scenarios. With its automatic spatial indexing, GeoAnalytics Engine can deliver significantly better compute performance for big data spatial analytics than open-source spatial packages. In Esri’s benchmark testing, the performance gains grew as the size of the data increased, so users see the biggest improvements on the largest datasets. Compute times are shown below for a spatial intersection task that joined two input datasets (points and polygons) of varying sizes, up to millions of records.

 

| Left Dataset | Right Dataset | Single Machine (s) | Multi-Machine (s) |
|---|---|---|---|
| 50 polygons | 6K points | 6 | 5 |
| 3K polygons | 6K points | 10 | 5 |
| 3K polygons | 2M points | 19 | 9 |
| 3K polygons | 17M points | 46 | 16 |
| 220K polygons | 17M points | 80 | 29 |
| 11M polygons | 17M points | 515 (8.6 min) | 129 (2.1 min) |
| 11M polygons | 19M points | 1,373 (22 min) | 310 (5 min) |
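As a hedged sketch of the benchmarked operation, the intersection task amounts to a point-in-polygon spatial join (df_points and df_polygons are hypothetical inputs with point and polygon SHAPE columns):

from geoanalytics.sql import functions as ST
from pyspark.sql import functions as F

# Join each point to the polygon(s) it intersects; GeoAnalytics Engine
# applies its automatic spatial indexing to optimize the join.
joined = df_points.alias("pt").join(
    df_polygons.alias("poly"),
    ST.intersects(F.col("pt.SHAPE"), F.col("poly.SHAPE")))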

 

Cross-industry use case

This section demonstrates a cross-industry use case of GeoAnalytics Engine spatial SQL functions and analysis tools on very large datasets. We aim to showcase scalable spatiotemporal analysis that enables (spatial) data scientists across industries to quickly make important business decisions.

 

Continuously growing datasets on human mobility and financial transactions offer an unprecedented opportunity to better understand human activity at scale. These datasets contain anonymized, time-sequential records of people’s movements and debit/credit card transactions, running to hundreds of millions of records, sometimes with hundreds of attributes. Spatiotemporal analysis of these very large datasets can be helpful in a variety of application contexts. However, processing and analyzing them may require specialized, efficient geoanalytics functions and tools optimized for a distributed computing environment. Using GeoAnalytics Engine in the Synapse Analytics environment, we can dig deeper into these massive datasets to gain actionable insights into human activity patterns. Specifically, we can answer the following questions:

  1. What is the aggregated spatial pattern of mobile device activity throughout the United States?
  2. Where do people spend time around an urban area?
  3. Is there a correlation between where people spend money and where they spend time?
  4. Did the COVID-19 pandemic affect people’s spending ($) patterns?

 

To answer these questions, we used the Cell Analytics™ dataset from Esri partner Ookla® and the Places and Spend datasets from SafeGraph. Cell Analytics™ is a crowdsourced dataset containing time-sequential records of wireless service performance, coverage, and signal measurements collected via the connected Speedtest® application on users’ mobile devices around the world. SafeGraph Places provides an attribute-rich database of global points of interest, and SafeGraph Spend provides an anonymized, permissioned, and aggregated transaction dataset tied to those individual places in the US.

 

What is the aggregated spatial pattern of mobile devices?

This question can be answered by multiple tools in GeoAnalytics Engine, including Find Hot Spots, Aggregate Points, and Calculate Density. First, we read the massive Ookla dataset directly from Azure Blob Storage within seconds using a single line of code with GeoAnalytics Engine, bypassing a patchwork of libraries and dependencies. Reading and geo-processing (e.g., creating and transforming geometry, spatial indexing, spatial and attribute filtering) such high-volume data could take several hours with tools that are not optimized for Spark. Once the data was ready, we used the Find Hot Spots tool – as shown in the code snippet below – to quickly identify statistically significant hot spots and cold spots in ~6.8 billion Cell Analytics records (2018-2020) based on the Getis-Ord Gi* statistic. The resulting DataFrame is visualized with st.plot() - a lightweight plotting method included with GeoAnalytics Engine that allows instant viewing of geometries stored in a PySpark DataFrame - in Figure 2, which shows distinct spatial clusters of Ookla-connected devices across the conterminous United States.

 

 

import geoanalytics
from geoanalytics.sql import functions as ST
from geoanalytics.tools import FindHotSpots
from pyspark.sql import functions as F
from pyspark.sql.functions import to_timestamp, to_date, col

# Read in the dataset and geo-enable it
df_data = spark.read.parquet("abfss://---/*.parquet") \
      .selectExpr("*", "ST_Point(client_longitude, client_latitude, 4326) as SHAPE") \
      .st.set_geometry_field("SHAPE") \
      .withColumn("date_", to_date(col("result_date"))) \
      .withColumn("time_stamp", to_timestamp('result_date', "yyyy-MM-dd HH:mm:ss")) \
      .st.set_time_fields("time_stamp")

# Register the DataFrame as a temporary view so it can be queried with SQL
df_data.createOrReplaceTempView("df_data")

# Filter to valid signal-strength readings within the study period
df = spark.sql(
"""
SELECT
  device_id,
  SHAPE,
  time_stamp,
  device_model,
  rsrp
FROM
  df_data
WHERE
  rsrp > -200
  AND rsrp < 0
  AND rsrq > -30
  AND rsrq < 0
  AND date_ >= '2018-01-01'
  AND date_ <= '2020-12-31'
"""
)

# Find statistically significant hot spots using 15 km bins and a 100 km neighborhood
result_hot = FindHotSpots() \
    .setBins(bin_size=15000, bin_size_unit="Meters") \
    .setNeighborhood(distance=100000, distance_unit="Meters") \
    .run(dataframe=df)
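For a quick look at the result before styling it in ArcGIS, the bins can be rendered directly in the notebook with the plotting method mentioned above (a minimal sketch; styling arguments are omitted):

# Instantly view the hot spot geometries stored in the result DataFrame.
result_hot.st.plot()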

 

 

 


Figure 2. Hot spot map of Ookla Cell Analytics background sampling devices (2018-2020) in the conterminous United States.

 

We also identified spatial patterns in the average network strength of Ookla-connected devices by summarizing the rsrp column into HexBins using the Aggregate Points tool from GeoAnalytics Engine. The map in Figure 3 shows the network strength pattern of mobile devices connected through all major network providers, including AT&T, Verizon, and others.

 

 

from geoanalytics.tools import AggregatePoints

# Aggregate data into 500 km hexagonal bins, averaging signal strength (rsrp)
result_bins = AggregatePoints() \
    .setBins(bin_size=500000, bin_size_unit="Meters", bin_type="Hexagon") \
    .addSummaryField(summary_field="rsrp", statistic="Mean") \
    .run(df)

 

 

 


Figure 3. Ookla data aggregated into HexBins that show network strength - from poor (red) to excellent (green) - of all the connected devices (2018-2020).

 

Where do people spend time around Denver?

We next narrowed our analysis to a specific urban area: Denver, Colorado. To peek into people’s activity space over a weekend, we analyzed time-sequential Ookla data points from Saturday, January 4, 2020. After spatially and temporally filtering ~6 billion data points, we looked for devices that dwelled within 100 meters for at least 15 minutes using the Find Dwell Locations tool. That produced a PySpark DataFrame with columns such as DwellDuration (reported in milliseconds and converted to minutes below) and MeanDistance (meters), i.e., the average distance between consecutive points in a dwell location. We then aggregated the dwell duration into 500-meter HexBins using the Aggregate Points tool. The resulting DataFrame was published to ArcGIS Online as a feature layer, which was styled and published as an interactive web app (Figure 4).

 

 

from geoanalytics.tools import FindDwellLocations, AggregatePoints
from pyspark.sql import functions as F

# Find devices that stayed within 100 meters for at least 15 minutes
result_dwell = FindDwellLocations() \
    .setTrackFields("device_id") \
    .setDistanceMethod(distance_method="Planar") \
    .setDwellMaxDistance(max_distance=100, max_distance_unit="Meters") \
    .setDwellMinDuration(min_duration=15, min_duration_unit="Minutes") \
    .setOutputType(output_type="Dwellpoints") \
    .run(dataframe=ookla_denver_2020_01_04)

# DwellDuration is reported in milliseconds; convert to minutes
result_dwell = result_dwell.withColumn("DwellDuration_minutes", F.col("DwellDuration") / 60000)

# Aggregate dwell duration into 500-meter hexagonal bins
result_dwell_bins = AggregatePoints() \
    .setBins(bin_size=500, bin_size_unit="Meters", bin_type="Hexagon") \
    .addSummaryField(summary_field="DwellDuration_minutes", statistic="Mean") \
    .addSummaryField(summary_field="DwellDuration_minutes", statistic="Max") \
    .addSummaryField(summary_field="DwellDuration_minutes", statistic="Min") \
    .addSummaryField(summary_field="DwellDuration_minutes", statistic="Sum") \
    .run(result_dwell)

# Publish the result to ArcGIS Online as a feature layer
from arcgis.gis import GIS
gis = GIS(username="xxxx", password="yyyy")

sdf = result_dwell_bins.st.to_pandas_sdf()
lyr = sdf.spatial.to_featurelayer('ookla_dwells_Jan04_2020_Denver')

 

 


Figure 4. Average dwell duration map of connected devices in Denver on Saturday, January 4, 2020.

 

Where do people spend money on a particular day of the year?

Next, we explored people’s shopping patterns in the Denver area on the same day, January 4, 2020. We ingested SafeGraph Places and Spend data from Azure Blob Storage, then cleaned and geo-processed the data into point geometries using GeoAnalytics Engine and PySpark SQL functions. The web app built in ArcGIS Online from the exported feature layer is shown in Figure 5. It highlights spending patterns at different business points of interest (POIs) around the Denver area; quite a few POIs saw relatively high spending on January 4th.

 

 

# Read in SafeGraph Spend data and geo-process it
from pyspark.sql import functions as F
from pyspark.sql.functions import col, split, date_format
from pyspark.sql.types import DoubleType, IntegerType

df_spend_2020_01 = spark.read.option("header", True).option("escape", "\"") \
    .csv("/mnt/safegraph/spend_patterns/y=2020/m=1/*.csv.gz") \
    .withColumn("Month", F.lit("Jan"))

spend_2020_01_pnt = (df_spend_2020_01
    .selectExpr("*", "ST_Point(longitude, latitude, 4326) as SHAPE")
    .withColumn("Date_start", F.to_timestamp(date_format("spend_date_range_start", "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH:mm:ss"))
    .withColumn("Date_end", F.to_timestamp(date_format("spend_date_range_end", "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH:mm:ss"))
    .withColumn("Spend", df_spend_2020_01["raw_total_spend"].cast(DoubleType()))
    .withColumn("online_trans", df_spend_2020_01["online_transactions"].cast(IntegerType()))
    .withColumn("Tot_online_spend", df_spend_2020_01["online_spend"].cast(DoubleType()))
    .withColumn("customers", df_spend_2020_01["raw_num_customers"].cast(DoubleType()))
    .withColumn("spend_by_dayArray", split(col("spend_by_day"), ","))
    .where("brands IS NOT NULL"))

df_spend_2020_01_pnt = spend_2020_01_pnt \
    .st.set_time_fields("Date_start") \
    .st.set_geometry_field("SHAPE")

# Filter spend data to a bounding box around the Denver area
boundingbox_spend = df_spend_2020_01_pnt.selectExpr(
    "placekey", "Date_start", "Date_end", "brands", "top_category", "sub_category",
    "Spend", "online_trans", "Tot_online_spend", "customers", "SHAPE",
    "ST_EnvIntersects(SHAPE, -105.2006, 39.545, -104.709366, 39.9883) as Filter")
spend_denver = boundingbox_spend.filter(boundingbox_spend['Filter'] == True)

# Time filter: extract spending for January 4-5 (elements 4 and 5 of the daily spend array)
spend_2020_01_pnt_4th = spend_denver \
    .withColumn("spend_4", F.element_at('spend_by_dayArray', 4).cast(DoubleType())) \
    .withColumn("spend_5", F.element_at('spend_by_dayArray', 5).cast(DoubleType())) \
    .withColumn("spend_45", (col("spend_4") + col("spend_5"))) \
    .select("placekey", "brands", "top_category", "sub_category", "Spend",
            "online_trans", "Tot_online_spend", "customers", "SHAPE",
            "spend_4", "spend_5", "spend_45")

# Export the data to ArcGIS Online
sdf = spend_2020_01_pnt_4th.st.to_pandas_sdf()
lyr = sdf.spatial.to_featurelayer('Safegraph_Spend_Jan04_2020_Denver')

 

 

 


Figure 5. Consumer spending patterns aggregated by POIs in Denver on Saturday, January 4, 2020. 

 

Is there a correlation between where people spend money and where they spend time?

We overlaid the Figure 4 and Figure 5 map layers and then filtered for locations with both high spending and long dwell times. The Figure 6 map shows that some urban POIs coincided with higher consumer spending but less dwell time, which suggests people tend to shop quickly and spend more of their time around recreational facilities such as parks and golf courses.
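A hedged sketch of that overlay as a spatial join in the notebook (the bin geometry and summary column names are assumptions about the AggregatePoints output schema, and the thresholds are illustrative):

from geoanalytics.sql import functions as ST
from pyspark.sql import functions as F

# Join dwell-duration bins (Figure 4) to spend POIs (Figure 5) where a
# bin contains a POI, then keep only high-spend, long-dwell locations.
dwell_spend = result_dwell_bins.alias("bin").join(
        spend_2020_01_pnt_4th.alias("poi"),
        ST.contains(F.col("bin.geometry"), F.col("poi.SHAPE"))) \
    .where((F.col("MEAN_DwellDuration_minutes") > 30) & (F.col("spend_45") > 1000))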

 


Figure 6. Visualizing aggregated spending of money and time at urban POIs in Denver on January 4, 2020.

 

Did the COVID-19 pandemic affect people’s spending ($) patterns?

Events that pose significant or long-term risk to safety, such as pandemics, wildfires, and storms, directly or indirectly cause individuals to shelter in place and thus affect mobility patterns. Substantial changes in mobility over time can also influence people's routine spending patterns, whether in person or online. In the next example, we asked how the use of streaming services such as Netflix changed as the COVID-19 pandemic unfolded across the United States in 2020. To answer that, we analyzed the nationwide related_streaming_cable_pct attribute (the percent of customers at a place that also spent money on specific streaming or cable services during the month) available in the 2020 SafeGraph Spend data, using GeoAnalytics Engine tools.

 

The results of our analysis below are visualized through an interactive time slider map (Figure 7), built using an ArcGIS Instant App. It displays aggregated monthly counts of Netflix subscribers across United States counties, showing how Netflix subscriptions increased or decreased in different areas over time. The map shows that in 2020 many counties in the Eastern US experienced a spike in Netflix customers during the summer, whereas some Western and Midwestern counties saw the same during the winter. GeoAnalytics Engine makes it easy to perform these kinds of high-resolution spatiotemporal summarizations, which in turn enable further exploration of nuanced patterns. For example, subscriber growth rates could be evaluated in the context of demographic and market characteristics to build models that help explain why some areas grew more than others. This type of information could be used to shape advertising, content strategy, or sales initiatives.

 

 

# Process SafeGraph Spend data for 2020
from pyspark.sql.functions import get_json_object

spend_2020_pnt = (df_spend_2020
    .withColumn("point", ST.transform(ST.point("longitude", "latitude", 4326), 2263))
    .withColumn("Date_start", F.to_timestamp(date_format("spend_date_range_start", "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH:mm:ss"))
    .withColumn("Date_end", F.to_timestamp(date_format("spend_date_range_end", "yyyy-MM-dd HH:mm:ss"), "yyyy-MM-dd HH:mm:ss"))
    .withColumn("Spend", df_spend_2020["raw_total_spend"].cast(DoubleType()))
    .withColumn("online_trans", df_spend_2020["online_transactions"].cast(IntegerType()))
    .withColumn("Tot_online_spend", df_spend_2020["online_spend"].cast(DoubleType()))
    .withColumn("customers", df_spend_2020["raw_num_customers"].cast(DoubleType()))
    .where("brands IS NOT NULL"))

# Cross-shopping pattern for the streaming service Netflix: get the % of customers
# tied to each physical POI that also paid for Netflix
# (shown for January only; repeat for all other months)
df_online_netflix_01 = spend_2020_pnt \
    .withColumn("Netflix_pct", get_json_object(spend_2020_pnt.related_streaming_cable_pct, '$.Netflix').cast(DoubleType())) \
    .where("Netflix_pct IS NOT NULL") \
    .withColumn("Netflix_cnt", (col("Netflix_pct") / 100) * col("customers"))

# Import the United States county boundaries from a feature service
county = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/USA_Counties_Generalized/FeatureServer/0"
df_county = spark.read.format("feature-service").load(county).withColumn("shape", ST.transform("shape", 4326))

# Aggregate Netflix customer counts into counties
from geoanalytics.tools import AggregatePoints

result_netflix = AggregatePoints().setPolygons(df_county) \
    .addSummaryField(summary_field="Netflix_cnt", statistic="Sum") \
    .run(df_online_netflix_01) \
    .withColumn("Month", F.to_date(F.lit("2020-01"))) \
    .selectExpr("FID", "OBJECTID", "NAME", "STATE_NAME", "POPULATION", "POP_SQMI",
                "shape", "COUNT", "SUM_Netflix_cnt", "cast(Month as string) Month")  # January only; repeat for other months

# Export the data to ArcGIS Online
sdf = result_netflix.st.to_pandas_sdf()
lyr = sdf.spatial.to_featurelayer('Customer_cnt_netflix_2020')
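The January-only steps above are repeated for each month of 2020 before publishing. A hedged sketch of that repetition (process_month is a hypothetical helper that would wrap the read, Netflix extraction, and county aggregation steps shown above):

from functools import reduce

def process_month(m: int):
    # Hypothetical helper: apply the read, geo-processing, Netflix
    # extraction, and county aggregation steps above for month m.
    df_m = spark.read.option("header", True).option("escape", "\"") \
        .csv(f"/mnt/safegraph/spend_patterns/y=2020/m={m}/*.csv.gz")
    # ... same transformations as the January example, tagged with month m ...
    return df_m

# Union the twelve monthly county-level results into one DataFrame.
monthly = [process_month(m) for m in range(1, 13)]
netflix_2020_all = reduce(lambda a, b: a.unionByName(b), monthly)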

 

 

 


Figure 7. Time slider web app showing the monthly change in Netflix customer counts: county-level counts of Netflix users over the first year of the pandemic (2020).

 

Conclusion and roadmap

In this blog, we introduced Esri’s latest Spark-optimized geoanalytics offering – ArcGIS GeoAnalytics Engine – to the broader data science community. We presented a cross-industry use case, accompanied by sample code in an Azure Synapse Analytics notebook, that shows the diverse capabilities of GeoAnalytics Engine for solving challenging business problems with speed and scale. Now you have all the geoanalytics you need in a single, cloud-native library that blends seamlessly with open-source packages for a smooth workflow.

 

Several updates to GeoAnalytics Engine will be available in a future release, including (but not limited to) the following capabilities: Snap Tracks (to correct for GPS drift of moving vehicles), support for GeoJSON and GeoParquet, generation of optimal coordinate systems, H3 binning, basemap support for visualization, and clustering algorithms such as K-Nearest Neighbor.

 

Please reach out to your Microsoft and Esri account teams for details on GeoAnalytics Engine deployment. To try SafeGraph data with Esri, download this free sample dataset on the marketplace. To learn more about GeoAnalytics Engine and explore how to gain access, please visit Esri's product page for ArcGIS GeoAnalytics Engine.

 
