By: Zoe Statman-Weil & Mark Mathis, Impact Observatory, Inc.
Land use and land cover (LULC) maps are used by decision makers in governments, civil society, industry, and finance to observe how the world is changing and to understand and manage the impact of their actions. Historically, LULC maps have been produced with expensive, semi-automated techniques that require substantial human input, leading to long delays between the collection of satellite images and the production of maps and limiting how regularly and frequently users receive updates. Producing the detailed, accurate, and timely maps the world needs to understand our rapidly changing planet requires automation. A groundbreaking artificial intelligence-powered 2020 global LULC map was produced for Esri on Microsoft Azure by Impact Observatory, a mission-driven technology company bringing AI algorithms and on-demand data to environmental monitoring and sustainability risk analysis. This map will help decision makers address challenges in climate change mitigation and adaptation, biodiversity preservation, and sustainable development.
The Impact Observatory LULC machine learning (ML) model was trained on an Azure NC12s v2 virtual machine (VM) powered by NVIDIA® Tesla® P100 GPUs using over 5 billion pixels hand-labeled into one of ten classes: trees, water, built area, scrub/shrub, flooded vegetation, bare ground, cropland, grassland, snow/ice, and clouds. The model was then deployed over more than 450,000 Copernicus Sentinel-2 Level-2A 10-meter resolution, surface reflectance corrected images, each 100 km x 100 km in size and totaling 500 terabytes of satellite imagery (1 terabyte = 10¹² bytes) hosted on the Microsoft Planetary Computer. The processing leveraged geospatial open standards, Azure Batch, and other Azure resources to efficiently produce the final dataset at scale and at a low cost.
The Microsoft Planetary Computer and Impact Observatory (IO) make extensive use of geospatial open standards, specifically Cloud Optimized GeoTIFF (COG) and SpatioTemporal Asset Catalog (STAC). Use of these standards enabled the team to produce the Esri 2020 Land Cover map using distributed processing at scale.
GeoTIFF is a widely used open standard for geospatial data based on the common TIFF image file format; it supports imagery with bands beyond the usual red, green, and blue visible light bands and contains additional metadata to locate the image on the surface of the Earth. A COG is a regular GeoTIFF file, intended to be hosted on an HTTP file server, with an internal organization that enables more efficient workflows in the cloud. Not only can COGs be read from the cloud without duplicating the data to a local filesystem, but a portion of the file can be read using HTTP GET Range requests, allowing for targeted reading and efficient processing. Azure Blob Storage is an ideal solution for hosting COGs, as it is an unstructured data storage system accessible via HTTP requests. The LULC map was produced using Sentinel-2 COGs hosted on Microsoft’s Planetary Computer in Blob Storage, and all prediction rasters produced by the model were saved as COGs back to Blob Storage.
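To illustrate that access pattern, here is a minimal sketch using the open-source rasterio library to read a single window out of a COG over HTTP; the blob URL, band, and window size are hypothetical and not taken from IO's pipeline.

```python
import rasterio
from rasterio.windows import Window

# Hypothetical URL of one Sentinel-2 band stored as a COG in Azure Blob Storage.
cog_url = "https://example.blob.core.windows.net/sentinel-2/T10TET/B04.tif"

with rasterio.open(cog_url) as src:
    # Only the file header and the tiles overlapping this window are fetched,
    # via HTTP GET Range requests; the full file is never downloaded.
    chip = src.read(1, window=Window(col_off=0, row_off=0, width=512, height=512))

print(chip.shape)  # (512, 512)
```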
The STAC specification is a common language for indexing geospatial data to make it easy to search and discover. IO searched the Planetary Computer’s STAC catalog to identify Sentinel-2 imagery for specific locations, times, and cloud coverage. IO used a community-supported implementation of the STAC API to create its own STAC catalog on Azure App Service, with Azure Database for PostgreSQL as the underlying data store. IO’s STAC catalog was used to index data throughout the model deployment pipeline, serving both as a tool for checkpointing pipeline progress and as an index of the final product.
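As an example of that kind of search, a query like the following against the Planetary Computer's STAC API (shown here with the open-source pystac-client library) returns matching Sentinel-2 Level-2A items; the area of interest, date range, and cloud-cover threshold are illustrative only.

```python
from pystac_client import Client

# Open the Planetary Computer's public STAC API.
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# Search the Sentinel-2 Level-2A collection by area, date range, and cloud cover.
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-122.5, 47.4, -122.2, 47.7],
    datetime="2020-01-01/2020-12-31",
    query={"eo:cloud_cover": {"lt": 20}},
)

for item in search.items():
    print(item.id, item.properties["eo:cloud_cover"])
```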
COGs and STAC, both easily leveraged in Azure, provide a scalable and highly flexible framework for processing geospatial data.
Azure Batch was used by IO to efficiently deploy the model over satellite images in parallel at large scale. IO bundled the ML model and its deployment and processing code into Docker containers and ran Batch tasks within these containers on a Batch pool of compute nodes.
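A minimal sketch of that pattern with the azure-batch Python SDK might look like the following; the account name, endpoint, container image, scene IDs, and script names are all hypothetical placeholders rather than IO's actual configuration.

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

# Hypothetical Batch account credentials and endpoint.
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
batch_client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.eastus.batch.azure.com"
)

# Hypothetical container image bundling the model and processing code;
# the pool's VMs must be configured with a container runtime.
container_settings = batchmodels.TaskContainerSettings(
    image_name="exampleregistry.azurecr.io/lulc-inference:latest"
)

# One model-deployment task per Sentinel-2 scene (IDs are illustrative).
task = batchmodels.TaskAddParameter(
    id="deploy-scene-001",
    command_line="python deploy_model.py --scene S2A_example_scene",
    container_settings=container_settings,
)
batch_client.task.add(job_id="lulc-2020", task=task)
```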
The data processing pipeline consisted of three primary tasks:

1. Deploy the model over one 100 km x 100 km Sentinel-2 COG by chipping it into hundreds of overlapping 5 km x 5 km smaller images, running those chips through the model, and merging the chips back together.
2. Compute a class-weighted mode across all model predictions for a given Sentinel-2 image footprint.
3. Combine the class-weighted modes produced in task 2 for a given Military Grid Reference System (MGRS) zone into one COG.

IO relied heavily on Batch’s task dependency capabilities, which allowed, for example, the class-weighted mode task (#2) to be scheduled for execution only when the relevant set of model deployment tasks (#1) had completed successfully.
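Continuing the hypothetical sketch above, a dependency of this kind can be expressed by listing prerequisite task IDs on the downstream task; the job must be created with uses_task_dependencies enabled, and all names shown are illustrative.

```python
import azure.batch.models as batchmodels

# IDs of the model-deployment tasks (#1) whose outputs feed one footprint's
# class-weighted mode task (#2); all IDs here are illustrative.
deployment_task_ids = ["deploy-scene-001", "deploy-scene-002", "deploy-scene-003"]

mode_task = batchmodels.TaskAddParameter(
    id="weighted-mode-T10TET",
    command_line="python weighted_mode.py --footprint T10TET",
    depends_on=batchmodels.TaskDependencies(task_ids=deployment_task_ids),
)

# The job itself must have been created with uses_task_dependencies=True;
# Batch then schedules this task only after every listed task succeeds.
batch_client.task.add(job_id="lulc-2020", task=mode_task)
```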
While the model was trained on a GPU-enabled VM, the model deployment over the image chips was executed on CPU-based virtual machines, enabling resource-efficient computation at scale. Due to the task-dependent nature of the pipeline, all tasks needed to run on the same pool, and thus the same VM type. RAM and network bandwidth requirements fluctuated across tasks, but high CPU usage ended up being the defining factor in VM choice. In the end, the data was processed on low-priority Azure Standard D4 v2 virtual machines powered by Intel® Xeon® scalable processors, with seven task slots allocated per node.
It took over one million core hours to process the data for the entire LULC map. With the scaling flexibility of Batch, IO was able to process over 10% of the earth’s surface a day. The completed Esri 2020 Land Cover map is now freely available on Esri Living Atlas and the Microsoft Planetary Computer.
For additional information visit https://www.impactobservatory.com/