Azure Synapse Analytics Blog

5 MIN READ

Introduce a Notebook gallery image to process Geospatial data from Planetary Computer with STAC API

Microsoft

Oct 25, 2022

Introduction

Azure Space team is introducing a spaceborne data processing Notebook, which has been published to Azure Synapse Analytics Gallery. The Notebook uses STAC API (SpatioTemporal Asset Catalog) to search and download geospatial data from Microsoft Planetary Computer to an Azure Storage account and perform basic geospatial transformation.

What is Azure Space?

Azure Space is a set of products and capabilities to make cloud connectivity and compute using spaceborne data and AI that allow you to discover and distribute the most valuable insights. More specifically, Azure Space provides the ability to downlink spaceborne data from Azure Orbital Ground Station (AOGS), first or third-party archives, or customer-acquired data directly into Azure. The raw spaceborne sensor data is converted into analysis-ready data using geospatial data processing pipelines.

What problem are we solving?

Spaceborne data originating from remote sensors are characterized by high volume, high velocity, and a high degree of variety. How to ingest and store massive volume of spaceborne data from different sources, transform and process the data in a repeatable workflow, integrate with best-in-class ML models to drive analysis and extraction of business insights is the core problem we strive to solve. The notebook demonstrates the capability to tackle ingesting and transforming spaceborne data in an automated workflow. We’ll cover integration of AI/ML in a future blog.

Simple Image Procurement and Processing from Planetary Computer STAC API.

We will run through the steps involved in getting data from an area of interest that we define, downloading it and performing a simple geospatial transformation. Please check out the demo video for a detailed walkthrough of running the Notebook in Synapse Studio.

i. Pre-requisites

Before starting the steps in this Notebook, the following are a few things to setup:

a. Storage Account (Gen 2)

ii. Setup

This Notebook requires that your Synapse Workspace instance be setup with:

a. Python Packages Install through requirements.yml file

Upload the requirements.yml below to your Spark Pool to install the required Python package pre-requisites by following the step in this document

name: demo 
channels: 
  - conda-forge 
  - defaults 
dependencies: 
  - pystac-client 
  - rich 
  - planetary-computer 
  - gdal 
  - pip: 
    - geopandas

b. Create a Linked Service to the Storage Account by following the steps in this document

c. Create a storage account container called my-container in your storage account.

Reading Data from the STAC API

The Planetary Computer catalogs the datasets we host using the STAC (SpatioTempoaral Asset Catalog) specification. It provides a STAC API endpoint that can be used to search the datasets by space, time and more.

from pystac_client import Client

catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

Mount Azure Storage Account

Azure Synapse Analytics has two new mount/unmount APIs in mssparkutils package, you can use to mount / attach remote storage (Blob, Gen2, Azure File Share) to all working nodes (driver node and workder nodes), after that, you can access data in storage as if they were one of the local file system with local file API.

Substitute the values of your Storage account, Container and Linked Service in the next code section.

from notebookutils import mssparkutils

mssparkutils.fs.unmount("/test")
mssparkutils.fs.mount( 
    "abfss://my-container@storage-account-name.dfs.core.windows.net", 
    "/test", 
    {"linkedService":"mygen2account"} 
)

True

Searching

Using STAC API, you can search for assets meeting certain criteria. This might include the most common use cases for Geospatial data like spatial extent, date and time to other not so common attributes like cloud cover.

In this example, we'll search for imagery from Landsat Collection 2 Level-2 area around Houston, Texas.

time_range = "2021-12-01/2022-12-31"  
  
#  Bounding box must be within (-180, -90, 180, 90)  
bbox = [-95.42668576006342, 29.703952013356414, -95.27826526902945, 29.745286282512772]  
  
search = catalog.search(collections=["sentinel-2-l2a"], bbox=bbox, datetime=time_range)  
items = search.get_all_items()  
  
len(items)

100

Each pystac.Item <https://pystac.readthedocs.io/en/stable/api/pystac.html#pystac.Item>__ in this ItemCollection includes all the metadata for that scene. STAC Items are GeoJSON features, and so can be loaded by libraries like geopandas.

import geopandas  
  
df = geopandas.GeoDataFrame.from_features(items.to_dict(), crs="epsg:4326")  
df

Next, sort the items by cloudiness and select one with low cloudiness

selected_item = min(items, key=lambda item: item.properties["eo:cloud_cover"])  
selected_item

Next, we will load the preview image from the selected item and display it using the Image package. This will shows us a preview of the area of interest.

from IPython.display import Image

Image(url=selected_item.assets["rendered_preview"].href, width=500)

import rich.table

table = rich.table.Table("Asset Key", "Descripiption")
for asset_key, asset in selected_item.assets.items():
    # print(f"{asset_key:<25} - {asset.title}")
    table.add_row(asset_key, asset.title)

table

┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Asset Key          ┃ Descripiption                         ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ AOT                │ Aerosol optical thickness (AOT)       │
│ B01                │ Band 1 - Coastal aerosol - 60m        │
│ B02                │ Band 2 - Blue - 10m                   │
│ B03                │ Band 3 - Green - 10m                  │
│ B04                │ Band 4 - Red - 10m                    │
│ B05                │ Band 5 - Vegetation red edge 1 - 20m  │
│ B06                │ Band 6 - Vegetation red edge 2 - 20m  │
│ B07                │ Band 7 - Vegetation red edge 3 - 20m  │
│ B08                │ Band 8 - NIR - 10m                    │
│ B09                │ Band 9 - Water vapor - 60m            │
│ B11                │ Band 11 - SWIR (1.6) - 20m            │
│ B12                │ Band 12 - SWIR (2.2) - 20m            │
│ B8A                │ Band 8A - Vegetation red edge 4 - 20m │
│ SCL                │ Scene classfication map (SCL)         │
│ WVP                │ Water vapour (WVP)                    │
│ visual             │ True color image                      │
│ preview            │ Thumbnail                             │
│ safe-manifest      │ SAFE manifest                         │
│ granule-metadata   │ Granule metadata                      │
│ inspire-metadata   │ INSPIRE metadata                      │
│ product-metadata   │ Product metadata                      │
│ datastrip-metadata │ Datastrip metadata                    │
│ tilejson           │ TileJSON with default rendering       │
│ rendered_preview   │ Rendered preview                      │
└────────────────────┴───────────────────────────────────────┘

Download the Raster Data

GeoTiff cannot be downloaded from Planetary Computer using the link directly as it results in a 404. That's because the Planetary Computer uses Azure Blob Storage SAS Tokens to enable access to the data.

To get a token for access, you can use the Planetary Computer's Data Authentication API. You can access that anonymously, or you can provide an API Key for higher rate limits and longer-lived token.s

Using the SAS Token generated via Planetary Computer library, we download the data to the Storage Account for further processing.

import planetary_computer  
import requests  
  
jobId = mssparkutils.env.getJobId()  
  
signed_href = planetary_computer.sign(selected_item).assets["visual"].href  
  
print(signed_href)  
  
response = requests.get(signed_href)  
  
open("/synfs/%s/test/sample_image.tif" % jobId, 'wb').write(response.content)

https://sentinel2l2a01.blob.core.windows.net/...

Read the Raster Data Information

We start by reading the GDAL data information. To do this, we use gdal.info() method on the downloaded GeoTiff file. The gdalinfo will report on format driver used to access the file, raster size, coordinate system and so on.

Raster data in the form of GeoTiff will be retrieve from the storage account, where we saved in the previous step by directly calling the STAC API. Since the Storage Account is mounted we will use the mounted file path that starts with the prefix synfs. Storage account is mounted under a jobId specific to the run.

from osgeo import gdal  

gdal.UseExceptions()
dataset_info = gdal.Info("/synfs/%s/test/sample_image.tif" % jobId) 

print(dataset_info)

Convert GeoTiff to PNG

The conversion of GeoTiff to PNG can be performed using GDAL Translate. The gdal translate utility can be used to convert raster data between different formats, potentially performing some operations like subsettings, resampling and rescaling pixels in the process.

The resultant PNG file is saved back to the Storage Account.

tiff_in = "/synfs/%s/test/sample_image.tif" % jobId  
png_out = "/synfs/%s/test/sample_image.png" % jobId  
  
options = gdal.TranslateOptions(format='PNG')  
  
gdal.Translate(png_out, tiff_in, options=options)

<osgeo.gdal.Dataset; proxy of <Swig Object of type 'GDALDatasetShadow *' at 0x7f384effd120> >

Updated Nov 02, 2022

Version 4.0

Microsoft

Joined November 02, 2018

View Profile

Azure Synapse Analytics Blog

Follow this blog board to get notified when there's new activity