How To Convert Parquet Files into GeoJson Files and Save it in Data Lake using Synapse Notebooks

Microsoft

Mar 14, 2023

Introduction:

GeoJSON files contain various types of geospatial data, such as point, line, and polygon features, as well as metadata and attributes. They can be used for a variety of purposes, such as creating interactive maps, analyzing spatial patterns, and visualizing geospatial data.
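As a concrete illustration of the structure described above, here is a minimal sketch of a GeoJSON FeatureCollection holding a single point feature, built with Python's standard json module. The place name and coordinates are made-up illustration values and are not part of the mock data used later in this blog. Note that GeoJSON positions list longitude first, then latitude.

```python
import json

# Illustrative values only (assumed for this sketch, not taken from the blog's data).
feature = {
    "type": "Feature",
    # GeoJSON coordinate order is [longitude, latitude]
    "geometry": {"type": "Point", "coordinates": [4.3517, 50.8503]},
    # Attributes of the feature live under "properties"
    "properties": {"name": "Brussels"},
}
feature_collection = {"type": "FeatureCollection", "features": [feature]}

geojson_text = json.dumps(feature_collection, indent=2)
print(geojson_text)
```

Line and polygon features follow the same shape, with "LineString" or "Polygon" as the geometry type and nested coordinate lists.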

In this blog, we will discuss how to transform Parquet files into GeoJSON files using a Synapse notebook. This is a workaround, since this transformation is not currently supported by the Copy activity in Azure Synapse pipelines.

Prerequisites:

Basic knowledge of Azure Synapse Analytics.
Workspace in Azure Synapse Analytics.
Storage account (in this blog, we are using ADLS) linked to the Synapse workspace.
Python and PySpark knowledge.
Mock data (in this example, a Parquet file that was generated from a CSV containing 3 columns: name, latitude, and longitude; the code in this blog refers to the two coordinate columns as lat and lng).

Step 1: Create a Notebook in Azure Synapse Workspace

To create a notebook in an Azure Synapse workspace, open Synapse Studio, navigate to the Develop tab, and select Notebooks. From there, you can create a new notebook.

Step 2: Attach a Spark Pool to the Notebook

You can create your own Spark pool or attach the default one.
In the language drop-down list, select PySpark.

In the notebook, open a code cell to install the packages that we will use later on:

%pip install geojson geopandas
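Before moving on, it can help to confirm that the packages from the install command above are visible to the current session. The following is a minimal sketch using only the Python standard library; the helper name missing_packages is hypothetical and introduced here purely for illustration.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Package names taken from the install command above.
missing = missing_packages(["geojson", "geopandas"])
if missing:
    print("Not available in this session:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If a package is reported as missing, rerun the install cell before continuing.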

Next, open another code cell. In this cell, we will generate a GeoPandas DataFrame from the Parquet files.

%%pyspark
from pyspark.sql import SparkSession
from notebookutils import mssparkutils
from geojson import Feature, FeatureCollection, Point , dump
import pandas as pd
import geopandas
import json

blob_account_name = "XXXX"
blob_container_name = "XXX"

sc = SparkSession.builder.getOrCreate()
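token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
# "AzureBlobStorage" is the name of the linked service that points to the storage account
blob_sas_token = token_library.getConnectionString("AzureBlobStorage")
# f-string so that the container and account names are actually substituted into the path
output_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/output/test.geojson'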

spark.conf.set(
    'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
    blob_sas_token)
# f-string so that the container and account names are actually substituted into the path
df = spark.read.load(f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/staging/testVlaamse.parquet', format='parquet')
pdf = df.toPandas()


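# Convert the pandas DataFrame into a GeoPandas DataFrame.
# This code expects the coordinate columns to be named 'lat' and 'lng'; rename them to match your data.

# All the other columns are used as properties, one dict per row
properties = pdf.drop(['lat', 'lng'], axis=1).to_dict('records')

# One GeoJSON Feature per row: a Point (longitude first, then latitude) with that row's properties attached
features = [
    Feature(geometry=Point((float(lng), float(lat))), properties=props)
    for lat, lng, props in zip(pdf['lat'], pdf['lng'], properties)
]

# Whole GeoJSON object
feature_collection = FeatureCollection(features)

gdf = geopandas.GeoDataFrame.from_features(feature_collection['features'])

print(gdf)  # Check the GeoPandas DataFrame structure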

The printed output should show the GeoPandas DataFrame, including a geometry column that holds one point per row.
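Separately from the notebook output, the row-to-feature mapping at the heart of this step can be sketched in plain Python without Spark or GeoPandas. This is only an illustration under stated assumptions: the function name rows_to_feature_collection, the sample rows, and the coordinate range check are all introduced here and are not part of the original notebook.

```python
import json

def rows_to_feature_collection(rows, lat_key="lat", lng_key="lng"):
    """Convert a list of dict rows into a GeoJSON FeatureCollection dict."""
    features = []
    for row in rows:
        lat, lng = float(row[lat_key]), float(row[lng_key])
        # Swapped latitude/longitude is a common mistake, so check the ranges
        if not (-90 <= lat <= 90 and -180 <= lng <= 180):
            raise ValueError(f"Coordinates out of range: lat={lat}, lng={lng}")
        # Every non-coordinate column becomes a property of the feature
        properties = {k: v for k, v in row.items() if k not in (lat_key, lng_key)}
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lng, lat]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}

# Made-up sample rows for illustration only.
sample_rows = [
    {"name": "Site A", "lat": 51.05, "lng": 3.72},
    {"name": "Site B", "lat": 51.22, "lng": 4.40},
]
collection = rows_to_feature_collection(sample_rows)
print(json.dumps(collection, indent=2))
```

Keeping the properties on each feature, rather than on the collection, is what lets tools that read the file show the attributes next to each point.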

Open another code cell and let's use the Spark utilities library provided by Microsoft (mssparkutils) to write the GeoPandas DataFrame as a GeoJSON file and save it in Azure Data Lake Storage Gen2.
Unfortunately, writing the GeoPandas DataFrame directly from a Synapse notebook to Azure Data Lake Storage Gen2 is not yet supported. Therefore, we will use a workaround: write the GeoPandas DataFrame into a local temporary file and then copy that file into Azure Data Lake Storage Gen2.

Here's the code for copying the file into Azure Data Lake Gen 2:

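from notebookutils import mssparkutils

tmp_file = 'file:/tmp/temporary/test.geojson'

# Serialize the GeoPandas DataFrame as GeoJSON text (to_json), not as a printed table (to_string).
# The last parameter is set to True to overwrite the file if it already exists.
mssparkutils.fs.put(tmp_file, gdf.to_json(), True)

# Copy the temporary file to the storage account; the f-string substitutes the container and account names
mssparkutils.fs.cp(tmp_file, f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/output/test.geojson')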
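The write-locally-then-copy workaround can also be tried outside Synapse. The sketch below uses only the Python standard library: shutil.copy stands in for mssparkutils.fs.cp, a temporary folder stands in for the Data Lake container, and the function name write_then_copy is hypothetical. It also parses the text first, so that an invalid file is never copied.

```python
import json
import shutil
import tempfile
from pathlib import Path

def write_then_copy(geojson_text, destination_dir, file_name="test.geojson"):
    """Write GeoJSON text to a local temporary file, then copy it to destination_dir."""
    json.loads(geojson_text)  # Fail early if the text is not valid JSON
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_file = Path(tmp_dir) / file_name
        tmp_file.write_text(geojson_text, encoding="utf-8")
        # shutil.copy is a local stand-in for mssparkutils.fs.cp in this sketch
        return Path(shutil.copy(tmp_file, destination_dir))

sample = json.dumps({"type": "FeatureCollection", "features": []})
# The temporary folder below is a stand-in for the Data Lake output folder
with tempfile.TemporaryDirectory() as fake_lake:
    copied = write_then_copy(sample, fake_lake)
    round_trip = json.loads(copied.read_text(encoding="utf-8"))
    print(copied.name, round_trip["type"])
```

Reading the copied file back and parsing it is a cheap way to confirm that what landed in storage is GeoJSON text rather than a printed table.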

Links:

Apache Spark pool concepts - Azure Synapse Analytics | Microsoft Learn

Introduction to Microsoft Spark utilities - Azure Synapse Analytics | Microsoft Learn

Call-to-Action:
If you have any questions, comments, or feedback about this topic, please feel free to share them in the comments section below. Don't forget to subscribe to our blog for more Microsoft-related content and updates.

Updated Mar 07, 2023

Version 1.0

data & ai

Sally_Dabbah

FastTrack for Azure
