Introduction:
GeoJSON files contain various types of geospatial data, such as point, line, and polygon features, as well as metadata and attributes. They can be used for a variety of purposes, such as creating interactive maps, analyzing spatial patterns, and visualizing geospatial data.
In this blog post, we show how to transform Parquet files into GeoJSON files using a Synapse Notebook. This is a workaround, since the transformation is not currently supported by the Copy activity in Azure Synapse pipelines.
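For readers new to the format, here is a minimal sketch of what a GeoJSON file contains, built with the Python standard library only. The name and coordinates are made-up sample values, not data from this walkthrough.

```python
import json

# A hand-written FeatureCollection holding a single Point feature.
# The name and coordinates below are illustrative placeholders.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON positions are ordered [longitude, latitude]
                "coordinates": [4.3517, 50.8503],
            },
            # Attributes of the feature live under "properties"
            "properties": {"name": "Sample location"},
        }
    ],
}

# Serialize to the text that would be stored in a .geojson file
geojson_text = json.dumps(feature_collection, indent=2)
print(geojson_text)
```

Note the coordinate order: longitude comes first, which is why the notebook code later in this post passes lng before lat.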
Prerequisites:
Basic knowledge of Azure Synapse Analytics.
Workspace in Azure Synapse Analytics.
Storage account (in this blog, we are using ADLS) linked to the Synapse workspace.
Python and PySpark knowledge.
Mock data (in this example, a Parquet file that was generated from a CSV containing 3 columns: name, latitude, and longitude; the code in this post refers to the coordinate columns as lat and lng).
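If you need mock data of this shape, the following is a small sketch that produces a matching CSV with the standard library. The rows are invented placeholders, the column names name, lat, and lng are assumed to match the notebook code, and build_mock_csv is a hypothetical helper. Converting the CSV to Parquet is not shown here.

```python
import csv
import io

# Invented sample rows; replace with your own locations.
MOCK_ROWS = [
    {"name": "Location A", "lat": "51.05", "lng": "3.72"},
    {"name": "Location B", "lat": "50.85", "lng": "4.35"},
]


def build_mock_csv(rows):
    """Return CSV text with a header row and one line per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "lat", "lng"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


mock_csv = build_mock_csv(MOCK_ROWS)
print(mock_csv)
```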
Step 1: Create a Notebook in Azure Synapse Workspace
To create a notebook in Azure Synapse Workspace, open Synapse Studio, navigate to the Develop tab, and select Notebooks. From there, you can create a new notebook.
Step 2: Attach a Spark Pool to the Notebook
You can create your own Spark pool or attach the default one.
In the first code tab, install the geojson and geopandas packages:
pip install geojson geopandas
Next, open another code tab. In this tab, we will generate a GeoPandas DataFrame out of the Parquet files.
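Before the notebook cell, it may help to see the core of the conversion without Spark or GeoPandas. This is only a sketch using plain dictionaries: row_to_feature and rows_to_feature_collection are hypothetical helper names, and the sample row is invented. It assumes the columns are called name, lat, and lng.

```python
def row_to_feature(row, lat_key="lat", lng_key="lng"):
    """Turn one record (a dict) into a GeoJSON Feature dict."""
    # Every column except the coordinates becomes a property of the feature
    properties = {k: v for k, v in row.items() if k not in (lat_key, lng_key)}
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON order is [longitude, latitude]
            "coordinates": [float(row[lng_key]), float(row[lat_key])],
        },
        "properties": properties,
    }


def rows_to_feature_collection(rows):
    """Wrap the converted rows in a FeatureCollection dict."""
    return {
        "type": "FeatureCollection",
        "features": [row_to_feature(row) for row in rows],
    }


# Invented sample row for illustration
sample_rows = [{"name": "Location A", "lat": 51.05, "lng": 3.72}]
collection = rows_to_feature_collection(sample_rows)
print(collection)
```

The point to notice is that properties belong to each individual feature, not to the collection as a whole. The notebook cell itself follows.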
%%pyspark
from pyspark.sql import SparkSession
from notebookutils import mssparkutils
from geojson import Feature, FeatureCollection, Point, dump
import pandas as pd
import geopandas
import json
# Storage account and container names (placeholders)
blob_account_name = "XXXX"
blob_container_name = "XXX"
sc = SparkSession.builder.getOrCreate()
# Fetch the SAS token through the linked service referenced as "AzureBlobStorage"
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
blob_sas_token = token_library.getConnectionString("AzureBlobStorage")
# f-strings so that the container and account names are actually substituted
output_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/output/test.geojson'
spark.conf.set(
    'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
    blob_sas_token)
df = spark.read.load(f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/staging/testVlaamse.parquet', format='parquet')
# Collect the data to the driver as a pandas DataFrame (suitable for small datasets)
pdf = df.toPandas()
# Convert the pandas DataFrame into a GeoPandas DataFrame
# Each row becomes a Point feature; all the other columns are used as its properties
features = pdf.apply(
    lambda row: Feature(
        geometry=Point((float(row['lng']), float(row['lat']))),
        properties=row.drop(['lat', 'lng']).to_dict()),
    axis=1).tolist()
# Whole GeoJSON object
feature_collection = FeatureCollection(features)
gdf = geopandas.GeoDataFrame.from_features(feature_collection['features'])
print(gdf)  # check the GeoPandas DataFrame structure
The printed output should show the GeoPandas DataFrame, including its geometry column.
Open another code tab and use the Spark utils library provided by Microsoft to write the GeoPandas DataFrame as a GeoJSON file and save it in Azure Data Lake Gen 2.
Unfortunately, copying the GeoPandas DataFrame directly from Synapse Notebook to Azure Data Lake Gen 2 is not yet supported. Therefore, we will use a workaround by writing the GeoPandas DataFrame into a local temporary file and then copying the file into Azure Data Lake Gen 2.
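The two-step pattern can be tried locally first. The sketch below mirrors it with the standard library only: shutil.copy stands in for the Spark utils copy, a local folder stands in for the data lake, and write_then_copy is a hypothetical helper name.

```python
import json
import shutil
import tempfile
from pathlib import Path


def write_then_copy(geojson_text, destination_dir, file_name="test.geojson"):
    """Write text to a temporary file, then copy that file to destination_dir."""
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Step 1: write the serialized GeoJSON to a local temporary file
        tmp_file = Path(tmp_dir) / file_name
        tmp_file.write_text(geojson_text, encoding="utf-8")
        # Step 2: copy the temporary file to its final location
        return Path(shutil.copy(tmp_file, destination_dir / file_name))


# Usage with an empty FeatureCollection and a throwaway output folder
with tempfile.TemporaryDirectory() as out_dir:
    sample = json.dumps({"type": "FeatureCollection", "features": []})
    copied = write_then_copy(sample, out_dir)
    copied_text = copied.read_text(encoding="utf-8")

print(copied_text)
```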
Here's the code for copying the file into Azure Data Lake Gen 2:
from notebookutils import mssparkutils
tmp_file = 'file:/tmp/temporary/test.geojson'
# Serialize the GeoPandas DataFrame as GeoJSON; set the last parameter to True to overwrite the file if it already exists
mssparkutils.fs.put(tmp_file, gdf.to_json(), True)
# Copy the temporary file to the output folder (f-string so that the names are substituted)
mssparkutils.fs.cp(tmp_file, f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/output')
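Once the file is written, it can be worth checking that its content really is GeoJSON and not, for example, the printed form of a DataFrame. The following is a minimal sketch of such a check using only the standard library. check_point_feature_collection is a hypothetical helper, it only covers Point features, and reading the file back from storage is left out.

```python
import json


def check_point_feature_collection(geojson_text):
    """Return the number of features; raise ValueError on structural problems."""
    data = json.loads(geojson_text)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict) or data.get("type") != "FeatureCollection":
        raise ValueError("top-level type is not FeatureCollection")
    features = data.get("features")
    if not isinstance(features, list):
        raise ValueError("'features' is missing or not a list")
    for index, feature in enumerate(features):
        geometry = feature.get("geometry") or {}
        coords = geometry.get("coordinates")
        if geometry.get("type") != "Point" or not isinstance(coords, list) or len(coords) < 2:
            raise ValueError(f"feature {index} is not a valid Point")
        lng, lat = coords[0], coords[1]
        # Longitude comes first in GeoJSON; swapped values often fail this range check
        if not (-180 <= lng <= 180 and -90 <= lat <= 90):
            raise ValueError(f"feature {index} has out-of-range coordinates")
    return len(features)


# Usage with an invented single-feature document
valid_text = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [3.72, 51.05]},
        "properties": {"name": "Location A"},
    }],
})
feature_count = check_point_feature_collection(valid_text)
print(feature_count)
```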
Links:
Apache Spark pool concepts - Azure Synapse Analytics | Microsoft Learn
Introduction to Microsoft Spark utilities - Azure Synapse Analytics | Microsoft Learn
Call-to-Action:
If you have any questions, comments, or feedback about this topic, please feel free to share them in the comments section below. Don't forget to subscribe to our blog for more Microsoft-related content and updates.