Introduction:
GeoJSON is a JSON-based format for geospatial data. A GeoJSON file can hold point, line, and polygon features together with their attributes and metadata. Such files are used for creating interactive maps, analyzing spatial patterns, and visualizing geospatial data.
In this blog, we will discuss how to transform Parquet files into GeoJSON files using a Synapse notebook. This is a workaround, because the Copy activity in Azure Synapse pipelines does not currently support this transformation.
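To make the format concrete, here is a minimal sketch of a GeoJSON FeatureCollection holding a single point feature, built with Python's standard json module. The place name and coordinates are illustrative values, not data from this blog. Note that GeoJSON stores coordinates in longitude, latitude order.

```python
import json

# Illustrative values only: a single point with one attribute.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Point",
        # GeoJSON (RFC 7946) orders coordinates as [longitude, latitude].
        "coordinates": [4.3517, 50.8503],
    },
    "properties": {"name": "Brussels"},
}

feature_collection = {"type": "FeatureCollection", "features": [feature]}

geojson_text = json.dumps(feature_collection, indent=2)
print(geojson_text)
```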
Prerequisites:
- Basic knowledge in Azure Synapse Analytics.
- Workspace in Azure Synapse Analytics.
- Storage account (in this blog, we are using ADLS) linked to the Synapse workspace.
- Python and PySpark knowledge.
- Mock data (in this example, a Parquet file that was generated from a CSV containing 3 columns: name, latitude, and longitude).
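If you need mock data to follow along, the sketch below writes a small CSV with the standard library. The column names `name`, `lat`, and `lng` are an assumption, chosen to match the column names that the conversion code in this post reads. The rows and the helper names `write_mock_csv` and `csv_to_parquet` are made up for illustration. The Parquet conversion assumes pandas is installed together with a Parquet engine such as pyarrow.

```python
import csv
import os
import tempfile

# Hypothetical sample rows: (name, lat, lng).
MOCK_ROWS = [
    ("Brussels", 50.8503, 4.3517),
    ("Antwerp", 51.2194, 4.4025),
    ("Ghent", 51.0543, 3.7174),
]


def write_mock_csv(path):
    """Write the mock rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "lat", "lng"])
        writer.writerows(MOCK_ROWS)
    return path


def csv_to_parquet(csv_path, parquet_path):
    """Convert the CSV to Parquet (needs pandas plus pyarrow or fastparquet)."""
    import pandas as pd  # imported lazily so the CSV step works without pandas

    pd.read_csv(csv_path).to_parquet(parquet_path, index=False)
    return parquet_path


csv_path = write_mock_csv(os.path.join(tempfile.mkdtemp(), "mock_points.csv"))

# Read the header back to confirm the file was written as expected.
with open(csv_path, encoding="utf-8") as handle:
    header = handle.readline().strip()
print(header)
```

Calling `csv_to_parquet(csv_path, "mock_points.parquet")` then produces a Parquet file that can be uploaded to the storage account.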
Step 1: Create a Notebook in Azure Synapse Workspace
To create a notebook in Azure Synapse Workspace, click on Synapse Studio, then navigate to the Develop tab, and select Notebooks. From there, you can create a new notebook.
Step 2: Attach a Spark Pool to the Notebook
You can create your own Spark pool or attach the default one.
- In the language drop-down list, select PySpark.
- In the notebook, open a code tab to install all the relevant packages that we will use later on:
pip install geojson geopandas
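Before moving on, it can help to confirm that the install succeeded in the current session. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# The two packages installed above; an empty list means both are importable.
print(missing_packages(["geojson", "geopandas"]))
```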
Next, open another code tab. In this tab, we will generate a GeoPandas DataFrame out of the Parquet files.
%%pyspark
from pyspark.sql import SparkSession
from geojson import Feature, FeatureCollection, Point
import geopandas

blob_account_name = "XXXX"
blob_container_name = "XXX"

# Fetch the SAS token through the linked service named "AzureBlobStorage"
sc = SparkSession.builder.getOrCreate()
token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
blob_sas_token = token_library.getConnectionString("AzureBlobStorage")

# f-strings substitute the container and account names into the paths
output_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/output/test.geojson'
spark.conf.set(
    'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
    blob_sas_token)

df = spark.read.load(
    f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/staging/testVlaamse.parquet',
    format='parquet')
pdf = df.toPandas()

# Converting Pandas DF into geoPandasDF
# Assumes the coordinate columns are named 'lat' and 'lng'; adjust if yours differ
# all the other columns are used as properties, one dict per row
properties = pdf.drop(['lat', 'lng'], axis=1).to_dict('records')
features = [
    Feature(geometry=Point((float(lng), float(lat))), properties=props)
    for lat, lng, props in zip(pdf['lat'], pdf['lng'], properties)
]

# whole geojson object
feature_collection = FeatureCollection(features)
gdf = geopandas.GeoDataFrame.from_features(feature_collection['features'])
print(gdf)  # checking geopandas dataframe structure
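In GeoJSON, attributes live in each feature's own `properties` member, so every row's non-coordinate columns should travel with that row's point. As a library-free sketch of this mapping, assuming rows arrive as dictionaries with `lat` and `lng` keys (the helper name `rows_to_feature_collection` and the sample row are hypothetical):

```python
import json


def rows_to_feature_collection(rows, lat_key="lat", lng_key="lng"):
    """Turn a list of row dicts into a GeoJSON FeatureCollection dict."""
    features = []
    for row in rows:
        # Everything except the coordinate columns becomes a property.
        properties = {k: v for k, v in row.items() if k not in (lat_key, lng_key)}
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON order is [longitude, latitude].
                "coordinates": [float(row[lng_key]), float(row[lat_key])],
            },
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample row for illustration.
sample_rows = [{"name": "Brussels", "lat": 50.8503, "lng": 4.3517}]
collection = rows_to_feature_collection(sample_rows)
print(json.dumps(collection))
```

With pandas, `pdf.to_dict('records')` yields rows in the shape this helper expects.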
Open another code tab. Here we use Microsoft Spark Utilities (mssparkutils) to write the GeoPandas DataFrame as a GeoJSON file and save it in Azure Data Lake Gen 2.
Unfortunately, writing the GeoPandas DataFrame directly from a Synapse notebook to Azure Data Lake Gen 2 is not yet supported. As a workaround, we write the GeoPandas DataFrame to a local temporary file and then copy that file into Azure Data Lake Gen 2. Here's the code for copying the file into Azure Data Lake Gen 2:
from notebookutils import mssparkutils

tmp_file = 'file:/tmp/temporary/test.geojson'

# to_json() serializes the GeoDataFrame as a GeoJSON FeatureCollection string
# Set the last parameter as True to overwrite the file if it existed already
mssparkutils.fs.put(tmp_file, gdf.to_json(), True)

# Copy the temporary file into the output folder of the storage container
mssparkutils.fs.cp(
    tmp_file,
    f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/output/test.geojson')
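The write-locally-then-copy pattern can also be tried outside Synapse. The sketch below is a stand-in that uses only the standard library: `shutil.copy` plays the role of `mssparkutils.fs.cp`, a local folder plays the role of the storage container, and the helper name `stage_and_copy` is hypothetical. It also reads the copied file back to confirm that it parses as a FeatureCollection.

```python
import json
import os
import shutil
import tempfile


def stage_and_copy(geojson_text, destination_dir, file_name="test.geojson"):
    """Write GeoJSON text to a temporary file, then copy it to its destination."""
    staging_dir = tempfile.mkdtemp()
    staged_path = os.path.join(staging_dir, file_name)
    with open(staged_path, "w", encoding="utf-8") as handle:
        handle.write(geojson_text)

    os.makedirs(destination_dir, exist_ok=True)
    # Local stand-in for copying the staged file into the storage account.
    return shutil.copy(staged_path, destination_dir)


# Hypothetical minimal payload and destination folder.
payload = json.dumps({"type": "FeatureCollection", "features": []})
copied_path = stage_and_copy(payload, os.path.join(tempfile.mkdtemp(), "output"))

# Read the copy back to check that it is valid GeoJSON text.
with open(copied_path, encoding="utf-8") as handle:
    loaded = json.load(handle)
print(loaded["type"])
```

The same read-back check is a reasonable sanity test for the real output file once it has been downloaded from the storage account.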
Links:
Apache Spark pool concepts - Azure Synapse Analytics | Microsoft Learn
Introduction to Microsoft Spark utilities - Azure Synapse Analytics | Microsoft Learn
Call-to-Action:
If you have any questions, comments, or feedback about this topic, please feel free to share them in the comments section below. Don't forget to subscribe to our blog for more Microsoft-related content and updates.