How to Convert Parquet Files into GeoJSON Files and Save Them in a Data Lake Using Synapse Notebooks
Published Mar 14 2023 01:10 AM
Microsoft

Introduction:

GeoJSON files contain various types of geospatial data, such as point, line, and polygon features, as well as metadata and attributes. They can be used for a variety of purposes, such as creating interactive maps, analyzing spatial patterns, and visualizing geospatial data.
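
For reference, here is a minimal sketch of what a GeoJSON document looks like, built with only Python's standard library. The property value and coordinates are made-up sample values, not data from this post. Note that GeoJSON orders point coordinates as longitude first, then latitude.

```python
import json

# A single point feature; GeoJSON orders coordinates as [longitude, latitude].
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [4.3517, 50.8503]},
    "properties": {"name": "Sample location"},  # made-up attribute
}

# A FeatureCollection wraps a list of features.
feature_collection = {"type": "FeatureCollection", "features": [feature]}

geojson_text = json.dumps(feature_collection, indent=2)
print(geojson_text)
```

Attributes live under `properties`, while the shape itself lives under `geometry`; line and polygon features follow the same layout with a different geometry type.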

In this blog, we will discuss how to transform Parquet files into GeoJSON files using a Synapse notebook. This is a workaround, since the Copy activity in Azure Synapse pipelines does not currently support this transformation.

 

Prerequisites:

  1. Basic knowledge of Azure Synapse Analytics.

  2. A workspace in Azure Synapse Analytics.

  3. A storage account (in this blog, we are using ADLS) linked to the Synapse workspace.

  4. Python and PySpark knowledge.

  5. Mock data (in this example, a Parquet file that was generated from a CSV containing 3 columns: name, latitude (lat), and longitude (lng)).
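
If you need mock data of this shape, the following is a minimal sketch that builds the CSV text with the standard library. The place names and approximate coordinates are sample values chosen for illustration, and the header names `lat` and `lng` are an assumption that matches the column names used by the conversion code in this post.

```python
import csv
import io

# Hypothetical sample rows; coordinates are approximate and for illustration only.
rows = [
    {"name": "Brussels", "lat": 50.8503, "lng": 4.3517},
    {"name": "Antwerp", "lat": 51.2194, "lng": 4.4025},
    {"name": "Ghent", "lat": 51.0543, "lng": 3.7174},
]

def build_mock_csv(records):
    """Return CSV text with a header row followed by one line per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "lat", "lng"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

mock_csv = build_mock_csv(rows)
print(mock_csv)
```

One way to turn such a CSV into a Parquet file is pandas (`pandas.read_csv(...)` followed by `DataFrame.to_parquet(...)`), which needs a Parquet engine such as pyarrow to be installed.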


 

Step 1: Create a Notebook in Azure Synapse Workspace

To create a notebook in an Azure Synapse workspace, open Synapse Studio, navigate to the Develop tab, and select Notebooks. From there, you can create a new notebook.

 

 

Step 2:

  • Attach a Spark Pool to the Notebook

    You can create your own Spark pool or attach the default one. 


  • In the language drop-down list, select PySpark.

     

  • In the notebook, open a code cell and install the packages that we will use later on:
    %pip install geojson geopandas

    Next, open another code cell. In this cell, we will build a GeoPandas DataFrame from the Parquet file.

    %%pyspark
    from pyspark.sql import SparkSession
    from notebookutils import mssparkutils
    from geojson import Feature, FeatureCollection, Point , dump
    import pandas as pd
    import geopandas
    import json
    
    blob_account_name = "XXXX"
    blob_container_name = "XXX"
    
    sc = SparkSession.builder.getOrCreate()
    token_library = sc._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
    # "AzureBlobStorage" is the linked service name; replace it with your own
    blob_sas_token = token_library.getConnectionString("AzureBlobStorage")
    # f-strings substitute the container and account names into the URLs
    output_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/output/test.geojson'
    
    spark.conf.set(
        'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
        blob_sas_token)
    df = spark.read.load(f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/staging/testVlaamse.parquet', format='parquet')
    pdf = df.toPandas()
    
    
    # Convert the pandas DataFrame into GeoJSON features.
    # GeoJSON points are (longitude, latitude); every other column becomes a property.
    # The coordinate columns are expected to be named 'lat' and 'lng'.
    features = pdf.apply(
        lambda row: Feature(
            geometry=Point((float(row['lng']), float(row['lat']))),
            properties=row.drop(['lat', 'lng']).to_dict()),
        axis=1).tolist()
    
    # Whole GeoJSON object
    feature_collection = FeatureCollection(features)
    
    gdf = geopandas.GeoDataFrame.from_features(feature_collection['features'])
    
    print(gdf)  # check the GeoPandas DataFrame structure
    The output should look like this:
    [Screenshot: printed GeoPandas DataFrame]
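
The core of the conversion, turning each row into a point feature whose remaining columns become its properties, can also be sketched without Spark or GeoPandas. This is a minimal sketch, assuming the rows are plain dictionaries with `lat` and `lng` keys; the function name `rows_to_feature_collection` and the sample rows are hypothetical.

```python
import json

def rows_to_feature_collection(rows, lat_key="lat", lng_key="lng"):
    """Convert row dictionaries into a GeoJSON FeatureCollection dictionary."""
    features = []
    for row in rows:
        # Every column except the coordinates becomes a feature property.
        properties = {k: v for k, v in row.items() if k not in (lat_key, lng_key)}
        features.append({
            "type": "Feature",
            # GeoJSON expects [longitude, latitude], in that order.
            "geometry": {
                "type": "Point",
                "coordinates": [float(row[lng_key]), float(row[lat_key])],
            },
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}

# Hypothetical sample rows with coordinates stored as strings.
sample_rows = [
    {"name": "A", "lat": "50.85", "lng": "4.35"},
    {"name": "B", "lat": "51.22", "lng": "4.40"},
]
collection = rows_to_feature_collection(sample_rows)
print(json.dumps(collection, indent=2))
```

Attaching the properties to each feature, rather than to the collection as a whole, is what lets the attribute columns survive the conversion.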

    Open another code cell and use the Spark utilities library provided by Microsoft (mssparkutils) to write the GeoPandas DataFrame as a GeoJSON file and save it in Azure Data Lake Gen 2.
    Unfortunately, writing a GeoPandas DataFrame directly from a Synapse notebook to Azure Data Lake Gen 2 is not yet supported. As a workaround, we will write the GeoPandas DataFrame to a local temporary file and then copy that file into Azure Data Lake Gen 2.

    Here's the code for copying the file into Azure Data Lake Gen 2:


    from notebookutils import mssparkutils
    tmp_file = 'file:/tmp/temporary/test.geojson'
    
    # to_json() serializes the GeoDataFrame as a GeoJSON FeatureCollection string
    mssparkutils.fs.put(tmp_file, gdf.to_json(), True) # Set the last parameter as True to overwrite the file if it existed already
    # Copy the local file into the output folder of the storage container
    mssparkutils.fs.cp(tmp_file, f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/output/test.geojson')
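
The write-locally-then-copy pattern can be tried outside Synapse with the standard library. In this sketch, `shutil.copy` stands in for `mssparkutils.fs.cp` and a temporary directory stands in for the storage container; both are assumptions for illustration, as is the helper name `stage_and_copy`.

```python
import json
import os
import shutil
import tempfile

def stage_and_copy(geojson_text, target_dir, file_name="test.geojson"):
    """Write GeoJSON text to a local staging file, then copy it to target_dir."""
    staging_dir = tempfile.mkdtemp()
    try:
        staged_path = os.path.join(staging_dir, file_name)
        with open(staged_path, "w", encoding="utf-8") as handle:
            handle.write(geojson_text)
        # Stand-in for mssparkutils.fs.cp(local_path, remote_path).
        return shutil.copy(staged_path, os.path.join(target_dir, file_name))
    finally:
        # Remove the staging directory once the copy is done (or has failed).
        shutil.rmtree(staging_dir, ignore_errors=True)

sample = json.dumps({"type": "FeatureCollection", "features": []})
target_dir = tempfile.mkdtemp()  # stand-in for the output container
copied_path = stage_and_copy(sample, target_dir)

# Read the copy back to confirm it is valid JSON with the expected type.
with open(copied_path, encoding="utf-8") as handle:
    round_trip = json.load(handle)
print(round_trip["type"])
```

Reading the copied file back is a cheap check that the destination holds parseable GeoJSON rather than a printed table or a truncated file.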

Links:

Apache Spark pool concepts - Azure Synapse Analytics | Microsoft Learn 

Introduction to Microsoft Spark utilities - Azure Synapse Analytics | Microsoft Learn 

 

Call-to-Action:
If you have any questions, comments, or feedback about this topic, please feel free to share them in the comments section below. Don't forget to subscribe to our blog for more Microsoft-related content and updates.

Last update: Mar 06 2023 11:41 PM