Read & write parquet files using Apache Spark in Azure Synapse Analytics

Published Jun 11 2020 04:45 AM
Microsoft

Apache Spark in Azure Synapse Analytics enables you to easily read and write parquet files stored on Azure storage. Apache Spark provides the following concepts that you can use to work with parquet files:

 

  • spark.read.parquet function that reads the content of a parquet file using PySpark
  • DataFrame.write.parquet function that writes the content of a data frame into a parquet file using PySpark
  • External table that enables you to select or insert data in parquet file(s) using Spark SQL.

In the following sections you will see how you can use these concepts to explore the content of files and write new data in the parquet file. As a prerequisite, you need to have:

- An Azure storage account (deltaformatdemostorage.dfs.core.windows.net in the examples below) with a container (parquet in the examples below) where your Azure AD user has read/write permissions

- An Azure Synapse workspace with an Apache Spark pool created in it.

Writing parquet files 

PySpark enables you to create objects, load them into a data frame, and store them on Azure storage using the DataFrame.write.parquet() function:

 

# Define content
from pyspark.sql import Row

Employee = Row("firstName", "lastName", "email", "salary")

employee1 = Employee('Јован', 'Поповић', 'jovan.popovic@contoso.com', 100000)
employee2 = Employee('John', 'Doe', 'john.doe@contoso.com', 120000 )
employee3 = Employee('', None, 'casper@contoso.com', 160000 )
employee4 = Employee('Confucius', '孔子', 'confucius@contoso.com', 160000 )

employees = [employee1, employee2, employee3, employee4]
df = spark.createDataFrame(employees)
display(df)

df.write.parquet("abfss://parquet@deltaformatdemostorage.dfs.core.windows.net/employees")

 

Note that this code will create a set of parquet files (one file per data-frame partition) under the employees folder on Azure Storage.

Reading parquet files

Once you create a parquet file, you can read its content using the spark.read.parquet() function:

 

# read content of file
df = spark.read.parquet('abfss://parquet@deltaformatdemostorage.dfs.core.windows.net/employees')

df.show(10)

 

You can execute this query and see its results directly in a Synapse Studio notebook.

Creating tables on parquet files

Apache Spark enables you to access your parquet files using the table API. You can create an external table on a set of parquet files using the following code:

 

%%sql
CREATE TABLE employees USING PARQUET
LOCATION 'abfss://parquet@deltaformatdemostorage.dfs.core.windows.net/employees'

 

Once you have created your external table, you can read the content of the parquet files using the Spark SQL language:

 

%%sql
SELECT *
FROM employees

 

You can also insert new records into the parquet files using INSERT statement:

 

%%sql
INSERT INTO employees 
VALUES ('Nikola', 'Tesla', 'nikola.tesla@contoso.com', 110000)

 

 

NOTE: Apache Spark doesn't enable you to update or delete records in parquet tables. You need to convert parquet to Delta format if you want to update the content of your parquet files.

 

Spark SQL provides concepts like tables and the SQL query language that can simplify your access code.

 

Conclusion

The Apache Spark engine in Azure Synapse Analytics enables you to easily process your parquet files on Azure Storage. Learn more about the capabilities of the Apache Spark engine in Azure Synapse Analytics in the documentation.

 
