Apache Spark in Azure Synapse Analytics enables you to easily read and write parquet files placed on Azure Storage. Apache Spark provides the following concepts that you can use to work with parquet files:
- `spark.read.parquet` function that reads the content of a parquet file into a DataFrame using PySpark
- `DataFrame.write.parquet` function that writes the content of a DataFrame into a parquet file using PySpark
- External tables that enable you to select or insert data in parquet file(s) using Spark SQL
In the following sections you will see how you can use these concepts to explore the content of files and write new data to parquet files. As a prerequisite, you need to have:
- An Azure storage account (deltaformatdemostorage.dfs.core.windows.net in the examples below) with a container (parquet in the examples below) where your Azure AD user has read/write permissions
Apache Spark enables you to access your parquet files using the table API. You can create an external table on a set of parquet files using the following code:
```sql
-- The LOCATION clause points at the folder that holds the parquet files
-- (the employees path below is illustrative).
CREATE TABLE employees
USING PARQUET
LOCATION 'abfss://parquet@deltaformatdemostorage.dfs.core.windows.net/employees'
```
Once you have created your external table, you can read the content of the parquet files using the Spark SQL language:
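For example, a simple query over the table (the column names here are illustrative; use whatever columns your parquet files contain):

```sql
SELECT first_name, last_name, salary
FROM employees
WHERE salary > 100000
```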
You can also insert new records into the parquet files using the INSERT statement:
```sql
INSERT INTO employees
VALUES ('Nikola', 'Tesla', 110000, 'nikola.tesla@contoso.com')
```
NOTE: Apache Spark doesn't enable you to update or delete records in parquet tables. You need to convert parquet to Delta format if you want to update the content of parquet files.
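If you need in-place updates or deletes, one option (assuming the Delta Lake support available in Synapse Spark pools) is to convert the parquet folder to Delta format with Delta Lake's `CONVERT TO DELTA` command; the path below reuses the illustrative employees folder from the examples above:

```sql
CONVERT TO DELTA parquet.`abfss://parquet@deltaformatdemostorage.dfs.core.windows.net/employees`
```

After conversion, the folder is a Delta table and supports `UPDATE`, `DELETE`, and `MERGE` statements.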
Spark SQL provides concepts like tables and the SQL query language that can simplify your access code.

The Apache Spark engine in Azure Synapse Analytics enables you to easily process your parquet files on Azure Storage. Learn more about the capabilities of the Apache Spark engine in Azure Synapse Analytics in the documentation.