Forum Discussion
dubeyneeraj
Aug 04, 2023 · Copper Contributor
Copy CSV files with varying schema as Parquet from Azure Synapse Link to another data lake
Hi,
We have an Azure Synapse Link in-place update profile created. As a result, files (CSV format) are periodically synced into the Azure Synapse workspace, and the schema of the files is maintained in model.json.
We want to copy these files as Parquet to another Data Lake Storage account using a notebook.
Challenge: when the schema of an entity is updated (new columns added), the schema of its subsequent CSV files changes as well. So we have multiple CSV files with varying schemas, while model.json only holds the most recent schema.
How can we copy these files?
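
To illustrate the kind of schema reconciliation we think is needed, here is a minimal sketch (not our actual code) that pads missing columns with nulls when combining two snapshots whose schemas drifted. unionByName with allowMissingColumns requires Spark 3.1+; df_old and df_new are placeholder DataFrames.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder snapshots: the second one has a column that was added later
df_old = spark.createDataFrame([(1, "a")], ["id", "name"])
df_new = spark.createDataFrame([(2, "b", "x")], ["id", "name", "extra"])

# allowMissingColumns=True (Spark 3.1+) fills "extra" with null for old rows
combined = df_old.unionByName(df_new, allowMissingColumns=True)
combined.show()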
We tried, but we always get the error: "The number of columns in CSV/parquet file is not equal to the number of fields in Spark StructType. Either modify the attributes in manifest to make it equal to the number of columns in CSV/parquet files or modify the csv/parquet file"
Code:

from pyspark.sql.types import StructType

# Empty DataFrame used to accumulate data from all incremental folders for a given entity
emp_RDD = spark.sparkContext.emptyRDD()
emp_data = spark.createDataFrame(data=emp_RDD, schema=StructType([]))

# Read the entity through the Spark CDM connector, using model.json as the manifest
df = (spark.read.format("com.microsoft.cdm")
      .option("storage", f"{storage_account_name}.dfs.core.windows.net")
      .option("manifestPath", f"{container_name}/model.json")
      .option("entity", entity_name)
      .load())
The job fails with the above error when writing the file (the read is lazy, so the error only surfaces at write time).
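
For what it's worth, one fallback we are considering is to bypass the CDM connector entirely: build the Spark schema from model.json ourselves and read the raw CSVs in PERMISSIVE mode, which sets missing trailing fields to null instead of failing. This is only a sketch; the model.json layout (entities[].attributes[] with name/dataType), the CDM-to-Spark type mapping, and all paths below are assumptions/placeholders.

import json
from pyspark.sql.types import (StructType, StructField, StringType, LongType,
                               DoubleType, BooleanType, TimestampType)

# Assumed mapping from CDM data types to Spark types; extend as needed
CDM_TO_SPARK = {
    "string": StringType(),
    "int64": LongType(),
    "double": DoubleType(),
    "boolean": BooleanType(),
    "dateTime": TimestampType(),
    "guid": StringType(),
}

def schema_from_model_json(model_json_text, entity):
    # Assumes the standard model.json layout: entities[].attributes[]
    # with "name" and "dataType" per attribute
    model = json.loads(model_json_text)
    attrs = next(e for e in model["entities"] if e["name"] == entity)["attributes"]
    return StructType([
        StructField(a["name"], CDM_TO_SPARK.get(a["dataType"], StringType()), True)
        for a in attrs
    ])

# Placeholder base path for the container
base = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net"
model_text = "\n".join(r.value for r in spark.read.text(f"{base}/model.json").collect())
schema = schema_from_model_json(model_text, entity_name)

df = (spark.read.format("csv")
      .schema(schema)
      .option("header", "false")      # Synapse Link CDM CSVs carry no header row
      .option("mode", "PERMISSIVE")   # rows with fewer columns than the schema get nulls
      .load(f"{base}/{entity_name}/*.csv"))  # placeholder glob for the entity's files

df.write.mode("overwrite").parquet(target_path)  # target_path: placeholder destination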