Forum Discussion

dubeyneeraj
Copper Contributor
Aug 04, 2023

Copy CSV files with varying schema as Parquet from Synapse workspace

Hi

We have created an Azure Synapse Link profile with in-place updates. As a result, files (in CSV format) are periodically synced into the Azure Synapse workspace, and the schema of the files is maintained in model.json.
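For reference, the entity schemas in model.json look roughly like this (the entity and attribute names here are illustrative, not our actual data):

{
  "name": "cdm",
  "version": "1.0",
  "entities": [
    {
      "$type": "LocalEntity",
      "name": "account",
      "attributes": [
        { "name": "accountid", "dataType": "guid" },
        { "name": "name", "dataType": "string" }
      ]
    }
  ]
}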

We want to copy these files as Parquet to another Data Lake Storage account, using a notebook.

Challenge: when the schema of an entity is updated (new columns added), the schema of its subsequent CSV files changes too. So we have multiple CSV files with varying schemas, while model.json only holds the most recent schema. A sketch of what we are effectively trying to do is below.
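Conceptually, we need to read each CSV file with the columns it actually contains, pad any columns that were added later with nulls so everything matches the latest schema from model.json, and only then append and write. A rough sketch of that idea (align_to_schema and latest_schema are illustrative names, not our real code):

from pyspark.sql import functions as F

def align_to_schema(df, latest_schema):
    # Add every column from the latest model.json schema that is
    # missing in this file's DataFrame, filled with nulls.
    for field in latest_schema.fields:
        if field.name not in df.columns:
            df = df.withColumn(field.name, F.lit(None).cast(field.dataType))
    # Reorder the columns so all DataFrames union cleanly.
    return df.select([field.name for field in latest_schema.fields])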

How can we copy these files?

We tried, but we always get the error: "The number of columns in CSV/parquet file is not equal to the number of fields in Spark StructType. Either modify the attributes in manifest to make it equal to the number of columns in CSV/parquet files or modify the csv/parquet file"

Code:

from pyspark.sql.types import StructType

# Define an empty DataFrame to append data from all incremental
# folders for a given entity.
emp_RDD = spark.sparkContext.emptyRDD()
columns = StructType([])
emp_data = spark.createDataFrame(data=emp_RDD, schema=columns)

# Read the entity through the Spark CDM connector, using the
# schema maintained in model.json.
df = (spark.read.format("com.microsoft.cdm")
        .option("storage", f"{storage_account_name}.dfs.core.windows.net")
        .option("manifestPath", f"{container_name}/model.json")
        .option("entity", entity_name)
        .load())
 
It fails when writing the file.
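The write itself is just a plain Parquet write to the target storage account, roughly like this (target_path here is a placeholder, not our real path):

df.write.mode("append").parquet(target_path)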

