Forum Discussion

Ashwini_Akula
Copper Contributor
Aug 05, 2020

Unable to write CSV to Azure Blob Storage using PySpark

Hi there,

 

I am trying to write a CSV file to Azure Blob Storage using PySpark, but I am receiving the following error:

Caused by: com.microsoft.azure.storage.StorageException: One of the request inputs is not valid.
       at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:89)
       at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:305)
       at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:175)
       at com.microsoft.azure.storage.blob.CloudBlob.startCopy(CloudBlob.java:883)
       at com.microsoft.azure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:825)
       at org.apache.hadoop.fs.azure.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:399)
       at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2449)
       ... 22 more

 

The code used is:

def put_data_to_azure(self, df, fs_azure, fs_account_key, destination_path, file_format, repartition):
    self.code_log.info('in put_data_to_azure')
    try:
        self.sc._jsc.hadoopConfiguration().set("fs.azure", fs_azure)
        self.sc._jsc.hadoopConfiguration().set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
        self.sc._jsc.hadoopConfiguration().set("fs.azure.account.key.%s.blob.core.windows.net" % fs_azure,
                                               fs_account_key)
        df.repartition(repartition).write.format(file_format).save(destination_path)
    except Exception as e:
        error1 = str(e).splitlines()[:2]
        exception = "Exception in put_data_to_azure: " + ''.join(error1)
        raise ExceptionHandler(exception)

The destination path in Azure is 'wasbs://<container>@<storage account>.blob.core.windows.net/folder'.
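
A minimal sketch of how I call the function (the instance name and argument values here are placeholders, not my real ones):

# Illustrative call only; "writer" and the argument values are placeholders.
writer.put_data_to_azure(
    df=df,
    fs_azure="<storage account>",
    fs_account_key="<account key>",
    destination_path="wasbs://<container>@<storage account>.blob.core.windows.net/folder",
    file_format="csv",
    repartition=1,
)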

 

The JARs used are hadoop-azure-2.7.0.jar and azure-storage-2.0.0.jar. I chose these because I found them to be stable versions; with them I could read from Azure successfully, but I could not write. I have also tried newer versions.

5 Replies

  • mfessalifi
    Copper Contributor

    Hello Ashwini_Akula,

     

    Just to be sure: Azure Blob Storage requires additional libraries to be installed in order to access data, because it uses the wasb/wasbs protocol.

    Have you added these libraries?

     

    NB: The wasbs protocol is just an extension built on top of the HDFS APIs. In order to access resources in Azure Blob Storage, you need to add the built JAR files, hadoop-azure.jar and azure-storage.jar, to spark-submit when submitting a job, as in the sketch below.
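
    For example, you can pass the local files with spark-submit --jars hadoop-azure-2.7.0.jar,azure-storage-2.0.0.jar your_job.py, or resolve them at session start as in this sketch (the Maven coordinates below just reuse the versions you mentioned, I have not verified them against your cluster):

    # Sketch only: resolve the Azure connectors by Maven coordinate at session start
    # instead of shipping local JAR files; adjust the versions to match your cluster.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("azure-blob-connectivity-check")
        .config("spark.jars.packages",
                "org.apache.hadoop:hadoop-azure:2.7.0,"
                "com.microsoft.azure:azure-storage:2.0.0")
        .getOrCreate()
    )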

     

    Regards,

    Faiçal

      • mfessalifi
        Copper Contributor

        Hi Ashwini_Akula,

         

        To rule out Spark-to-Storage connection issues, can you test a simple read first?

         

        scala> val df = spark.read.format("csv").option("inferSchema", "true").load("wasbs://CONTAINER_NAME@ACCOUNT_NAME.blob.core.windows.net/<Folder>/..")
        scala> df.show()
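
        Since your job is in PySpark, a rough equivalent of the same read test would be the following; the placeholders match the ones above, and the account-key line mirrors the hadoopConfiguration call from your original post:

        # Rough PySpark equivalent of the Scala read test above; CONTAINER_NAME,
        # ACCOUNT_NAME and <Folder> are the same placeholders as in the Scala line.
        spark.sparkContext._jsc.hadoopConfiguration().set(
            "fs.azure.account.key.ACCOUNT_NAME.blob.core.windows.net", "<account key>")
        df = (spark.read.format("csv")
              .option("inferSchema", "true")
              .load("wasbs://CONTAINER_NAME@ACCOUNT_NAME.blob.core.windows.net/<Folder>/.."))
        df.show()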

         

        Regards,

        Faiçal (MCT, Azure Expert & Team Leader)
