Unable to write csv to azure blob storage using Pyspark

Question

Hi there,&nbsp;I am trying to write a csv to an azure blob storage using pyspark but receiving error as follows:Caused by: com.microsoft.azure.storage.StorageException: One of the request inputs is not valid.
       at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:89)
       at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:305)
       at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:175)
       at com.microsoft.azure.storage.blob.CloudBlob.startCopy(CloudBlob.java:883)
       at com.microsoft.azure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:825)
       at org.apache.hadoop.fs.azure.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:399)
       at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2449)
       ... 22 more&nbsp;The code used is :def put_data_to_azure(self, df, fs_azure, fs_account_key, destination_path, file_format, repartition):      self.code_log.info('in put_data_to_azure')    try:        self.sc._jsc.hadoopConfiguration().set("fs.azure", fs_azure)        self.sc._jsc.hadoopConfiguration().set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")        self.sc._jsc.hadoopConfiguration().set("fs.azure.account.key.%s.blob.core.windows.net" % fs_azure,                                               fs_account_key)        df.repartition(repartition).write.format(file_format).save(destination_path)    except Exception as e:        error1 = str(e).splitlines()[:2]        exception = "Exception in put_data_to_azure: " + ''.join(error1)        raise ExceptionHandler(exception)&nbsp;The destination path of azure is ' wasbs://&lt;container&gt;@&lt;storage account&gt;.blob.core.windows.net/folder'&nbsp;Jars used are hadoop-azure-2.7.0.jar and azure-storage-2.0.0.jar. I have used these as I found them to be stable versions and could read from azure successfully but could not write. I have also tried newer versions&nbsp;

mfessalifi · Answer

Hello&nbsp;Ashwini_Akula&nbsp;,&nbsp;Just to be sure, as&nbsp;Azure blob requires to install additional libraries for accessing data from it because it uses wasb/wasbs protocol.Have you add this libraries?&nbsp;NB : Wasbs protocol is just an extension built on top of the HDFS APIs. In order to access resources from azure blob you need to add built jar files, named hadoop-azure.jar and azure-storage.jar to spark-submit when you submitting a job.&nbsp;Regards,Faiçal

ashwini_akula · Answer

Hi&nbsp;Faiçal mfessalifi&nbsp;,&nbsp;Yes I have used those jars and yet I am facing the issue.&nbsp;

mfessalifi · Answer

Hi Ashwini_Akula,&nbsp;To eliminate Scala/Spark to Storage connection issues, can you test a simple connection?&nbsp;scala&gt; val df = spark.read.format("csv").option("inferSchema", "true").load("wasbs://CONTAINER_NAME@ACCOUNT_NAME.blob.core.windows.net/&lt;Folder&gt;/..")*scala&gt; df.show()&nbsp;Regards,Faiçal (MCT, Azure Expert &amp; Team Leader)

rajengla · Answer

mfessalifi&nbsp;I am facing the same issue as well. We are able to read from the Azure Blob storage. But facing the issue while writing the data using PySpsark.sc = pyspark.SparkContext.getOrCreate()
spark.sparkContext.setLogLevel('ERROR')

storage_account_access_key = "******"

blob_account_name = "blob_account_name"
blob_container_name = "blob_container_name"
spark.conf.set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set("fs.azure.account.key.%s.blob.core.windows.net" % (blob_account_name), storage_account_access_key)

sc._jsc.hadoopConfiguration().set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("fs.azure.account.key.%s.blob.core.windows.net" % (blob_account_name), storage_account_access_key)
sc._jsc.hadoopConfiguration().set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sc._jsc.hadoopConfiguration().set("spark.hadoop.fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
sqlContext = SQLContext(sc)

rel_input_path = "input_path"
input_path = "wasbs://%s@%s.blob.core.windows.net/%s" % (blob_container_name, blob_account_name, rel_input_path)
df = sqlContext.read.format("com.databricks.spark.csv").option("delimiter","~").option("header", "false").load(input_path)

rel_output_path = "output_path"
output_path = "wasbs://%s@%s.blob.core.windows.net/%s" % (blob_container_name, blob_account_name, rel_output_path)
df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save(output_path)&nbsp;Please find attached a stack trace of the error.&nbsp;Caused by: org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: One of the request inputs is not valid.
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2482)
        at org.apache.hadoop.fs.azure.NativeAzureFileSystem$FolderRenamePending.execute(NativeAzureFileSystem.java:424)
        at org.apache.hadoop.fs.azure.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1997)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:531)
        at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:502)
        at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
        at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:77)
        at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:245)
        at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:79)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:275)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:281)
        ... 9 more
Caused by: com.microsoft.azure.storage.StorageException: One of the request inputs is not valid.
        at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:162)
        at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:307)
        at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:177)
        at com.microsoft.azure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:764)
        at org.apache.hadoop.fs.azure.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:399)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2449)
        ... 20 more&nbsp;What might be the reason behind this error?

amrinder_singh · Answer

One thing to check is whether you are using a blob storage account or a ADLS Gen 2 (HNS) account. If you are making use of ADLS Gen2 kind try connecting with ABFS driver instead of WASBS driver.

Forum Discussion

Unable to write csv to azure blob storage using Pyspark

5 Replies

Resources