Arithmetic overflow error converting double to data type FLOAT due to NaN

Microsoft

Aug 17, 2020

NaN stands for Not a Number. In this scenario, a customer was trying to load a Parquet file into SQL but was not able to.

The load failed with the following error: Error converting values NaN or Infinity to type 'FLOAT'. NaN and Infinity are not supported. This error came from SQL DW, and I got it when I ran a SELECT with OPENROWSET pointing to that file.

For an example of a SELECT with OPENROWSET, you can use this as a reference: https://techcommunity.microsoft.com/t5/azure-synapse-analytics/synapse-studio-error-while-trying-to-read-data-from-storage/ba-p/1511965

When he tried to insert the file directly into SQL from the notebook, the error was: HadoopSqlException: Arithmetic overflow error converting double to data type FLOAT.

In summary, the float columns contained values that caused the errors above. Those values were NaN values.
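To make the problem concrete, here is a minimal plain-Python sketch of how NaN behaves and of the NaN-to-NULL conversion used later in this post. It does not touch Spark or SQL; the helper name nan_to_none is hypothetical and only illustrates the idea.

```python
import math

nan = float("nan")

# NaN is a valid floating-point value, but in Python it never compares
# equal to anything, itself included, so use math.isnan() to detect it.
print(nan == nan)        # False
print(math.isnan(nan))   # True


def nan_to_none(value):
    """Return None for NaN floats and leave every other value unchanged."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


# Hypothetical sample values standing in for one float column.
row = [1.5, nan, 3.0]
print([nan_to_none(v) for v in row])  # [1.5, None, 3.0]
```

None is what ends up as NULL on the SQL side, which a FLOAT column can store, whereas NaN is rejected.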

The example below is based on this piece of documentation:

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export

I added some customization, and I also had a nice discussion with my colleague Diya Mothafar, who suggested doing the same with Pandas, which is also valid. My demo does not use Pandas, but again, it also does the job.

So the idea here is to convert those NaN values into NULL and then load the data into the SQL pool. We can do this with Spark notebooks.

First, open Synapse Studio -> Notebook -> PySpark.

Fig 1 PySpark

%%pyspark
from pyspark.sql.functions import col, when

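# Read the Parquet file from the data lake into a DataFrame
data_path = spark.read.load('abfss://filesystemdatalake@mystorage.dfs.core.windows.net/test/filetest.snappy.parquet', format='parquet')

# Replace NaN with None (NULL) so the values can be converted to SQL FLOAT
data_path = data_path.replace(float('nan'), None)

# Register a temporary view so the Scala cell can read the same data
data_path.createOrReplaceTempView("pysparkdftemptable")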
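Before handing the data over to the Scala cell, it may be worth confirming that no NaN values are left. The following is only a sketch: the helper names float_column_names and count_nans are hypothetical, it assumes data_path is the DataFrame from the cell above, and it assumes simple column names (no dots or other special characters).

```python
def float_column_names(dtypes):
    """Pick the float/double columns out of a DataFrame.dtypes list of (name, type) pairs."""
    return [name for name, dtype in dtypes if dtype in ("float", "double")]


def count_nans(df):
    """Return a one-row DataFrame holding the number of NaN values per float/double column."""
    # Imported here so the helper above stays usable without a Spark session.
    from pyspark.sql.functions import col, count, isnan, when

    columns = float_column_names(df.dtypes)
    # when() without otherwise() yields NULL for non-matching rows, and count() skips NULLs.
    return df.select([count(when(isnan(col(c)), c)).alias(c) for c in columns])


# In the notebook, after the replace() call:
# count_nans(data_path).show()
```

If every count is 0, the float columns no longer contain NaN. Running the same check before the replace() call shows which columns were affected.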

Add a Scala cell to the notebook using the %%spark magic. Note that the table is created by the job; you do not need to create it in advance.

Fig 2 Add Cell

%%spark
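// Read the temporary view registered by the PySpark cell
val scala_df = spark.sqlContext.sql("select * from pysparkdftemptable")
//scala_df.show(100)
// Write to the SQL pool; the table is created by the job
scala_df.write.sqlanalytics("YourDatabaseName.dbo.PySparkTable", Constants.INTERNAL)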

Your new cell should look like this:

Fig 3 Cell

Once the job is complete, you can check the results by opening SSMS and querying the table.

Note: If, instead of replacing the NaN values, you want to filter on them, you can follow the example below:

PySpark example:

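from pyspark.sql.functions import col, isnan

# Keeps only the rows where Column_name is NaN; negate with ~isnan(...) to drop them instead
nan_rows = data_path.where(isnan(col("Column_name")))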

That is it!

Liliam

UK Engineer

Updated Aug 18, 2020

Liliam_C_Leme
