How to handle the Arrow IPC breaking change (PyArrow 0.15.0) when using Pandas

Published Oct 12 2020 04:50 AM
Microsoft

When you use Pandas with PySpark in a notebook, you may be affected by a change in the Arrow IPC format that was introduced in Apache Arrow 0.15.0.

You can find more information here: https://arrow.apache.org/blog/2019/10/06/0.15.0-release/

 

This short post shows how to work around it.

 

The customer was using the example available here:

https://spark.apache.org/docs/2.4.0/sql-pyspark-pandas-with-arrow.html

Specifically this one:

 

 

 

 

import pandas as pd

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

# Declare the function and create the UDF
def multiply_func(a, b):
    return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())

# The function for a pandas_udf should be able to execute with local Pandas data
x = pd.Series([1, 2, 3])
print(multiply_func(x, x))

# Create a Spark DataFrame, 'spark' is an existing SparkSession
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

# Execute function as a Spark vectorized UDF
df.select(multiply(col("x"), col("x"))).show()

 

 

 

When the customer ran it in Synapse, it failed with the error:

Py4JJavaError : An error occurred while calling o205.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 4 times, most recent failure: Lost task 1.3 in stage 2.0 (TID 15, c671bd6ddc35b7487900238907316, executor 1): java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)

 

Synapse Spark uses the library documented here:

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-version-support
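
To judge whether a given environment is likely to hit this error, you can compare the installed versions. The sketch below is only an illustration: `parse_version` and `needs_legacy_ipc_flag` are hypothetical helper names, and the rule they encode (PySpark 2.3.x or 2.4.x combined with PyArrow 0.15.0 or later) is an assumption taken from the Spark 2.4 documentation, not something specific to Synapse.

```python
def parse_version(version):
    """Turn a string such as '0.15.1' into a tuple such as (0, 15, 1).

    Only the first three components are used, and any non-numeric
    suffix on a component (for example 'rc1') is ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '0.15' so tuple comparison behaves.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def needs_legacy_ipc_flag(pyarrow_version, spark_version):
    """Hypothetical helper: True when the legacy IPC flag is likely needed.

    Assumption: the mismatch affects PySpark 2.3.x and 2.4.x when
    PyArrow is 0.15.0 or later.
    """
    arrow = parse_version(pyarrow_version)
    spark = parse_version(spark_version)
    return arrow >= (0, 15, 0) and spark[:2] in ((2, 3), (2, 4))
```

In a notebook you could pass in `pyarrow.__version__` and `spark.version`, assuming `pyarrow` is importable and `spark` is the existing SparkSession.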

 

The workaround is to make PyArrow use the legacy (pre-0.15) Arrow IPC format:

1) Add this to your code:

 

 

    import os
    # Ask PyArrow 0.15+ to use the pre-0.15 IPC format
    os.environ['ARROW_PRE_0_15_IPC_FORMAT'] = '1'
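
Setting the variable in notebook code affects the Python process that runs that code. If the executors also need to see it, one possible approach is to pass it through Spark configuration. The sketch below is an assumption-laden illustration: `enable_legacy_arrow_ipc` is a hypothetical helper, and whether your Synapse pool lets you set the `spark.executorEnv.*` and `spark.yarn.appMasterEnv.*` properties is not confirmed by this post.

```python
import os

LEGACY_IPC_VAR = "ARROW_PRE_0_15_IPC_FORMAT"


def enable_legacy_arrow_ipc(environ=os.environ):
    """Hypothetical helper, not part of PySpark or PyArrow.

    Sets the flag in the given environment mapping (by default the
    current Python process) and returns Spark configuration entries
    that would forward the same flag to executors and, on YARN, to
    the application master.
    """
    environ[LEGACY_IPC_VAR] = "1"
    return {
        "spark.executorEnv." + LEGACY_IPC_VAR: "1",
        "spark.yarn.appMasterEnv." + LEGACY_IPC_VAR: "1",
    }
```

The returned entries would have to be applied when the Spark session or pool is configured, because executor environment settings are read at startup.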

 

 

For that specific example, also replace show() with display(), like this:

 

 

# Instead of:
df.select(multiply(col("x"), col("x"))).show()

# use:
display(df.select(multiply(col("x"), col("x"))))

 

Alternatively, you can replace the installed PyArrow with the 0.14.1 wheel (.whl) files:

 https://pypi.org/project/pyarrow/0.14.1/#files

Instructions here:

 https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-librari...

 

That is it!

Liliam 

UK Engineer
