apache spark - Using Scala classes as UDF with pyspark

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to offload some computations from Python to Scala when using Apache Spark. I would like to use the class interface from Java to be able to use a persistent variable, like so (this is a nonsensical MWE based on my more complex use case):

package mwe

import org.apache.spark.sql.api.java.UDF1

class SomeFun extends UDF1[Int, Int] {
  private var prop: Int = 0

  override def call(input: Int): Int = {
    if (prop == 0) {
      prop = input
    }
    prop + input
  }
}

Now I'm attempting to use this class from within pyspark:

import pyspark
from pyspark.sql import SQLContext
from pyspark import SparkContext

conf = pyspark.SparkConf()
conf.set("spark.jars", "mwe.jar")
sc = SparkContext.getOrCreate(conf)

sqlContext = SQLContext.getOrCreate(sc)
sqlContext.registerJavaFunction("fun", "mwe.SomeFun")

df0 = sc.parallelize((i,) for i in range(6)).toDF(["num"])
df1 = df0.selectExpr("fun(num) + 3 as new_num")
df1.show()

And get the following exception:

pyspark.sql.utils.AnalysisException: u"cannot resolve '(UDF:fun(num) + 3)' due to data type mismatch: differing types in '(UDF:fun(num) + 3)' (struct<> and int).; line 1 pos 0;
'Project [(UDF:fun(num#0L) + 3) AS new_num#2]
+- AnalysisBarrier
      +- LogicalRDD [num#0L], false
"

What is the correct way to implement this? Will I have to resort to Java itself for the class? I'd greatly appreciate hints!

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:29:46+0000

The exception is caused by incompatible types.

First of all, the o.a.s.sql.api.java.UDF* interfaces require external Java types (not Scala types), so a UDF expecting integers should take the boxed Integer (java.lang.Integer), not Int:
```
class SomeFun extends UDF1[Integer, Integer] {
  ...
  override def call(input: Integer): Integer = {
    ...
```

Second, unless you use legacy Python, the num column is of LongType, not IntegerType:

df0.printSchema()
root
 |-- num: long (nullable = true)

So the actual signature should be:

class SomeFun extends UDF1[java.lang.Long, java.lang.Long] {
  ...
  override def call(input: java.lang.Long): java.lang.Long = {
    ...

or the data should be cast before applying the UDF:

df0.selectExpr("fun(cast(num as integer)) + 3 as new_num")
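
For reference, the calling side of the cast variant could look roughly like the sketch below. It assumes the jar has been rebuilt with the UDF1[Integer, Integer] signature and that Spark 2.3 or later is available. build_expr and run are hypothetical helper names, not part of Spark, and passing IntegerType() as an explicit return type is an optional extra rather than something the answer requires.

```python
def build_expr(column, sql_type="integer", udf_name="fun"):
    """Build the SQL expression that casts `column` before calling the UDF."""
    return "{0}(cast({1} as {2})) + 3 as new_num".format(udf_name, column, sql_type)


def run(jar_path="mwe.jar"):
    """Register mwe.SomeFun and apply it to a small test DataFrame."""
    # Imported lazily so that build_expr can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType

    spark = (SparkSession.builder
             .config("spark.jars", jar_path)  # assumption: jar sits at this path
             .getOrCreate())

    # Assumption: declare the return type explicitly instead of relying on inference.
    spark.udf.registerJavaFunction("fun", "mwe.SomeFun", IntegerType())

    df0 = spark.createDataFrame([(i,) for i in range(6)], ["num"])
    return df0.selectExpr(build_expr("num"))


print(build_expr("num"))  # fun(cast(num as integer)) + 3 as new_num
```

Calling run().show() would then evaluate the expression, provided the jar is actually on the given path.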

Finally, mutable state is not allowed in UDFs. It won't cause an exception, but the overall behavior will be non-deterministic.
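
To illustrate why the result depends on things outside your control, here is a pure-Python sketch (no Spark involved) that mimics the logic of SomeFun. It assumes, purely for illustration, that each partition gets its own fresh instance and that rows are split into contiguous chunks; how Spark actually creates and distributes UDF instances is not specified here. run_partitioned is a hypothetical helper.

```python
class SomeFun:
    """Pure-Python stand-in for the Scala SomeFun class from the question."""

    def __init__(self):
        self.prop = 0

    def call(self, value):
        # Same logic as the Scala version: remember the first non-zero input.
        if self.prop == 0:
            self.prop = value
        return self.prop + value


def run_partitioned(rows, num_partitions):
    """Apply a fresh SomeFun instance to each contiguous chunk of rows."""
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    out = []
    for start in range(0, len(rows), size):
        fun = SomeFun()  # assumption: one instance per partition
        out.extend(fun.call(v) for v in rows[start:start + size])
    return out


rows = list(range(6))
one = run_partitioned(rows, 1)
two = run_partitioned(rows, 2)
print(one)  # [0, 2, 3, 4, 5, 6]
print(two)  # [0, 2, 3, 6, 7, 8]
```

The same input yields different output as soon as the number of partitions changes, which is the kind of non-determinism to expect from the stateful UDF.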
