python - Pyspark Dataframe Join using UDF

Question

Welcome To Ask or Share your Answers For Others

python - Pyspark Dataframe Join using UDF

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Pyspark Dataframe Join using UDF

I'm trying to create a custom join for two dataframes (df1 and df2) in PySpark (similar to this), with code that looks like this:

my_join_udf = udf(lambda x, y: isJoin(x, y), BooleanType())
my_join_df = df1.join(df2, my_join_udf(df1.col_a, df2.col_b))

The error message I'm getting is:

java.lang.RuntimeException: Invalid PythonUDF PythonUDF#<lambda>(col_a#17,col_b#0), requires attributes from more than one child

Is there a way to write a PySpark UDF that can process columns from two separate dataframes?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:30:46+0000

Spark 2.2+

You have to use crossJoin or enable cross joins in the configuration:

df1.crossJoin(df2).where(my_join_udf(df1.col_a, df2.col_b))

Spark 2.0, 2.1

Method shown below doesn't work anymore in Spark 2.x. See SPARK-19728.

Spark 1.x

Theoretically you can join and filter:

df1.join(df2).where(my_join_udf(df1.col_a, df2.col_b))

but in general you shouldn't to it all. Any type of join which is not based on equality requires a full Cartesian product (same as the answer) which is rarely acceptable (see also Why using a UDF in a SQL query leads to cartesian product?).

Categories

python - Pyspark Dataframe Join using UDF

python - Pyspark Dataframe Join using UDF

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags