I'm trying to create a custom join for two dataframes (df1 and df2) in PySpark (similar to this), with code that looks like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
my_join_udf = udf(lambda x, y: isJoin(x, y), BooleanType())
my_join_df = df1.join(df2, my_join_udf(df1.col_a, df2.col_b))
The error message I'm getting is:
java.lang.RuntimeException: Invalid PythonUDF PythonUDF#<lambda>(col_a#17,col_b#0), requires attributes from more than one child
Is there a way to write a PySpark UDF that can process columns from two separate dataframes?
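One commonly suggested workaround, sketched here under assumptions rather than as a confirmed fix, is to pair the rows first and apply the UDF as a filter afterwards. The sketch assumes a Spark version that provides `DataFrame.crossJoin` (2.1 or later), and that `col_a` and `col_b` are unambiguous column names after the join. `is_join` is a hypothetical stand-in for `isJoin`, and `udf_join` is an illustrative helper name, not part of PySpark.

```python
def is_join(a, b):
    """Hypothetical stand-in for isJoin: match when b starts with a."""
    return a is not None and b is not None and b.startswith(a)


def udf_join(df1, df2):
    """Pair every row of df1 with every row of df2, then keep the pairs where is_join holds."""
    # Imported here so is_join stays plain Python and can be checked without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    my_join_udf = udf(is_join, BooleanType())

    # After the cross join, col_a and col_b are columns of one DataFrame,
    # so the UDF no longer needs attributes from two separate children.
    pairs = df1.crossJoin(df2)
    return pairs.filter(my_join_udf(pairs["col_a"], pairs["col_b"]))
```

Calling `udf_join(df1, df2)` would return the matching row pairs. Note that a cross join compares every row of df1 with every row of df2, so the cost grows with the product of the two row counts. If part of the condition can be expressed with built-in column expressions, applying that part as an ordinary join condition first should reduce the number of pairs the UDF has to evaluate.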