Presenting here before possibly filing a bug. I'm using Spark 1.6.0.
This is a simplified version of the problem I'm dealing with. I've filtered a table, and then I'm trying to do a left outer join with that subset and the main table, matching all the columns.
I've only got 2 rows in the main table and one in the filtered table. I'm expecting the resulting table to only have the single row from the subset.
scala> val b = Seq(("a", "b", 1), ("a", "b", 2)).toDF("a", "b", "c")
b: org.apache.spark.sql.DataFrame = [a: string, b: string, c: int]
scala> val a = b.where("c = 1").withColumnRenamed("a", "filta").withColumnRenamed("b", "filtb")
a: org.apache.spark.sql.DataFrame = [filta: string, filtb: string, c: int]
scala> a.join(b, $"filta" <=> $"a" and $"filtb" <=> $"b" and a("c") <=> b("c"), "left_outer").show
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
| a| b| 1| a| b| 2|
+-----+-----+---+---+---+---+
I didn't expect that result at all. I expected the first row, but not the second. I suspected it's the null-safe equality, so I tried it without.
scala> a.join(b, $"filta" === $"a" and $"filtb" === $"b" and a("c") === b("c"), "left_outer").show
16/03/21 12:50:00 WARN Column: Constructing trivially true equals predicate, 'c#18232 = c#18232'. Perhaps you need to use aliases.
+-----+-----+---+---+---+---+
|filta|filtb| c| a| b| c|
+-----+-----+---+---+---+---+
| a| b| 1| a| b| 1|
+-----+-----+---+---+---+---+
OK, that's the result I expected, but then I got suspicious of the warning. There is a separate StackOverflow question to deal with that warning here: Spark SQL performing carthesian join instead of inner join
So I create a new column that avoids the warning.
scala> a.withColumn("newc", $"c").join(b, $"filta" === $"a" and $"filtb" === $"b" and $"newc" === b("c"), "left_outer").show
+-----+-----+---+----+---+---+---+
|filta|filtb| c|newc| a| b| c|
+-----+-----+---+----+---+---+---+
| a| b| 1| 1| a| b| 1|
| a| b| 1| 1| a| b| 2|
+-----+-----+---+----+---+---+---+
But now the result is wrong again!
I have a lot of null-safe equality checks, and the warning isn't fatal, so I don't see a clear path to working with/around this.
Is the behaviour a bug, or is this expected behaviour? If expected, why?
See Question&Answers more detail:
os