I've defined two tables like this:
val tableName = "table1"
val tableName2 = "table2"
val format = new SimpleDateFormat("yyyy-MM-dd")
val data = List(
List("mike", 26, true),
List("susan", 26, false),
List("john", 33, true)
)
val data2 = List(
List("mike", "grade1", 45, "baseball", new java.sql.Date(format.parse("1957-12-10").getTime)),
List("john", "grade2", 33, "soccer", new java.sql.Date(format.parse("1978-06-07").getTime)),
List("john", "grade2", 32, "golf", new java.sql.Date(format.parse("1978-06-07").getTime)),
List("mike", "grade2", 26, "basketball", new java.sql.Date(format.parse("1978-06-07").getTime)),
List("lena", "grade2", 23, "baseball", new java.sql.Date(format.parse("1978-06-07").getTime))
)
val rdd = sparkContext.parallelize(data).map(Row.fromSeq(_))
val rdd2 = sparkContext.parallelize(data2).map(Row.fromSeq(_))
val schema = StructType(Array(
StructField("name", StringType, true),
StructField("age", IntegerType, true),
StructField("isBoy", BooleanType, false)
))
val schema2 = StructType(Array(
StructField("name", StringType, true),
StructField("grade", StringType, true),
StructField("howold", IntegerType, true),
StructField("hobby", StringType, true),
StructField("birthday", DateType, false)
))
val df = sqlContext.createDataFrame(rdd, schema)
val df2 = sqlContext.createDataFrame(rdd2, schema2)
df.createOrReplaceTempView(tableName)
df2.createOrReplaceTempView(tableName2)
I'm trying to build query to return rows from table1 that doesn't have matching row in table2.
I've tried to do it using this query:
Select * from table1 LEFT JOIN table2 ON table1.name = table2.name AND table1.age = table2.howold AND table2.name IS NULL AND table2.howold IS NULL
but this just gives me all rows from table1:
List({"name":"john","age":33,"isBoy":true},
{"name":"susan","age":26,"isBoy":false},
{"name":"mike","age":26,"isBoy":true})
How to make this type of join in Spark efficiently?
I'm looking for an SQL query because I need to be able to specify columns which to compare between two tables, not just compare row by row like it is done in other recommended questions. Like using subtract, except etc.
See Question&Answers more detail:
os