apache spark - Why using a UDF in a SQL query leads to cartesian product?

Question

Welcome To Ask or Share your Answers For Others

apache spark - Why using a UDF in a SQL query leads to cartesian product?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

apache spark - Why using a UDF in a SQL query leads to cartesian product?

I saw Databricks-Question and don't understand

Why using UDFs leads to a Cartesian product instead of a full outer join? Obviously the Cartesian product would be a lot more rows than a full outer join(Joins is an example) which is a potential performance hit.
Any way to force an outer join over the Cartesian product in the example given in Databricks-Question?

Quoting the Databricks-Question here:

I have a Spark Streaming application that uses SQLContext to execute SQL statements on streaming data. When I register a custom UDF in Scala, the performance of the streaming application degrades significantly. Details below:

Statement 1:

Select col1, col2 from table1 as t1 join table2 as t2 on t1.foo = t2.bar

Statement 2:

Select col1, col2 from table1 as t1 join table2 as t2 on equals(t1.foo,t2.bar)

I register a custom UDF using SQLContext as follows:

sqlc.udf.register("equals", (s1: String, s2:String) => s1 == s2)

On the same input and Spark configuration, Statement2 performance significantly worse(close to 100X) compared to Statement1.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T23:50:06+0000

Why using UDFs leads to a Cartesian product instead of a full outer join?

The reason why using UDFs require Cartesian product is quite simple. Since you pass an arbitrary function with possibly infinite domain and non-deterministic behavior the only way to determine its value is to pass arguments and evaluate. It means you simply have to check all possible pairs.

Simple equality from the other hand has a predictable behavior. If you use t1.foo = t2.bar condition you can simply shuffle t1 and t2 rows by foo and bar respectively to get expected result.

And just to be precise in the relational algebra outer join is actually expressed using natural join. Anything beyond that is simply an optimization.

Any way to force an outer join over the Cartesian product

Not really, unless you want to modify Spark SQL engine.

Categories

apache spark - Why using a UDF in a SQL query leads to cartesian product?

apache spark - Why using a UDF in a SQL query leads to cartesian product?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags