Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


scala - Difference between sc.broadcast and the broadcast function in Spark SQL

I have used sc.broadcast for lookup files to improve performance.

I have also learned that there is a function called broadcast in Spark SQL functions.

What is the difference between the two?

Which one should I use for broadcasting reference/lookup tables?


1 Reply


In short:

1) org.apache.spark.sql.functions.broadcast() is a user-supplied, explicit hint for a given SQL join.

2) sc.broadcast is for broadcasting a read-only shared variable.
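To make the contrast concrete, here is a minimal sketch of both calls, assuming a local SparkSession. The table contents, column names, and the object name BroadcastKinds are hypothetical and only serve the illustration. For lookup tables that already live in DataFrames, the join hint is usually the one that applies; the broadcast variable fits when the lookup is an ordinary in-memory value used inside functions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col, udf}

object BroadcastKinds {
  // Hypothetical sample data: a fact table and a small lookup table.
  def orders(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1, "US"), (2, "DE"), (3, "US")).toDF("orderId", "countryCode")
  }

  def countries(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("US", "United States"), ("DE", "Germany")).toDF("countryCode", "countryName")
  }

  // 1) functions.broadcast: a hint on a DataFrame that takes part in a join.
  def withJoinHint(spark: SparkSession): DataFrame =
    orders(spark).join(broadcast(countries(spark)), Seq("countryCode"))

  // 2) sc.broadcast: a read-only value shipped to executors, read via .value.
  def withBroadcastVariable(spark: SparkSession): DataFrame = {
    val lookup = spark.sparkContext.broadcast(
      Map("US" -> "United States", "DE" -> "Germany"))
    val toName = udf((code: String) => lookup.value.getOrElse(code, "unknown"))
    orders(spark).withColumn("countryName", toName(col("countryCode")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-kinds")
      .getOrCreate()
    withJoinHint(spark).show()
    withBroadcastVariable(spark).show()
    spark.stop()
  }
}
```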


More details about the broadcast function (#1):

The Scaladoc in sql/execution/SparkStrategies.scala describes how a join strategy is selected:

  • Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold or if that side has an explicit broadcast hint (e.g. the user applied the [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling performed. If both sides of the join are eligible to be broadcasted then the ...
  • Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
  • Sort merge: if the matching join keys are sortable.

If there is no joining keys, Join implementations are chosen with the following precedence:

  • BroadcastNestedLoopJoin: if one side of the join could be broadcasted
  • CartesianProduct: for Inner join
  • BroadcastNestedLoopJoin

The method below controls this behavior based on the size set in spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB.

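Note: the join order can matter. As observed here, smallDataFrame.join(largeDataFrame) does not do a broadcast hash join, but largeDataFrame.join(smallDataFrame) does. This depends on the join type and Spark version; in a left outer join, for example, only the right side can be broadcast.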
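One way to check which strategy a given join order gets on your own Spark version is to inspect the physical plan. The sketch below assumes a local SparkSession and hypothetical DataFrames named small and large. It sets spark.sql.autoBroadcastJoinThreshold to -1, which turns off size-based broadcasting, so that any broadcast join in the output comes from the explicit hint.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object JoinPlanCheck {
  // Physical plan as text; df.explain() prints roughly the same tree.
  def physicalPlan(df: DataFrame): String =
    df.queryExecution.executedPlan.toString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-plan-check")
      .getOrCreate()
    import spark.implicits._

    // -1 disables automatic (size-based) broadcast joins; hints still apply.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Hypothetical data: a two-row lookup table and a larger table.
    val small = Seq((1, "a"), (2, "b")).toDF("id", "label")
    val large = (1 to 1000).map(i => (i % 2 + 1, i)).toDF("id", "value")

    // No hint: inspect which join operator the planner picks.
    println(physicalPlan(large.join(small, Seq("id"))))

    // Explicit hint on the small side, in both join orders.
    println(physicalPlan(large.join(broadcast(small), Seq("id"))))
    println(physicalPlan(broadcast(small).join(large, Seq("id"))))

    spark.stop()
  }
}
```

Comparing the printed plans for the two hinted orders shows whether join order changes the chosen operator in the version at hand.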

    /**
     * Matches a plan whose output should be small enough to be used in broadcast join.
     */
    private def canBroadcast(plan: LogicalPlan): Boolean = {
      plan.statistics.isBroadcastable ||
        plan.statistics.sizeInBytes <= conf.autoBroadcastJoinThreshold
    }
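The check above is internal to Spark and cannot be called directly, but its logic is easy to model in plain Scala. The following is a simplified stand-in, not Spark's code: PlanStats and BroadcastDecision are hypothetical names, and the default threshold is the 10 MB mentioned above. It shows why an explicit hint wins regardless of size, and why a threshold of -1 disables only the size-based path.

```scala
// Hypothetical stand-in for the statistics Spark keeps per logical plan.
final case class PlanStats(sizeInBytes: BigInt, isBroadcastable: Boolean = false)

object BroadcastDecision {
  // 10 MB, the documented default of spark.sql.autoBroadcastJoinThreshold.
  val DefaultThreshold: Long = 10L * 1024 * 1024

  // Same shape as canBroadcast: an explicit hint OR a size under the threshold.
  def canBroadcast(stats: PlanStats, threshold: Long = DefaultThreshold): Boolean =
    stats.isBroadcastable || stats.sizeInBytes <= BigInt(threshold)
}

object BroadcastDecisionDemo {
  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    val smallTable = PlanStats(BigInt(5 * mb))
    val bigTable = PlanStats(BigInt(50 * mb))
    val bigHinted = bigTable.copy(isBroadcastable = true)

    println(BroadcastDecision.canBroadcast(smallTable))     // under the threshold
    println(BroadcastDecision.canBroadcast(bigTable))       // over the threshold, no hint
    println(BroadcastDecision.canBroadcast(bigHinted))      // over the threshold, hinted
    println(BroadcastDecision.canBroadcast(smallTable, -1)) // size-based path disabled
  }
}
```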

The configurations below will be deprecated in future versions of Spark.

[Image: list of configurations; not available]

