apache spark - Why is pyspark picking up a variable that was not broadcast?

Question

Welcome To Ask or Share your Answers For Others

apache spark - Why is pyspark picking up a variable that was not broadcast?

posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

apache spark - Why is pyspark picking up a variable that was not broadcast?

I'm using pyspark to analyse a dataset and I'm a little bit surprised as to why the following code works correctly even though I'm using a variable that was not broadcast.

The variable in question is video, that's used in function filter, after the join.

seed = random.randint(0,999)

# df is a dataframe
# video is just one randomly sampled element
video = df.sample(False,0.001,seed).head()

# just a python list
otherVideos = [ (22,0.32),(213,0.43) ]

# transform the python list into an rdd 
resultsRdd = sc.parallelize(similarVideos)

rdd = df.rdd.map(lambda row: (row.video_id,row.title))

# perform a join between resultsRdd and rdd
# note that video.title was NOT broadcast
(resultsRdd
   .join(rdd)
   .filter(lambda pair: pair[1][1] != video.title) # HERE!!!
   .takeOrdered(10, key= lambda pair: -pair[1][0]))

I'm using pyspark in standalone mode, with the following arguments to pyspark-submit:

--num-executors 12 --executor-cores 4 --executor-memory 1g --master local[*]

Also, I'm running the previous code on jupyter (new ipython-notebooks).

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:05:51+0000

[Reposting comment as an answer.]

For this concept, I think this link on understanding closures is a pretty good read. Essentially, you do not need to broadcast all variables outside the scope of an RDD since the closure (in your case video) will be serialized and sent to each executor and task for access during task execution. Broadcast variables are useful when the dataset being broadcast is large because it will exist as a read-only cache that will sit on the executor and not be serialized/sent/deserialized with each task run on that executor.

Categories

apache spark - Why is pyspark picking up a variable that was not broadcast?

apache spark - Why is pyspark picking up a variable that was not broadcast?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags