I'm using PySpark to analyse a dataset, and I'm a bit surprised that the following code works correctly even though it uses a variable that was not broadcast. The variable in question is video, which is used in the filter call after the join.
import random

seed = random.randint(0, 999)
# df is a DataFrame
# video is one randomly sampled Row
video = df.sample(False, 0.001, seed).head()
# a plain Python list of (video_id, score) pairs
otherVideos = [(22, 0.32), (213, 0.43)]
# turn the Python list into an RDD
resultsRdd = sc.parallelize(otherVideos)
rdd = df.rdd.map(lambda row: (row.video_id,row.title))
# perform a join between resultsRdd and rdd
# note that video.title was NOT broadcast
(resultsRdd
.join(rdd)
.filter(lambda pair: pair[1][1] != video.title) # HERE!!!
.takeOrdered(10, key= lambda pair: -pair[1][0]))
I'm running PySpark in local mode (note the --master local[*]), with the following arguments to spark-submit:
--num-executors 12 --executor-cores 4 --executor-memory 1g --master local[*]
Also, I'm running the code above in a Jupyter notebook (the successor to IPython notebooks).
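For context on why I expected this to fail: my understanding is that Spark serializes the function passed to filter together with any variables it references (closure capture), so video.title travels with the task even without an explicit broadcast. A minimal plain-Python sketch of closure capture (no Spark required; the names make_filter, video_title, and the sample pairs are made up for illustration):

import operator

def make_filter(title):
    # the returned lambda "closes over" title; the value is stored
    # in the function's closure cells and travels with the function
    return lambda pair: pair[1][1] != title

video_title = "cats"
f = make_filter(video_title)

# pairs shaped like the (key, (score, title)) tuples after the join
pairs = [(1, (0.9, "cats")), (2, (0.5, "dogs"))]
result = list(filter(f, pairs))

# the captured value is reachable through the closure
captured = f.__closure__[0].cell_contents

When Spark pickles f to ship it to executors, that captured value goes along with it, which would explain why no broadcast is needed for a small variable like video.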