apache spark - Why does SparkContext.parallelize use memory of the driver?

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I need to create a parallelized collection using sc.parallelize() in PySpark (Spark 2.1.0).

The collection in my driver program is big. When I parallelize it, it takes up a lot of memory on the master node.

It seems the collection is still kept in Spark's memory on the master node after it has been parallelized across the worker nodes. Here's an example of my code:

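# my python code
from pyspark import SparkContext

sc = SparkContext()
a = [1.0] * 1000000000  # one billion floats held in the driver program
rdd_a = sc.parallelize(a, 1000000)  # split into one million partitions
total = rdd_a.reduce(lambda x, y: x + y)  # named total to avoid shadowing the built-in sum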

I've tried

del a

to free it, but it didn't work. Spark, which runs as a Java process, is still using a lot of memory.

After I create rdd_a, how can I release a to free the master node's memory?

Thanks!

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:23:29+0000

The job of the master (the driver, in Spark terms) is to coordinate the workers and to hand a worker a new task once it has completed its current one. To do that, the master has to keep track of every task that still needs to run for a given calculation.

Now, if the input were a file, a task would simply look like "read file F from X to Y". But because the input was in memory to begin with, each task has to carry its own data: with 1,000,000,000 elements split into 1,000,000 partitions, a task looks like 1,000 numbers. And since the master needs to keep track of all 1,000,000 tasks, that gets quite large.
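To make the difference concrete, here is a minimal plain-Python sketch of the two kinds of task description. It is a toy model, not Spark's actual internals: the helper names slice_tasks and range_tasks are made up for illustration, pickle stands in for whatever serialization the scheduler really uses, and the collection is scaled down to 100,000 elements in 100 tasks so that it runs quickly.

```python
import pickle


def slice_tasks(data, num_slices):
    """Hypothetical helper: each task carries its own chunk of the data."""
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


def range_tasks(n, num_slices):
    """Hypothetical helper: each task only says which range to process."""
    return [(i * n // num_slices, (i + 1) * n // num_slices)
            for i in range(num_slices)]


# Scaled-down version of the question's collection (assumed sizes).
data = [1.0] * 100_000
num_slices = 100

embedded = slice_tasks(data, num_slices)        # like parallelizing a list
described = range_tasks(len(data), num_slices)  # like "read F from X to Y"

# Bytes the coordinator would have to hold to describe all tasks.
embedded_bytes = sum(len(pickle.dumps(task)) for task in embedded)
described_bytes = sum(len(pickle.dumps(task)) for task in described)

# Both descriptions cover the same work.
total_from_embedded = sum(sum(task) for task in embedded)
total_from_described = sum(sum(data[start:stop]) for start, stop in described)

print("tasks carrying data:  ", embedded_bytes, "bytes")
print("tasks carrying ranges:", described_bytes, "bytes")
```

The range-style descriptions stay tiny no matter how many elements each task covers, while the data-carrying ones grow with the collection. That is why generating or loading the data on the workers (for example, by reading from a file) tends to keep the coordinator's memory use low.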
