I have a doubt about broadcasting a DataFrame.
Copies of the broadcasted DataFrame are sent to each executor.
So when does Spark evict these copies from each executor?
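For example, something like this (a minimal sketch; the paths and table names are just placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("BroadcastQuestion").getOrCreate()

// Hypothetical inputs, just to show the pattern.
val facts = spark.read.parquet("/data/facts")
val dims  = spark.read.parquet("/data/dims")

// Ask Spark to ship a full copy of `dims` to every executor
// instead of shuffling both sides of the join.
val joined = facts.join(broadcast(dims), Seq("id"))
joined.count()
```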
I find this topic functionally easy to understand, but the manuals are harder to follow technically, and there are always improvements in the offing.
My take:
There is a ContextCleaner running on the driver for every Spark application. It is created and started as soon as the SparkContext starts, and it handles more than broadcasts: its keepCleaning thread, which runs for the lifetime of the application, cleans up RDD, shuffle, broadcast, and accumulator state. Objects are registered for cleanup through methods such as registerShuffleForCleanup (broadcast variables go through registerBroadcastForCleanup), which track them with weak references. When no alive root object points to a registered object any more, the JVM garbage collector enqueues its weak reference and the object becomes eligible for clean-up and eviction; keepCleaning then removes its state from the executors.

Because that depends on a GC actually happening, a context-cleaner-periodic-gc task asynchronously requests a standard JVM garbage collection at a fixed interval (spark.cleaner.periodicGC.interval, default 30 minutes). These periodic runs start when the ContextCleaner starts and stop when it terminates. Spark relies on the standard Java GC throughout; a sketch of the whole lifecycle follows below.
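To make that concrete, here is a minimal sketch of the broadcast lifecycle. The variable names and the 15min interval are my own choices for illustration, not anything prescribed by the docs:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("BroadcastCleanupSketch")
  // Interval at which the context-cleaner-periodic-gc task triggers a
  // driver-side JVM GC (default 30min); a shorter interval lets
  // unreferenced broadcast state be noticed and evicted sooner.
  .config("spark.cleaner.periodicGC.interval", "15min")
  .getOrCreate()
val sc = spark.sparkContext

// Broadcast a small lookup table; each executor gets its own copy.
val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
sc.parallelize(1 to 4).map(k => lookup.value.getOrElse(k, "?")).collect()

// Option 1: evict eagerly instead of waiting for the ContextCleaner.
lookup.unpersist() // drop executor copies; Spark re-ships them if `lookup` is used again
// lookup.destroy() // drop every copy, driver included; `lookup` is unusable afterwards

// Option 2: do nothing. Once no live reference to `lookup` remains, a
// driver GC collects the weak reference the ContextCleaner registered,
// and its keepCleaning loop asks the executors to drop their blocks
// asynchronously.
```

As I understand it, a DataFrame broadcast through a join hint gives you no handle to unpersist, so its per-executor copies are evicted through exactly this ContextCleaner path once the broadcast relation is no longer referenced.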
This is a good reference alongside the official Spark docs: https://mallikarjuna_g.gitbooks.io/spark/content/spark-service-contextcleaner.html