So, I understand that in general one should use coalesce() when the number of partitions decreases due to a filter or some other operation that may result in reducing the original dataset (RDD, DF). coalesce() is useful for running operations more efficiently after filtering down a large dataset.
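Just to make sure we are talking about the same pattern, here is a toy sketch of my own (spark-shell style, with sc in scope; the numbers are meaningless):

// 1M rows spread across 200 partitions...
val big = sc.parallelize(1 to 1000000, 200)
// ...filtered down to ~1% of the rows, leaving 200 mostly-empty partitions.
val small = big.filter(_ % 100 == 0)
// coalesce merges them without a full shuffle, so later stages
// do not pay per-task overhead for nearly-empty partitions.
val compact = small.coalesce(10)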
I also understand that it is less expensive than repartition, as it reduces shuffling by moving data only if necessary. My problem is how to define the parameter that coalesce takes (idealPartionionNo). I am working on a project that was passed to me from another engineer, and he was using the calculation below to compute the value of that parameter.
// DEFINE OPTIMAL PARTITION NUMBER
implicit val NO_OF_EXECUTOR_INSTANCES = sc.getConf.getInt("spark.executor.instances", 5)
implicit val NO_OF_EXECUTOR_CORES = sc.getConf.getInt("spark.executor.cores", 2)
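// REPARTITION_FACTOR is a constant defined elsewhere in the project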
val idealPartionionNo = NO_OF_EXECUTOR_INSTANCES * NO_OF_EXECUTOR_CORES * REPARTITION_FACTOR
This is then used with a partitioner object:
val partitioner = new HashPartitioner(idealPartionionNo)
but also used with:
RDD.filter(x => x._3 < 30).coalesce(idealPartionionNo)
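As I understand it, the partitioner object is then applied to pair RDDs, roughly like this (my own simplified sketch, not the project's actual code):

import org.apache.spark.HashPartitioner

// partitionBy only exists on key-value RDDs; it shuffles the data
// so that records with equal keys land in the same partition.
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (1, "c")))
val byKey = pairs.partitionBy(new HashPartitioner(idealPartionionNo))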
Is this the right approach? What is the main idea behind the idealPartionionNo value computation? What is the REPARTITION_FACTOR? How do I generally go about defining it?
Also, since YARN is responsible for identifying the available executors on the fly, is there a way of getting that number (AVAILABLE_EXECUTOR_INSTANCES) on the fly and using it to compute idealPartionionNo (i.e. replacing NO_OF_EXECUTOR_INSTANCES with AVAILABLE_EXECUTOR_INSTANCES)?
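One idea I had (a sketch only, assuming Spark 2.x; I am not sure this is the sanctioned way) is to ask the status tracker for the executors registered at that moment and recompute from there:

// getExecutorInfos also counts the driver, hence the -1;
// max(..., 1) guards against local mode, where there are no separate executors.
val availableExecutors = math.max(sc.statusTracker.getExecutorInfos.length - 1, 1)
val coresPerExecutor = sc.getConf.getInt("spark.executor.cores", 2)
val dynamicPartitionNo = availableExecutors * coresPerExecutor * REPARTITION_FACTOR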
Ideally, some actual examples of the form:
- Here's a dataset (size);
- Here's a number of transformations and possible reuses of an RDD/DF.
- Here is where you should repartition/coalesce.
- Assume you have n executors with m cores and a partition factor equal to k, then:
- The ideal number of partitions would be ==> ???
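For instance, with made-up numbers just to show the shape of the answer I am after: taking the defaults from the snippet above (n = 5 executors, m = 2 cores) and assuming k = 2, the formula would give 5 * 2 * 2 ==> 20 partitions.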
Also, if you can refer me to a nice blog that explains these concepts, I would really appreciate it.