Actually, while `take` is not entirely parallel, it's not entirely sequential either. For example, say you call `take(200)` and each partition has 10 elements. `take` will first fetch partition 0 and see that it has 10 elements. From that it estimates it would need 20 such partitions to get 200 elements, but it's better to ask for a bit more in the parallel request. So it aims for 30 partitions, and it already has 1. It therefore fetches partitions 1 to 29 next, in parallel. This will likely be the last step. If it is very unlucky and still does not have 200 elements in total, it will again make an estimate and request another batch in parallel.
Check out the code; it's well documented:
https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1049
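That loop is short enough to paraphrase. Below is a self-contained sketch of the idea (my own simplified version, not the exact 1.2.0 source; the name `takeSketch` and the local-mode setup are mine, and the `allowLocal` flag only exists in the 1.x API):

```scala
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TakeSketch {
  // Simplified paraphrase of the driver-side estimation loop behind RDD.take.
  def takeSketch[T: ClassTag](rdd: RDD[T], num: Int): Array[T] = {
    val buf = new ArrayBuffer[T]
    val totalParts = rdd.partitions.length
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) {
      // First pass: try a single partition.
      var numPartsToTry = 1
      if (partsScanned > 0) {
        if (buf.isEmpty) {
          // Found nothing so far: quadruple the number of partitions and retry.
          numPartsToTry = partsScanned * 4
        } else {
          // Interpolate how many partitions should hold num elements, overestimate
          // by 50%, and subtract the partitions already scanned. With take(200) and
          // 10 elements per partition this gives 29 after the first pass.
          numPartsToTry = math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
        }
      }
      val left = num - buf.size
      val range = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
      // One parallel job over the chosen range of partitions.
      val res = rdd.sparkContext.runJob(
        rdd, (it: Iterator[T]) => it.take(left).toArray, range, allowLocal = true)
      res.foreach(chunk => buf ++= chunk.take(num - buf.size))
      partsScanned += numPartsToTry
    }
    buf.take(num).toArray
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("take-sketch").setMaster("local[4]"))
    val rdd = sc.parallelize(1 to 1000, numSlices = 100) // 100 partitions, 10 elements each
    println(takeSketch(rdd, 200).length) // 200, typically reached in two passes
    sc.stop()
  }
}
```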
I think the documentation is wrong on this point: local computation only happens when a single partition is requested. That is the case in the first pass (fetching partition 0), but typically not in later passes.
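To make the single-partition point concrete, here is a hedged illustration against the same 1.2-era `runJob` API (my own setup, not Spark's code; `allowLocal` was removed in later releases):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AllowLocalDemo {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("allow-local-demo").setMaster("local[4]"))
    val rdd = sc.parallelize(1 to 1000, numSlices = 100) // 100 partitions, 10 elements each

    // First pass of take(200): one partition and no parent stages, so with
    // allowLocal = true the 1.2-era DAGScheduler is permitted to run it on the driver.
    val first = sc.runJob(rdd, (it: Iterator[Int]) => it.take(200).toArray, Seq(0), allowLocal = true)

    // A later pass requests a whole range of partitions, so even with
    // allowLocal = true it is scheduled on the executors as a normal job.
    val rest = sc.runJob(rdd, (it: Iterator[Int]) => it.take(190).toArray, 1 until 30, allowLocal = true)

    // 10 elements from the first pass, 290 candidates from the second;
    // take itself would keep only the first 200.
    println((first ++ rest).map(_.length).sum)
    sc.stop()
  }
}
```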