Update: RDD.toLocalIterator
method that appeared after the original answer has been written is a more efficient way to do the job. It uses runJob
to evaluate only a single partition on each step.
TL;DR And the original answer might give a rough idea how it works:
First of all, get the array of partition indexes:
val parts = rdd.partitions
Then create smaller rdds filtering out everything but a single partition. Collect the data from smaller rdds and iterate over values of a single partition:
for (p <- parts) {
val idx = p.index
val partRdd = rdd.mapPartitionsWithIndex(a => if (a._1 == idx) a._2 else Iterator(), true)
//The second argument is true to avoid rdd reshuffling
val data = partRdd.collect //data contains all values from a single partition
//in the form of array
//Now you can do with the data whatever you want: iterate, save to a file, etc.
}
I didn't try this code, but it should work. Please write a comment if it won't compile. Of cause, it will work only if the partitions are small enough. If they aren't, you can always increase the number of partitions with rdd.coalesce(numParts, true)
.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…