Suppose I have the following RDD:
rdd = sc.parallelize([('a', (5,1)), ('d', (8,2)), ('c', (6,3)), ('a', (8,2)), ('d', (9,6)), ('b', (3,4)), ('c', (8,3))])
How can I use repartitionAndSortWithinPartitions to sort by x[0] and then by x[1][0]? With the following, I sort only by the key (x[0]):
Npartitions = sc.defaultParallelism
rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: hash(x) % Npartitions, 2)
One way to do it is the following, but I guess there should be something simpler:
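Npartitions = sc.defaultParallelism
# Parentheses let the method chain span several lines
partitioned_data = (rdd
    .partitionBy(2)
    .map(lambda x: (x[0], x[1][0], x[1][1]))  # flatten to (letter, number2, number3)
    .toDF(['letter', 'number2', 'number3'])
    .sortWithinPartitions(['letter', 'number2'], ascending=False)
    .rdd  # back to an RDD of Rows; DataFrame has no .map in Spark 2.0+
    .map(lambda x: (x.letter, (x.number2, x.number3))))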
>>> partitioned_data.glom().collect()
[[],
[(u'd', (9, 6)), (u'd', (8, 2))],
[(u'c', (8, 3)), (u'c', (6, 3))],
[(u'b', (3, 4))],
[(u'a', (8, 2)), (u'a', (5, 1))]]
As can be seen, I have to convert the RDD to a DataFrame in order to use sortWithinPartitions. Is there another way, using repartitionAndSortWithinPartitions?
(It doesn't matter that the data is not globally sorted; I only care that it is sorted within each partition.)
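For illustration, here is a rough sketch of the kind of thing I have in mind, not something I have verified on a cluster. Since repartitionAndSortWithinPartitions sorts only by the key, the idea is to move x[1][0] into the key temporarily, partition on the letter alone, and then restore the original shape. The helper names (to_composite, from_composite, letter_partition, sort_within_partitions) are made up for this example, and using zlib.crc32 as the partitioner hash is just an assumption to keep the hash deterministic across workers.

```python
import zlib


def to_composite(record):
    """('a', (5, 1)) -> (('a', 5), 1): move the first value field into the key."""
    letter, (first, second) = record
    return ((letter, first), second)


def from_composite(record):
    """(('a', 5), 1) -> ('a', (5, 1)): restore the original record shape."""
    (letter, first), second = record
    return (letter, (first, second))


def letter_partition(composite_key):
    """Hash only the letter, so records with the same letter share a partition.

    Spark applies the modulo by the number of partitions itself.
    """
    return zlib.crc32(composite_key[0].encode("utf-8"))


def sort_within_partitions(rdd, num_partitions=2, ascending=True):
    """Sort each partition by (x[0], x[1][0]) without going through a DataFrame."""
    return (rdd
            .map(to_composite)
            .repartitionAndSortWithinPartitions(
                numPartitions=num_partitions,
                partitionFunc=letter_partition,
                ascending=ascending)
            .map(from_composite))
```

It would be called as sort_within_partitions(rdd, 2).glom().collect(), with ascending=False to mirror the descending order of the DataFrame version above.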