What is the best way to return the max value (row) for each unique key in a Spark RDD?
I'm using Python and I've tried Math max, mapping and reducing by key, and aggregates. Is there an efficient way to do this? Possibly a UDF?
My data is an RDD of (key, value) pairs:
[(v, 3),
(v, 1),
(v, 1),
(w, 7),
(w, 1),
(x, 3),
(y, 1),
(y, 1),
(y, 2),
(y, 3)]
And I need to return:
[(v, 3),
(w, 7),
(x, 3),
(y, 3)]
For ties, returning either the first value or an arbitrary one is fine.
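For reference, one common way to get this result is `reduceByKey` with Python's built-in `max`. The sketch below is a minimal illustration, not a definitive answer: the helper names `max_per_key` and `demo` are hypothetical, and it assumes the RDD holds plain `(key, value)` tuples with comparable values and string keys.

```python
def max_per_key(rdd):
    """Return an RDD holding the largest value seen for each key.

    Assumes `rdd` contains (key, value) tuples with comparable values.
    """
    # reduceByKey applies the function pairwise to the values of each key,
    # combining within partitions before shuffling partial results.
    return rdd.reduceByKey(max)


def demo():
    """Hypothetical usage on the sample data from the question."""
    from pyspark import SparkContext  # imported here so the helper stays importable without Spark

    sc = SparkContext.getOrCreate()
    pairs = [("v", 3), ("v", 1), ("v", 1), ("w", 7), ("w", 1),
             ("x", 3), ("y", 1), ("y", 1), ("y", 2), ("y", 3)]
    # collect() order is not guaranteed, so sort for a stable display.
    return sorted(max_per_key(sc.parallelize(pairs)).collect())
```

With the sample data, `demo()` should return `[('v', 3), ('w', 7), ('x', 3), ('y', 3)]`. Since tied values are equal, it does not matter which one `max` keeps. No UDF is needed for this on an RDD.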