Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
404 views
in Technique[技术] by (71.8m points)

python - Get the max value for each key in a Spark RDD

What is the best way to return the max row (value) associated with each unique key in a spark RDD?

I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?

I have in RDD format:

[(v, 3),
 (v, 1),
 (v, 1),
 (w, 7),
 (w, 1),
 (x, 3),
 (y, 1),
 (y, 1),
 (y, 2),
 (y, 3)]

And I need to return:

[(v, 3),
 (w, 7),
 (x, 3),
 (y, 3)]

Ties can return the first value or random.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:

(Scala)

val grouped = rdd.reduceByKey(math.max(_, _))

(Python)

grouped = rdd.reduceByKey(max)

(Java 7)

JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer v1, Integer v2) {
            return Math.max(v1, v2);
    }
});

(Java 8)

JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
    (v1, v2) -> Math.max(v1, v2)
);

API doc for reduceByKey:


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...