python - Retrieve top n in each group of a DataFrame in pyspark

Question

Welcome To Ask or Share your Answers For Others

python - Retrieve top n in each group of a DataFrame in pyspark

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Retrieve top n in each group of a DataFrame in pyspark

There's a DataFrame in pyspark with data as below:

user_id object_id score
user_1  object_1  3
user_1  object_1  1
user_1  object_2  2
user_2  object_1  5
user_2  object_2  2
user_2  object_2  6

What I expect is returning 2 records in each group with the same user_id, which need to have the highest score. Consequently, the result should look as the following:

user_id object_id score
user_1  object_1  3
user_1  object_2  2
user_2  object_2  6
user_2  object_1  5

I'm really new to pyspark, could anyone give me a code snippet or portal to the related documentation of this problem? Great thanks!

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T22:33:22+0000

I believe you need to use window functions to attain the rank of each row based on user_id and score, and subsequently filter your results to only keep the first two values.

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

window = Window.partitionBy(df['user_id']).orderBy(df['score'].desc())

df.select('*', rank().over(window).alias('rank')) 
  .filter(col('rank') <= 2) 
  .show() 
#+-------+---------+-----+----+
#|user_id|object_id|score|rank|
#+-------+---------+-----+----+
#| user_1| object_1|    3|   1|
#| user_1| object_2|    2|   2|
#| user_2| object_2|    6|   1|
#| user_2| object_1|    5|   2|
#+-------+---------+-----+----+

In general, the official programming guide is a good place to start learning Spark.

Data

rdd = sc.parallelize([("user_1",  "object_1",  3), 
                      ("user_1",  "object_2",  2), 
                      ("user_2",  "object_1",  5), 
                      ("user_2",  "object_2",  2), 
                      ("user_2",  "object_2",  6)])
df = sqlContext.createDataFrame(rdd, ["user_id", "object_id", "score"])

Categories

python - Retrieve top n in each group of a DataFrame in pyspark

python - Retrieve top n in each group of a DataFrame in pyspark

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Data

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags