Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
2.0k views
in Technique[技术] by (71.8m points)

apache spark - PySpark - get row number for each row in a group

Using pyspark, I'd like to be able to group a spark dataframe, sort the group, and then provide a row number. So

Group    Date
  A      2000
  A      2002
  A      2007
  B      1999
  B      2015

Would become

Group    Date    row_num
  A      2000      0
  A      2002      1
  A      2007      2
  B      1999      0
  B      2015      1
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Use window function:

from pyspark.sql.window import *
from pyspark.sql.functions import row_number

df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...