I have data in a DataFrame as below:
datetime             | userId | memberId | value
2016-04-06 16:36:... | 1234   | 111      | 1
2016-04-06 17:35:... | 1234   | 222      | 5
2016-04-06 17:50:... | 1234   | 111      | 8
2016-04-06 18:36:... | 1234   | 222      | 9
2016-04-05 16:36:... | 4567   | 111      | 1
2016-04-06 17:35:... | 4567   | 222      | 5
2016-04-06 18:50:... | 4567   | 111      | 8
2016-04-06 19:36:... | 4567   | 222      | 9
I need to find the max(datetime) grouped by userId and memberId. When I tried the following:
df2 = df.groupBy('userId','memberId').max('datetime')
I get this error:
org.apache.spark.sql.AnalysisException: "datetime" is not a numeric
column. Aggregation function can only be applied on a numeric column.;
The output I want is:
userId | memberId | datetime
1234   | 111      | 2016-04-06 17:50:...
1234   | 222      | 2016-04-06 18:36:...
4567   | 111      | 2016-04-06 18:50:...
4567   | 222      | 2016-04-06 19:36:...
Can someone please show me how to get the max datetime per group from this data using PySpark DataFrames?
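From what I can tell, one possible approach (not sure if it is the idiomatic one) would be to use agg with pyspark.sql.functions.max, which, unlike the numeric-only GroupedData.max shortcut, also accepts timestamp and string columns. A minimal sketch, assuming a DataFrame named df with the columns above; the timestamps here are placeholder values since mine are truncated:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder rows mirroring the structure above; the real timestamps are truncated.
df = spark.createDataFrame(
    [
        ("2016-04-06 16:36:00", 1234, 111, 1),
        ("2016-04-06 17:50:00", 1234, 111, 8),
        ("2016-04-06 17:35:00", 1234, 222, 5),
        ("2016-04-06 18:36:00", 1234, 222, 9),
    ],
    ["datetime", "userId", "memberId", "value"],
)

# F.max inside agg works on timestamp/string columns, unlike GroupedData.max,
# which only accepts numeric columns (hence the AnalysisException above).
df2 = df.groupBy("userId", "memberId").agg(F.max("datetime").alias("datetime"))
df2.show(truncate=False)

Would this be the right direction, or is a Window function needed if I also want to keep the other columns of the max row?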