The following question is specific to PySpark version 1.5.0, since new features are constantly being added to PySpark.
How do you list all items of column y based on the values of column x?
For example:
rdd = sc.parallelize([{'x': "foo", 'y': 1},
                      {'x': "foo", 'y': 1},
                      {'x': "bar", 'y': 10},
                      {'x': "bar", 'y': 2},
                      {'x': "qux", 'y': 999}])
df = sqlCtx.createDataFrame(rdd)
df.show()
+---+---+
| x| y|
+---+---+
|foo| 1|
|foo| 1|
|bar| 10|
|bar| 2|
|qux|999|
+---+---+
I would like to have something like:
+---+-------+
|  x|      y|
+---+-------+
|foo| [1, 1]|
|bar|[10, 2]|
|qux|  [999]|
+---+-------+
The order doesn't matter. In Pandas, I can achieve this using groupby:
pd = df.toPandas()
pd.groupby('x')['y'].apply(list).reset_index()
However, the groupBy aggregation functionality in version 1.5.0 seems to be very limited. Any idea how to overcome this limitation?
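For reference, the only workaround I can think of is dropping down to the RDD API, using groupByKey, and rebuilding a DataFrame, roughly like the sketch below (untested on 1.5.0; pairs, grouped, and grouped_df are just placeholder names of mine):

from pyspark.sql import Row

# Rough sketch: map each row to an (x, y) pair, group by key,
# then turn the grouped values back into a DataFrame.
pairs = df.rdd.map(lambda row: (row.x, row.y))
grouped = pairs.groupByKey().map(lambda kv: Row(x=kv[0], y=list(kv[1])))
grouped_df = sqlCtx.createDataFrame(grouped)
grouped_df.show()

I'm not sure this is efficient or idiomatic, which is why I'm asking whether there is a better way within the DataFrame API itself.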