The following question is specific to PySpark version 1.5.0, since new features are constantly being added to PySpark.
How do you list all items of column y based on the values of column x?
For example:
rdd = sc.parallelize([{'x': "foo", 'y': 1},
                      {'x': "foo", 'y': 1},
                      {'x': "bar", 'y': 10},
                      {'x': "bar", 'y': 2},
                      {'x': "qux", 'y': 999}])
df = sqlCtx.createDataFrame(rdd)
df.show()
+---+---+
| x| y|
+---+---+
|foo| 1|
|foo| 1|
|bar| 10|
|bar| 2|
|qux|999|
+---+---+
I would like to have something like:
+---+-------+
|  x|      y|
+---+-------+
|foo| [1, 1]|
|bar|[10, 2]|
|qux|  [999]|
+---+-------+
The order doesn't matter. In Pandas, I can achieve this using groupby:
pd = df.toPandas()
pd.groupby('x')['y'].apply(list).reset_index()
However, the groupBy aggregation functionality in version 1.5.0 seems to be very limited. Any idea how to overcome this limitation?
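For reference, the only workaround I can think of is dropping down to the RDD API, using groupByKey, and rebuilding a DataFrame, roughly like the sketch below (untested on 1.5.0; pairs, grouped, and grouped_df are just placeholder names of mine):

from pyspark.sql import Row

# Rough sketch: map each row to an (x, y) pair, group by key,
# then turn the grouped values back into a DataFrame.
pairs = df.rdd.map(lambda row: (row.x, row.y))
grouped = pairs.groupByKey().map(lambda kv: Row(x=kv[0], y=list(kv[1])))
grouped_df = sqlCtx.createDataFrame(grouped)
grouped_df.show()

I'm not sure this is efficient or idiomatic, which is why I'm asking whether there is a better way within the DataFrame API itself.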