Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Pyspark Categorical data vectorization with numerical values associated with it

I'm new to PySpark programming and need some help.

I have a dataset with a categorical feature, and each categorical value has a numerical value associated with it. I would like to vectorize the categorical feature so that the resulting vector carries the associated numerical values. The categorical column has ~3 million possible values.

[Image not reproduced: a sample of the dataset. The answer below refers to its columns UserID, Fruit Purchased, and Quantity.]

Question from: https://stackoverflow.com/questions/65837384/pyspark-categorical-data-vectorization-with-numerical-values-associated-with-it

Fight dragons for too long and you become a dragon yourself; gaze into the abyss for too long and the abyss gazes back…

1 Reply


You can group by UserID and aggregate the Quantity column into an array:

import pyspark.sql.functions as F

df2 = df.groupBy('UserID').agg(F.collect_list('Quantity').alias('Quantity'))

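However, collect_list does not guarantee the order of the collected values (it depends on the row order, which may change after a shuffle), so the quantities may not line up with the same fruits from one user to the next. To get a consistent order, collect (fruit, quantity) pairs, sort them by fruit, and keep only the quantity: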

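# Collect (fruit, quantity) structs, sort them by fruit name, then keep only the quantity.
# struct() is used instead of array(): array() requires one common element type, so mixing
# the fruit name with a number would typically turn the quantities into strings.
df2 = df.groupBy('UserID').agg(
    F.expr("transform(array_sort(collect_list(struct(`Fruit Purchased`, Quantity))), x -> x.Quantity) AS Quantity")
)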
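To make the effect of the sorting step concrete, here is a minimal plain-Python sketch (no Spark involved) of what the expression computes for each user. The sample rows and the helper name quantities_sorted_by_fruit are made up for illustration; only the column meanings come from the code above.

```python
from collections import defaultdict

# Hypothetical sample rows: (UserID, Fruit Purchased, Quantity).
rows = [
    (1, "Pear", 2),
    (1, "Apple", 5),
    (2, "Banana", 1),
    (2, "Apple", 3),
]


def quantities_sorted_by_fruit(rows):
    """Group by user, sort each user's (fruit, quantity) pairs by fruit, keep the quantity."""
    pairs_by_user = defaultdict(list)
    for user_id, fruit, quantity in rows:
        pairs_by_user[user_id].append((fruit, quantity))
    return {
        user_id: [quantity for _, quantity in sorted(pairs)]
        for user_id, pairs in pairs_by_user.items()
    }


result = quantities_sorted_by_fruit(rows)
print(result)  # {1: [5, 2], 2: [3, 1]}
```

Note that the second position means Pear for user 1 but Banana for user 2. Sorting fixes the order within each user, but the arrays are only comparable across users if every user has a row for every fruit.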

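Alternatively, you can pivot on the fruit column. This also fixes the order of the fruits and, unlike the two approaches above, yields null for fruits a user did not purchase, so every array has the same length: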

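# One column per distinct fruit; null where a user has no purchase of that fruit.
df2 = df.groupBy('UserID').pivot('Fruit Purchased').agg(F.first('Quantity'))
# Pack the per-fruit columns (everything after UserID) into a single array column.
df3 = df2.select('UserID', F.array(*df2.columns[1:]).alias('Quantity'))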
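The pivot can be pictured with a small plain-Python sketch (again without Spark). It assumes made-up sample rows and a hypothetical helper pivot_to_vectors, and it only roughly models F.first by keeping the first quantity seen for each (user, fruit) pair.

```python
# Hypothetical sample rows: (UserID, Fruit Purchased, Quantity).
rows = [
    (1, "Pear", 2),
    (1, "Apple", 5),
    (2, "Banana", 1),
    (2, "Apple", 3),
]


def pivot_to_vectors(rows):
    """Return the sorted fruit list and one fixed-length quantity vector per user."""
    fruits = sorted({fruit for _, fruit, _ in rows})
    position = {fruit: i for i, fruit in enumerate(fruits)}
    vectors = {}
    for user_id, fruit, quantity in rows:
        # Every user gets a slot for every fruit; None marks "no purchase".
        vector = vectors.setdefault(user_id, [None] * len(fruits))
        # Keep the first quantity seen for a (user, fruit) pair.
        if vector[position[fruit]] is None:
            vector[position[fruit]] = quantity
    return fruits, vectors


fruits, vectors = pivot_to_vectors(rows)
print(fruits)   # ['Apple', 'Banana', 'Pear']
print(vectors)  # {1: [5, None, 2], 2: [3, 1, None]}
```

Here each position always refers to the same fruit. With ~3 million distinct categories, however, one column per category is unlikely to be practical: Spark caps the number of pivot values through spark.sql.pivotMaxValues (10,000 by default), so a sparse representation may be a better fit at that scale.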
