Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
577 views
in Technique[技术] by (71.8m points)

python - Count distinct words from a Pandas Data Frame

I've a Pandas data frame, where one column contains text. I'd like to get a list of unique words appearing across the entire column (space being the only split).

import pandas as pd

r1=['My nickname is ft.jgt','Someone is going to my place']

df=pd.DataFrame(r1,columns=['text'])

The output should look like this:

['my','nickname','is','ft.jgt','someone','going','to','place']

It wouldn't hurt to get a count as well, but it is not required.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Use a set to create the sequence of unique elements.

Do some clean-up on df to get the strings in lower case and split:

df['text'].str.lower().str.split()
Out[43]: 
0             [my, nickname, is, ft.jgt]
1    [someone, is, going, to, my, place]

Each list in this column can be passed to set.update function to get unique values. Use apply to do so:

results = set()
df['text'].str.lower().str.split().apply(results.update)
print(results)

set(['someone', 'ft.jgt', 'my', 'is', 'to', 'going', 'place', 'nickname'])

Or use with Counter() from comments:

from collections import Counter
results = Counter()
df['text'].str.lower().str.split().apply(results.update)
print(results)

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...