Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Python remove stop words from pandas dataframe

I want to remove the stop words from my column "tweet". How do I iterate over each row and each item?

pos_tweets = [('I love this car', 'positive'),
    ('This view is amazing', 'positive'),
    ('I feel great this morning', 'positive'),
    ('I am so excited about the concert', 'positive'),
    ('He is my best friend', 'positive')]

import pandas as pd

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]
test["tweet"] = test["tweet"].str.lower().str.split()

from nltk.corpus import stopwords
stop = stopwords.words('english')

1 Reply

We can import stopwords from nltk.corpus as shown below, then exclude the stop words with a Python list comprehension and pandas.Series.apply.

# Import pandas and the NLTK stop word list.
import pandas as pd
from nltk.corpus import stopwords
stop = stopwords.words('english')

pos_tweets = [('I love this car', 'positive'),
    ('This view is amazing', 'positive'),
    ('I feel great this morning', 'positive'),
    ('I am so excited about the concert', 'positive'),
    ('He is my best friend', 'positive')]

test = pd.DataFrame(pos_tweets)
test.columns = ["tweet","class"]

# Exclude stopwords with Python's list comprehension and pandas.DataFrame.apply.
test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))
print(test)
# Output:
#                                tweet     class tweet_without_stopwords
# 0                    I love this car  positive              I love car
# 1               This view is amazing  positive       This view amazing
# 2          I feel great this morning  positive    I feel great morning
# 3  I am so excited about the concert  positive       I excited concert
# 4               He is my best friend  positive          He best friend
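In the output above, "I", "This" and "He" survive because the NLTK list is all lowercase and the membership test is case-sensitive. If those words should go too, one option is to compare the lowercase form of each word. The sketch below is a minimal illustration: the name remove_stopwords and the small STOP set are hypothetical stand-ins, not part of NLTK or pandas.

```python
# Hypothetical stand-in for stopwords.words('english'); the real NLTK list is
# much longer, but it is likewise all lowercase.
STOP = {"i", "this", "is", "am", "so", "about", "the", "he", "my"}

def remove_stopwords(text, stop_words=STOP):
    """Drop every word whose lowercase form is in stop_words."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

print(remove_stopwords("I love this car"))       # love car
print(remove_stopwords("He is my best friend"))  # best friend
```

With the real list, this could be applied as test['tweet'].apply(remove_stopwords, stop_words=set(stop)). Converting the list to a set keeps each lookup fast.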

The stop words can also be removed with pandas.Series.str.replace and a regular expression.

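# \b anchors the match to whole words, so "a" is not removed from inside "car".
pat = r'\b(?:{})\b'.format('|'.join(stop))
test['tweet_without_stopwords'] = test['tweet'].str.replace(pat, '', regex=True)
# Collapse the runs of whitespace left behind by the removed words.
test['tweet_without_stopwords'] = test['tweet_without_stopwords'].str.replace(r'\s+', ' ', regex=True)
# Same results.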
# 0              I love car
# 1       This view amazing
# 2    I feel great morning
# 3       I excited concert
# 4          He best friend
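When building a pattern from a word list, it is usually safer to escape each word and to wrap the alternation in word boundaries. The sketch below shows the idea with the standard re module only. The short stop list and the name strip_stopwords are hypothetical, chosen for illustration.

```python
import re

# Hypothetical stand-in for the NLTK list, including an entry with an apostrophe.
stop = ["this", "is", "don't", "a", "about", "the"]

# Escape each word so it is matched literally, and wrap the alternation in
# word boundaries so "a" does not match inside "car" or "amazing".
pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, stop))))

def strip_stopwords(text):
    without = pattern.sub("", text)
    # Collapse the gaps left behind and trim the ends.
    return re.sub(r"\s+", " ", without).strip()

print(strip_stopwords("This view is amazing"))  # This view amazing
print(strip_stopwords("I love this car"))       # I love car
```

The pattern string (pattern.pattern) can be passed to Series.str.replace with regex=True to get the same effect on a whole column.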

If importing stopwords fails because the corpus is not installed, download it first:

import nltk
nltk.download('stopwords')

Another option is to use text.ENGLISH_STOP_WORDS from sklearn.feature_extraction.

# Import stopwords with scikit-learn
from sklearn.feature_extraction import text
stop = text.ENGLISH_STOP_WORDS

Note that the scikit-learn and NLTK stop word lists are not the same: they contain different numbers of words, so the output can differ depending on which list you use.
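To see how two stop word lists differ before choosing one, a set comparison is enough. The sketch below uses a hypothetical helper, compare_stoplists, and two made-up miniature lists. They do not reproduce the real contents of either library.

```python
def compare_stoplists(first, second):
    """Return (shared, only_in_first, only_in_second) as sorted lists."""
    first, second = set(first), set(second)
    return sorted(first & second), sorted(first - second), sorted(second - first)

# Hypothetical miniature lists, for illustration only; they are not the real
# contents of either library.
list_a = ["the", "is", "don't"]
list_b = ["the", "is", "whereas"]

shared, only_a, only_b = compare_stoplists(list_a, list_b)
print(shared)  # ['is', 'the']
print(only_a)  # ["don't"]
print(only_b)  # ['whereas']
```

With both libraries installed, the same helper could be called as compare_stoplists(stopwords.words('english'), text.ENGLISH_STOP_WORDS).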

