python - Combining text stemming and removal of punctuation in NLTK and scikit-learn

Question

Welcome To Ask or Share your Answers For Others

python - Combining text stemming and removal of punctuation in NLTK and scikit-learn

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Combining text stemming and removal of punctuation in NLTK and scikit-learn

I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming words and tokenization.

Below is an example of the plain usage of the CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)

sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])

print('Vocabulary: %s' %vec.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

Which will print

Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]

Now, let's say I want to remove stop words and stem the words. One option would be to do it like so:

from nltk import word_tokenize          
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
######## 

vect = CountVectorizer(tokenizer=tokenize, stop_words='english') 

vect.fit(vocab)

sentence1 = vect.transform(['The swimmer likes swimming.'])
sentence2 = vect.transform(['The swimmer swims.'])

print('Vocabulary: %s' %vect.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())

Which prints:

Vocabulary: ['.', 'like', 'swim', 'swimmer']
Sentence 1: [[1 1 1 1]]
Sentence 2: [[1 0 1 1]]

But how would I best get rid of the punctuation characters in this second version?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:45:36+0000

There are several options, try remove the punctuation before tokenization. But this would mean that don't -> dont

import string

def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

Or try removing punctuation after tokenization.

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    tokens = [i for i in tokens if i not in string.punctuation]
    stems = stem_tokens(tokens, stemmer)
    return stems

EDITED

The above code will work but it's rather slow because it's looping through the same text multiple times:

Once to remove punctuation
Second time to tokenize
Third time to stem.

If you have more steps like removing digits or removing stopwords or lowercasing, etc.

It would be better to lump the steps together as much as possible, here's several better answers that is more efficient if your data requires more pre-processing steps:

Categories

python - Combining text stemming and removal of punctuation in NLTK and scikit-learn

python - Combining text stemming and removal of punctuation in NLTK and scikit-learn

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

EDITED

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags