When creating the sklearn.feature_extraction.text.CountVectorizer instance, pass a word tokenizer that treats punctuation as separate tokens via the tokenizer parameter.
For example, nltk.tokenize.TreebankWordTokenizer treats most punctuation characters as separate tokens:
import sklearn.feature_extraction.text
from nltk.tokenize import TreebankWordTokenizer

ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size),
                                                        tokenizer=TreebankWordTokenizer().tokenize)
vect.fit(string)  # the vectorizer must be fitted before feature names are available
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))
outputs:
4-grams: [u"'s pretty awesome .", u", it 's pretty", u'i really like python',
u"it 's pretty awesome", u'like python , it', u"python , it 's",
u'really like python ,']
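The output above is from Python 2 with an older scikit-learn. As a minimal sketch for newer versions (assuming scikit-learn >= 1.0), get_feature_names() has been replaced by get_feature_names_out(), and passing token_pattern=None avoids the warning that the default token pattern is ignored when a custom tokenizer is supplied; the n-grams themselves come out the same, just without the u prefixes:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer

ngram_size = 4
docs = ["I really like python, it's pretty awesome."]
vect = CountVectorizer(ngram_range=(ngram_size, ngram_size),
                       tokenizer=TreebankWordTokenizer().tokenize,
                       token_pattern=None)  # silence the unused token_pattern warning
vect.fit(docs)
print('{1}-grams: {0}'.format(list(vect.get_feature_names_out()), ngram_size))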