python - How do I get word frequency in a corpus using Scikit Learn CountVectorizer?

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm trying to compute a simple word frequency using scikit-learn's CountVectorizer.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird","bird"]
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

print(cv.vocabulary_)
{'bird': 0, 'cat': 1, 'dog': 2, 'fish': 3}

I was expecting it to return {'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}.

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:29:35+0000

cv.vocabulary_ is a dict whose keys are the words (features) found in the corpus and whose values are their column indices in the count matrix, which is why they're 0, 1, 2, 3. It's just bad luck that they looked similar to your counts :)

You need to work with the cv_fit object to get the counts:

from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out())
print(cv_fit.toarray())
#['bird' 'cat' 'dog' 'fish']
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

Each row in the array is one of your original documents (strings), each column is a feature (word), and each element is the count of that word in that document. Summing each column gives the total count of each word across the corpus:

print(cv_fit.toarray().sum(axis=0))
#[2 3 2 2]

Honestly though, I'd suggest using collections.Counter or something from NLTK, unless you have some specific reason to use scikit-learn, as it'll be simpler.
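
For comparison, here is a rough sketch of the collections.Counter route. It assumes plain whitespace splitting is good enough; CountVectorizer's default tokenizer also strips punctuation and drops one-character tokens, so the two can disagree on messier text.

```python
from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Lowercase each document and split on whitespace, then count every word
# across the whole corpus.
word_counts = Counter(word for text in texts for word in text.lower().split())

print(word_counts.most_common())
# [('cat', 3), ('dog', 2), ('fish', 2), ('bird', 2)]
```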
