cv.vocabulary_
in this instance is a dict, where the keys are the words (features) that you've found and the values are indices, which is why they're 0, 1, 2, 3
. It's just bad luck that it looked similar to your counts :)
You need to work with the cv_fit
object to get the counts
from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)
print(cv.get_feature_names())
print(cv_fit.toarray())
#['bird', 'cat', 'dog', 'fish']
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]
Each row in the array is one of your original documents (strings), each column is a feature (word), and the element is the count for that particular word and document. You can see that if you sum each column you'll get the correct number
print(cv_fit.toarray().sum(axis=0))
#[2 3 2 2]
Honestly though, I'd suggest using collections.Counter
or something from NLTK, unless you have some specific reason to use scikit-learn, as it'll be simpler.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…