python - Scikit Learn TfidfVectorizer : How to get top n terms with highest tf-idf score

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am working on a keyword extraction problem. Consider this very general case:

from sklearn.feature_extraction.text import TfidfVectorizer

# NOTE: tokenize is a user-defined tokenizer whose definition is not shown here
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')

t = """Two Travellers, walking in the noonday sun, sought the shade of a widespreading tree to rest. As they lay looking up among the pleasant leaves, they saw that it was a Plane Tree.

"How useless is the Plane!" said one of them. "It bears no fruit whatever, and only serves to litter the ground with leaves."

"Ungrateful creatures!" said a voice from the Plane Tree. "You lie here in my cooling shade, and yet you say I am useless! Thus ungratefully, O Jupiter, do men receive their blessings!"

Our best blessings are often the least appreciated."""

tfs = tfidf.fit_transform(t.split(" "))
query = 'tree cat travellers fruit jupiter'
response = tfidf.transform([query])
# scikit-learn >= 1.0; older versions use tfidf.get_feature_names()
feature_names = tfidf.get_feature_names_out()

for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])

and this gives me

  (0, 28)   0.443509712811
  (0, 27)   0.517461475101
  (0, 8)    0.517461475101
  (0, 6)    0.517461475101
tree  -  0.443509712811
travellers  -  0.517461475101
jupiter  -  0.517461475101
fruit  -  0.517461475101

which is good. For any new document that comes in, is there a way to get the top n terms with the highest tfidf score?

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:43:17+0000

You have to do a little bit of a song and dance to get the matrices as numpy arrays instead, but this should do what you're looking for:

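import numpy as np

# scikit-learn >= 1.0; older versions use tfidf.get_feature_names()
feature_array = np.array(tfidf.get_feature_names_out())
# indices of the scores, sorted from highest to lowest
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]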

This gives me:

array([u'fruit', u'travellers', u'jupiter'], 
  dtype='<U13')

The argsort call is really the useful one (see the NumPy documentation for numpy.argsort). We have to do [::-1] because argsort only supports sorting from small to large. We call flatten to reduce the dimensions to 1d so that the sorted indices can be used to index the 1d feature array. Note that including the call to flatten will only work if you're testing one document at a time.
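If you need to handle several documents at once, one option is to drop flatten and sort along each row instead. The following is only a sketch under assumptions: top_n_terms is a hypothetical helper name, it expects a dense 2d array such as the result of .toarray(), and the scores in the demo are made up for illustration.

```python
import numpy as np

def top_n_terms(scores, feature_names, n=3):
    """Return the n highest-scoring terms for each row (document) of scores.

    scores is a dense 2d array of shape (n_docs, n_features), for example
    the output of a TfidfVectorizer after calling .toarray() on it.
    """
    scores = np.asarray(scores)
    feature_names = np.asarray(feature_names)
    # argsort sorts ascending along each row, so reverse the columns
    # to get descending order, then keep the first n column indices
    order = np.argsort(scores, axis=1)[:, ::-1][:, :n]
    # Fancy indexing with a 2d index array gives one row of terms per document
    return feature_names[order]

# Made-up scores for illustration only (2 documents, 4 terms)
names = ["fruit", "jupiter", "travellers", "tree"]
demo = np.array([
    [0.1, 0.7, 0.0, 0.5],
    [0.9, 0.0, 0.3, 0.2],
])
result = top_n_terms(demo, names, n=2)
print(result)
```

With the objects from the answer, this could be called as top_n_terms(response.toarray(), feature_array, n=3). If a document has fewer than n non-zero scores, zero-score terms fill the remaining slots, so you may want to filter those out.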

Also, on another note, did you mean something like tfs = tfidf.fit_transform(t.split("\n\n"))? Otherwise, each term in the multiline string is being treated as a "document". Using "\n\n" instead means that we are actually looking at 4 documents (one for each line), which makes more sense when you think about tfidf.
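To illustrate the point about what counts as a "document", here is a small sketch using a made-up three-paragraph string (not the text from the question). Splitting on single spaces produces one tiny document per word, while splitting on blank lines produces one document per paragraph.

```python
# Made-up sample text for illustration only
text = """First paragraph about a plane tree.

Second paragraph about fruit.

Third paragraph about Jupiter."""

# Splitting on spaces: every word (plus stray newlines) becomes its own document
by_space = text.split(" ")

# Splitting on blank lines: every paragraph becomes one document
by_paragraph = text.split("\n\n")

print(len(by_space))
print(len(by_paragraph))
print(by_paragraph)
```

Either list can be passed to fit_transform, but the idf part of tf-idf only becomes meaningful when each document contains more than a single word.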
