I have searched a lot but still cannot understand this. As I understand it, TfidfVectorizer applies L2 normalization by default. This article explains the equation. I am using TfidfVectorizer on text written in Gujarati. The details of the output follow.
My two documents are:
??? ???? ??? ??
??? ????? ??
The code I am using is:
vectorizer = TfidfVectorizer(tokenizer=tokenize_words, sublinear_tf=True, use_idf=True, smooth_idf=False)
Here, tokenize_words is my function for tokenizing words.
The list of TF-IDF of my data is:
[[ 0.6088451 0.35959372 0.35959372 0.6088451 0. ]
[ 0. 0.45329466 0.45329466 0. 0.76749457]]
The list of features:
['???', '???', '??.', '????', '?????']
The value of idf:
{'????': 1.6931471805599454, '??.': 1.0, '???': 1.6931471805599454, '?????': 1.6931471805599454, '???': 1.0}
Please explain, using this example, what the term frequency of each term is in each of my two documents.
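For reference, here is a minimal hand computation that seems to reproduce the numbers above. It is only a sketch under assumptions: the raw count matrix is my guess (each term occurring exactly once in every document that contains it, with columns in the same order as the feature list), the idf is taken as ln(n / df) + 1 for smooth_idf=False, the sublinear term frequency as 1 + ln(count), and each row is divided by its L2 norm at the end. The names counts, idf and tfidf are illustrative only.

```python
import math

# Assumed raw term counts (not given above): rows are documents,
# columns follow the order of the feature list.
counts = [
    [1, 1, 1, 1, 0],
    [0, 1, 1, 0, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# idf without smoothing (smooth_idf=False): ln(n / df) + 1
idf = [math.log(n_docs / d) + 1.0 for d in df]

def sublinear_tf(count):
    # sublinear_tf=True replaces a non-zero count by 1 + ln(count)
    return 1.0 + math.log(count) if count > 0 else 0.0

tfidf = []
for row in counts:
    weighted = [sublinear_tf(c) * w for c, w in zip(row, idf)]
    # L2 normalization: divide the row by its Euclidean length
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 8) for v in row])
```

If those assumptions hold, the term frequency of every term is 1 in each document that contains it (1 + ln(1) = 1 after sublinear scaling), and the differing values in the matrix come only from the idf weights and the per-row normalization.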