Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python 3.x - How is term frequency calculated in TfidfVectorizer?

I have searched a lot to understand this but have not been able to. I understand that by default TfidfVectorizer applies l2 normalization to the term frequency. This article explains the equation for it. I am using TfidfVectorizer on text written in Gujarati. The details of the output are as follows:

My two documents are:

??? ???? ??? ??

??? ????? ??

The code I am using is:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(tokenizer=tokenize_words, sublinear_tf=True, use_idf=True, smooth_idf=False)

Here, tokenize_words is my function for tokenizing words. The TF-IDF matrix for my data is:

[[ 0.6088451   0.35959372  0.35959372  0.6088451   0.        ]
 [ 0.          0.45329466  0.45329466  0.          0.76749457]]

The list of features:

['???', '???', '??.', '????', '?????']

The value of idf:

{'????': 1.6931471805599454, '??.': 1.0, '???': 1.6931471805599454, '?????': 1.6931471805599454, '???': 1.0}

Please explain, using this example, what the term frequency of each term is in both of my documents.



1 Reply


OK, now let's go through the documentation I linked in the comments, step by step:

Documents:

`??? ???? ??? ??
 ??? ????? ??`
  1. Get all unique terms (features): ['???', '???', '??.', '????', '?????']
  2. Calculate the frequency of each term in each document:

    a. Each term in document1 [??? ???? ??? ??] occurs once, and ????? does not occur.

    b. So the term frequency vector (sorted according to features): [1 1 1 1 0]

    c. Applying steps a and b to document2, we get [0 1 1 0 1]

    d. So our final term-frequency matrix is [[1 1 1 1 0], [0 1 1 0 1]]

    Note: This is the term frequency you asked about.

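  3. Now find the IDF (this is computed per feature, not per document):

    idf(term) = log(number of documents / number of documents containing this term) + 1

    Here log is the natural logarithm. The 1 added outside the logarithm keeps terms that occur in every document from getting an idf of 0 and being ignored entirely. It is not what the "smooth_idf" parameter controls: smooth_idf (True by default) instead adds 1 to both the numerator and the denominator inside the logarithm, which prevents zero divisions. Your code sets smooth_idf=False, so the formula above is the one that applies.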

    idf('???') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
    
    idf('???') = log(2/2)+1 = 0 + 1 = 1
    
    idf('??.') = log(2/2)+1 = 0 + 1 = 1
    
    idf('????') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
    
    idf('?????') = log(2/1)+1 = 0.69314.. + 1 = 1.69314..
    

    Note: These match the idf values you showed in the question.

  4. Now calculate TF-IDF (again per document, following the sorted order of the features):

    a. For document1:

     For '???', tf-idf = tf('???') x idf('???') = 1 x 1.69314 = 1.69314

     For '???', tf-idf = tf('???') x idf('???') = 1 x 1 = 1

     For '??.', tf-idf = tf('??.') x idf('??.') = 1 x 1 = 1

     For '????', tf-idf = tf('????') x idf('????') = 1 x 1.69314 = 1.69314

     For '?????', tf-idf = tf('?????') x idf('?????') = 0 x 1.69314 = 0
    

    So for document1, the final tf-idf vector is [1.69314 1 1 1.69314 0]

    b. Now apply l2 (Euclidean) normalization:

    divisor = sqrt(sqr(1.69314) + sqr(1) + sqr(1) + sqr(1.69314) + sqr(0))
            = sqrt(2.8667230596 + 1 + 1 + 2.8667230596 + 0)
            = sqrt(7.7334461192)
            = 2.7809074272977876...

    Dividing each element of the tf-idf vector by the divisor, we get approximately:

    [0.60884 0.35959 0.35959 0.60884 0]

    Note: This is the tf-idf of the first document you posted in the question. The trailing digits differ slightly only because idf was truncated to 1.69314 in this calculation.

    c. Repeating steps a and b for document 2, we get:

    [ 0. 0.453294 0.453294 0. 0.767494]
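The four steps above can be sketched in plain Python. This is only the same arithmetic, not scikit-learn's actual implementation. The tokens t0 to t4 are hypothetical placeholders for the five features in sorted order (the original words are not reproduced here), and the idf uses the smooth_idf=False form, ln(n/df) + 1, to match the vectorizer settings in the question.

```python
import math
from collections import Counter

# Hypothetical placeholder tokens: t0..t4 stand in for the five features
# in sorted order, since the original words are not available here.
docs = [
    ["t0", "t1", "t2", "t3"],  # document 1
    ["t1", "t2", "t4"],        # document 2
]

# Step 1: unique terms (features), sorted
features = sorted({term for doc in docs for term in doc})

# Step 2: raw term counts per document, in feature order
tf = [[Counter(doc)[term] for term in features] for doc in docs]

# Step 3: idf = ln(n / df) + 1, the smooth_idf=False form
n_docs = len(docs)
df = [sum(1 for doc in docs if term in doc) for term in features]
idf = [math.log(n_docs / d) + 1 for d in df]


def l2_normalize(row):
    """Divide a row by its Euclidean length (an all-zero row is returned as is)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row


# Step 4: multiply tf by idf, then l2-normalize each document row
tfidf = [l2_normalize([t * w for t, w in zip(row, idf)]) for row in tf]

for row in tfidf:
    print([round(x, 8) for x in row])
```

Because the idf is not truncated here, the printed rows should agree with the matrix in the question to its printed precision.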

Update: About sublinear_tf = True OR False

Your original term frequency vector is [[1 1 1 1 0], [0 1 1 0 1]], and you are correct that sublinear_tf = True changes how the term frequency is computed:

new_tf = 1 + log(tf)

This transformation is applied only to the non-zero elements of the term-frequency vector, because log(0) is undefined.

All of your non-zero entries are 1, and log(1) is 0, so 1 + log(1) = 1 + 0 = 1.

The values therefore remain unchanged for elements with value 1, so new_tf = [[1 1 1 1 0], [0 1 1 0 1]] = tf (original).
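As a quick check of that claim, here is a minimal sketch of the transformation. The function name sublinear is made up for illustration and is not a scikit-learn function.

```python
import math


def sublinear(tf_row):
    """Replace each non-zero count tf with 1 + ln(tf); zeros stay zero."""
    return [1 + math.log(tf) if tf > 0 else 0 for tf in tf_row]


tf = [[1, 1, 1, 1, 0], [0, 1, 1, 0, 1]]
new_tf = [sublinear(row) for row in tf]

# Every non-zero entry is 1 and 1 + ln(1) = 1, so nothing changes.
print(new_tf == tf)  # True
```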

So sublinear_tf is applied to your term frequencies, but for this data it leaves them unchanged.

Hence all the calculations above stay the same, and the output is identical whether you use sublinear_tf=True or sublinear_tf=False.

If you change your documents so that the term-frequency vector contains elements other than 0 and 1, you will see a difference when using sublinear_tf.
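To illustrate, here is a small sketch with made-up counts in which one term occurs twice. The counts and the helper name tfidf_row are hypothetical, and the idf is fixed at 1 for every term so that only the effect of the term-frequency scaling is visible.

```python
import math


def tfidf_row(counts, idf, sublinear_tf):
    """tf * idf followed by l2 normalization, with optional 1 + ln(tf) scaling."""
    tf = [1 + math.log(c) if (sublinear_tf and c > 0) else c for c in counts]
    weighted = [t * w for t, w in zip(tf, idf)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted] if norm else weighted


counts = [2, 1, 0]       # hypothetical document: the first term occurs twice
idf = [1.0, 1.0, 1.0]    # assumed constant idf, to isolate the tf effect

plain = tfidf_row(counts, idf, sublinear_tf=False)
damped = tfidf_row(counts, idf, sublinear_tf=True)

print([round(x, 4) for x in plain])
print([round(x, 4) for x in damped])
```

With sublinear scaling the repeated term counts as 1 + ln(2) instead of 2, so its share of the normalized vector shrinks and the two rows differ.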

I hope this clears up your doubts.

