
0 votes
4.3k views
in Technique by (71.8m points)

python - Word vector similarity precision

I am trying to implement Gensim's most_similar function by hand, but computing the similarity between the query word and just one other word (to avoid the cost of computing it against every other word in the vocabulary). So far I use

cossim = (np.dot(a, b)
          / np.linalg.norm(a)
          / np.linalg.norm(b))

and this gives the same value as Gensim's similarity result between a and b. I find this works almost exactly, but some precision is lost. For example:

import numpy as np
import gensim.downloader as api

model_gigaword = api.load("glove-wiki-gigaword-300")

a = 'france'
b = 'chirac'

cossim1 = model_gigaword.most_similar(a)
cossim2 = (np.dot(model_gigaword[a], model_gigaword[b])
           / np.linalg.norm(model_gigaword[a])
           / np.linalg.norm(model_gigaword[b]))
print(cossim1)
print(cossim2)

Output:

[('french', 0.7344760894775391), ('paris', 0.6580672264099121), ('belgium', 0.620672345161438), ('spain', 0.573593258857727), ('italy', 0.5643460154533386), ('germany', 0.5567398071289062), ('prohertrib', 0.5564222931861877), ('britain', 0.5553334355354309), ('chirac', 0.5362644195556641), ('switzerland', 0.5320892333984375)]
0.53626436

So most_similar gives 0.53626441955... (which rounds to 0.53626442), while the numpy calculation gives 0.53626436. You can see similar discrepancies for 'paris' and 'italy' (in their similarity to 'france'). These differences suggest that my calculation is not being done to full precision, whereas Gensim's is. How can I fix this and get the single-pair similarity to higher precision, exactly as it comes out of most_similar?

TL;DR: I want to call function('france', 'chirac') and get 0.5362644195556641, not 0.53626436.

Any idea what's going on?


UPDATE: I should clarify: I want to know and replicate how most_similar does the computation, but for only one (a, b) pair. That's my priority, rather than finding out how to improve the precision of my cossim calculation above. I had just assumed the two were equivalent. See the sketch below.
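For reference, gensim already exposes a built-in single-pair call, KeyedVectors.similarity, which avoids the full-vocabulary scan. A minimal sketch (the hand-rolled variant reflects my assumption that most_similar normalizes each float32 vector before taking the dot product; I have not verified this against gensim's source):

import numpy as np
import gensim.downloader as api

model_gigaword = api.load("glove-wiki-gigaword-300")

# Built-in single-pair cosine similarity: no scan over the vocabulary.
print(model_gigaword.similarity('france', 'chirac'))

# Hand-rolled variant: normalize each float32 vector first, then dot.
# (Assumption: this matches the normalize-then-dot order used internally.)
va = model_gigaword['france'] / np.linalg.norm(model_gigaword['france'])
vb = model_gigaword['chirac'] / np.linalg.norm(model_gigaword['chirac'])
print(np.dot(va, vb))

If the normalize-first variant matches most_similar's digits on your machine, that would confirm the order of operations is what matters here.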



1 Reply

0 votes
by (71.8m points)

To increase accuracy, you can try the following:

import numpy as np

# Upcast the model's float32 vectors to a higher-precision dtype before
# computing. Note that np.float128 is platform-dependent: it is typically
# an 80-bit extended double on Linux/macOS and is unavailable on Windows.
a = np.array(model_gigaword[a]).astype('float128')
b = np.array(model_gigaword[b]).astype('float128')
cossim = (np.dot(a, b)
          / np.linalg.norm(a)
          / np.linalg.norm(b))

The vectors are likely stored as lower-precision floats (float32), and hence precision is lost in the calculations.
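To see the size of the effect in isolation, here is a small self-contained demo with random vectors (the exact digits will vary by platform; it only illustrates how the dtype and the order of operations shift the last few bits):

import numpy as np

rng = np.random.default_rng(0)
a32 = rng.standard_normal(300).astype('float32')
b32 = rng.standard_normal(300).astype('float32')

# The same cosine computed two ways in float32: dividing after the dot
# product vs. normalizing each vector first. The two can differ in the
# last few bits.
c1 = np.dot(a32, b32) / np.linalg.norm(a32) / np.linalg.norm(b32)
c2 = np.dot(a32 / np.linalg.norm(a32), b32 / np.linalg.norm(b32))

# Upcasting to float64 before any arithmetic gives a reference value.
a64, b64 = a32.astype('float64'), b32.astype('float64')
c3 = np.dot(a64, b64) / (np.linalg.norm(a64) * np.linalg.norm(b64))

print(c1, c2, c3)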

However, the results I got are somewhat different from what model_gigaword.most_similar reports for you:

model_gigaword.similarity: 0.5362644
float64:  0.5362644263010196
float128: 0.53626442630101950744

You may want to check what you get on your machine and with your version of Python and gensim.

