I am trying to implement Gensim's most_similar
function by hand but calculate the similarity between the query word and just one other word (avoiding the time to calculate it for the query word with all other words). So far I use
cossim = (np.dot(a, b)
/ np.linalg.norm(a)
/ np.linalg.norm(b))
and this is the same as the similarity result between a
and b
. I find this works almost exactly but that some precision is lost, for example
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api
model_gigaword = api.load("glove-wiki-gigaword-300")
a = 'france'
b = 'chirac'
cossim1 = model_gigaword.most_similar(a)
import numpy as np
cossim2 = (np.dot(model_gigaword[a], model_gigaword[b])
/ np.linalg.norm(model_gigaword[a])
/ np.linalg.norm(model_gigaword[b]))
print(cossim1)
print(cossim2)
Output:
[('french', 0.7344760894775391), ('paris', 0.6580672264099121), ('belgium', 0.620672345161438), ('spain', 0.573593258857727), ('italy', 0.5643460154533386), ('germany', 0.5567398071289062), ('prohertrib', 0.5564222931861877), ('britain', 0.5553334355354309), ('chirac', 0.5362644195556641), ('switzerland', 0.5320892333984375)]
0.53626436
So the most_similar function gives 0.53626441955... (rounds to 0.53626442) and the calculation with numpy gives 0.53626436. Similarly, you can see differences between the values for 'paris' and 'italy' (in similarity compared to 'france'). These differences suggest that the calculation is not being done to full precision (but it is in Gensim). How can I fix it and get the output for a single similarity to higher precision, exactly as it comes from most_similar?
TL/DR - I want to use function('france', 'chirac') and get 0.5362644195556641, not 0.53626436.
Any idea what's going on?
UPDATE: I should clarify, I want to know and replicate how most_similar does the computation, but for only one (a,b) pair. That's my priority, rather than finding out how to improve the precision of my cossim calculation above. I just assumed the two were equivalent.