I am having trouble with a for loop inside a function. I am calculating cosine distances for a list of word vectors: for each vector, I calculate the cosine distance and then append it as a new column to a pandas DataFrame. The complication is that there are several models, so I am comparing a word's vector from model 1 with the vector for that word in every other model.
This means that some words are not present in all models. In that case, I catch the KeyError so the loop can move on without raising, and I ask for a 0 to be added to the DataFrame instead. This is causing duplicated index labels, and I am stuck on how to move forward from here. The code is as follows:
from scipy.spatial.distance import cosine
import pandas as pd

def cosines(model1, model2, model3, model4, model5, model6, model7, words):
    df = pd.DataFrame()
    model = [model2, model3, model4, model5, model6, model7]
    for i in model:
        for j in words:
            try:
                # 1 - cosine distance, i.e. the cosine similarity
                cos = 1 - cosine(model1.wv[j], i.wv[j])
                print(f'cosine for model1 vs {i.name}: {cos}')
                tempdf = pd.DataFrame([cos], columns=[f'{j}'], index=[f'{i.name}'])
                #print(tempdf)
                df = pd.concat([df, tempdf], axis=0)
            except KeyError:
                # word missing from one of the two models: record 0 instead
                print(f'word not present at {i.name}')
                ke_tempdf = pd.DataFrame([0], columns=[f'{j}'], index=[f'{i.name}'])
                df = pd.concat([df, ke_tempdf], axis=0)
    return df
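The concat pattern used in the function can also be tried in isolation, without any trained models. The sketch below is only an illustration: the (model, word, value) triples and the name `values` are invented, not taken from the real models. It suggests that each `pd.concat(..., axis=0)` call stacks a new row underneath the frame, whether or not that index label already exists.

```python
import pandas as pd

# Invented (model name, word, value) triples standing in for real similarities.
values = [
    ("model2", "word1", 0.25),
    ("model2", "word2", 0.68),
    ("model3", "word1", 0.0),   # pretend word1 is missing from model3
    ("model3", "word2", 0.66),
]

df = pd.DataFrame()
for model_name, word, value in values:
    tempdf = pd.DataFrame([value], columns=[word], index=[model_name])
    # axis=0 stacks tempdf below df, so every call adds one more row,
    # even when the index label is already present.
    df = pd.concat([df, tempdf], axis=0)

print(df)
print(df.index.duplicated().any())
```

With four values the frame ends up with four rows rather than two, which matches the shape of the output from the real function.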
The function runs; however, for each KeyError, instead of adding a 0 to the existing row, it creates a new, duplicated row with the value 0. With two words this doubles the number of rows in the DataFrame, and the ultimate aim is to pass a list of many words. The resulting DataFrame is shown below:
           word1     word2
model1  0.000000       NaN
model1       NaN  0.761573
model2  0.000000       NaN
model2       NaN  0.000000
model3  0.000000       NaN
model3       NaN  0.000000
model4  0.245140       NaN
model4       NaN  0.680306
model5  0.090268       NaN
model5       NaN  0.662234
model6  0.000000       NaN
model6       NaN  0.709828
As you can see, for every word that isn't present, instead of adding a 0 to the existing model row (where the NaN is), it adds a new row containing 0. The first row should read: model1, 0, 0.76, and so on for the other models, instead of the duplicated rows. Any help is much appreciated, thank you!
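One possible way to get a single row per model is to collect the values in a nested dict and build the DataFrame once at the end, rather than concatenating inside the loop. The following is a minimal sketch under stated assumptions, not a drop-in replacement: `cosine_table`, `reference` and `others` are hypothetical names, and plain dicts mapping word to vector stand in for the models (with the real models the lookups would presumably be `model.wv[word]`, as in the function above).

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine


def cosine_table(reference, others, words):
    """Return one row per model and one column per word.

    reference: dict mapping word -> vector (stand-in for model1).
    others: dict mapping model name -> (dict mapping word -> vector).
    """
    rows = {}
    for name, vectors in others.items():
        row = {}
        for word in words:
            try:
                # 1 - cosine distance, i.e. the cosine similarity
                row[word] = 1 - cosine(reference[word], vectors[word])
            except KeyError:
                # Word missing from either model: record 0 in the same row.
                row[word] = 0.0
        rows[name] = row
    # Build the frame once, so each model name appears once in the index.
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy vectors, invented purely for illustration.
reference = {"word1": np.array([1.0, 0.0]), "word2": np.array([1.0, 1.0])}
others = {
    "model2": {"word1": np.array([1.0, 0.0]), "word2": np.array([0.0, 1.0])},
    "model3": {"word2": np.array([1.0, 1.0])},  # "word1" is missing here
}
table = cosine_table(reference, others, ["word1", "word2"])
print(table)
```

Because each model's values are gathered into one dict before the frame is created, a missing word fills a cell in that model's row instead of producing an extra row.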
question from:
https://stackoverflow.com/questions/65941350/duplicated-rows-in-pandas-append-inside-for-loop