
nlp - Why are there rows with all values 0 in the embedding matrix?

I created word embedding vectors for sentiment analysis, but I'm not sure about the code I wrote. If you see any mistakes in how I build the Word2Vec model or the embedding matrix, please let me know.

import os
import gensim
import numpy as np

# `reviews` (list of review strings) and `word_index` (the tokenizer's
# word -> integer-index dict) are defined earlier and not shown here.
EMBEDDING_DIM = 100
review_lines = [sub.split() for sub in reviews]

# gensim 3.x API; in gensim 4+ the `size` parameter is named `vector_size`
model = gensim.models.Word2Vec(sentences=review_lines, size=EMBEDDING_DIM,
                               window=6, workers=6, min_count=3, sg=1)
print('Words close to the given word:', model.wv.most_similar('film'))
words = list(model.wv.vocab)  # model.wv.key_to_index in gensim 4+
print('Words:', words)

file_name = 'embedding_word2vec.txt'
model.wv.save_word2vec_format(file_name, binary=False)

embeddings_index = {}
f = open(os.path.join('', file_name), encoding="utf-8")
next(f)  # skip the "<vocab_count> <vector_size>" header line of the format
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print("Number of word vectors found:", len(embeddings_index))

# Rows stay all-zero for any word that never gets a vector assigned below.
embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

OUTPUT:
array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.1029947 ,  0.07595579, -0.06583303, ...,  0.10382118,
        -0.56950015, -0.17402627],
       [ 0.13758609,  0.05489254,  0.0969701 , ...,  0.18532865,
        -0.49845088, -0.23407038],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
question from: https://stackoverflow.com/questions/65885615/why-are-there-rows-with-all-values-0-in-the-embedding-matrix


1 Reply


It's likely the zero rows are there because you initialized the embedding_matrix with all zeros, but then your loop didn't replace those zeros for every row.

If any of the words in word_index aren't in the embeddings_index dict you've built (or in the model before that), that would be the expected result.
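You can check this directly; a quick numpy sketch, counting the rows of the embedding_matrix built above that stayed all-zero:

n_zero_rows = int((~embedding_matrix.any(axis=1)).sum())
print(n_zero_rows, "rows of the embedding matrix are all zeros")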

Note that while the saved word-vector format isn't very complicated, you still don't need to write your own code to parse it back in. The KeyedVectors.load_word2vec_format() method will work for that, giving you an object that allows dict-like access to each vector, by its word key. (And, the vectors are stored in a dense array, so it's a bit more memory-efficient than a true dict with a separate ndarray vector as each value.)
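For example, a minimal sketch of reloading the file saved above (assuming the same 'embedding_word2vec.txt' path; 'film' is just an example key):

from gensim.models import KeyedVectors

# reload the plain-text vectors written by save_word2vec_format() above
wv = KeyedVectors.load_word2vec_format('embedding_word2vec.txt', binary=False)

print(wv['film'])    # dict-like lookup of one word's vector
print('film' in wv)  # membership test for a word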

There would still be the issue of your word_index listing words that weren't trained by the model. Perhaps they weren't in your training texts, or didn't appear at least min_count (default: 5) times, as required for the model to take notice of them. (You could consider lowering min_count, but note that it's usually a good idea to discard such very-rare words: they wouldn't have gotten very good vectors from so few examples, and even including such thinly-represented words can worsen surrounding words' vectors.)
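A quick way to see which words those are; a sketch using the wv object loaded above and the word_index from your code:

# words the tokenizer knows but the model never trained - these are
# exactly the words behind the all-zero rows
missing = [w for w in word_index if w not in wv]
print(len(missing), "of", len(word_index), "words have no trained vector")
print(missing[:20])  # peek at a few of them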

If you absolutely need vectors for words not in your training data, the FastText variant of the word2vec algorithm can, in languages where similar words often share similar character-runs, offer synthesized vectors for unknown words that are somewhat better than random/null vectors for most downstream applications. But you should really prefer to have adequate real examples of each interesting word's usage in varying contexts.
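A sketch of that, reusing the question's corpus and gensim 3.x parameter names (in gensim 4+, size becomes vector_size; 'unseenword' is just a placeholder):

from gensim.models import FastText

# trained like the Word2Vec model above, but with character n-grams, so it
# can synthesize a vector for a word never seen in training
ft_model = FastText(sentences=review_lines, size=EMBEDDING_DIM, window=6,
                    workers=6, min_count=3, sg=1)
print(ft_model.wv['unseenword'])  # built from the word's character n-grams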

