Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
409 views
in Technique[技术] by (71.8m points)

python - How to remove a word completely from a Word2Vec model in gensim?

Given a model, e.g.

from gensim.models.word2vec import Word2Vec


documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]

texts = [d.lower().split() for d in documents]

w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)

It's possible to remove the word from the w2v vocabulary, e.g.

# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433  0.08862179  0.08601206  0.05281207 -0.00673626]

>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)

# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]

# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"

But when we do a similarity on other words after deleting graph, we see the word graph popping up, e.g.

>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]

How to remove a word completely from a Word2Vec model in gensim?


Updated

To answer @vumaasha's comment:

could you give some details as to why you want to delete a word

  • Lets say my universe of words in all words in the corpus to learn the dense relations between all words.

  • But when I want to generate the similar words, it should only come from a subset of domain specific word.

  • It's possible to generate more than enough from .most_similar() then filter the words but lets say the space of the specific domain is small, I might be looking for a word that's ranked 1000th most similar which is inefficient.

  • It would be better if the word is totally removed from the word vectors then the .most_similar() words won't return words outside of the specific domain.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I wrote a function which removes words from KeyedVectors which aren't in a predefined word list.

def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    w2v.vectors = new_vectors
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    w2v.vectors_norm = new_vectors_norm

It rewrites all of the variables which are related to the words based on the Word2VecKeyedVectors.

Usage:

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")

[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...