
tensorflow - Using a Gensim FastText model with an LSTM NN in Keras

I have trained a FastText model with Gensim over a corpus of very short sentences (up to 10 words). I know that my test set includes words that are not in my training corpus, i.e. some of the words in my corpus are like "Oxytocin", "Lexitocin", "Ematrophin", "Betaxitocin".

Given a new word in the test set, FastText does a good job of generating a vector with high cosine similarity to similar words in the training set, by using character-level n-grams.
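For example, something like the following (a rough sketch, assuming the trained Gensim model is loaded as ft; the word pair is only an illustration):

# "betaxitocin" is not in the training vocabulary, yet FastText composes a
# vector for it from character n-grams that lands close to "oxytocin"
oov_vector = ft.wv["betaxitocin"]                     # lookup works even for OOV words
print(ft.wv.similarity("oxytocin", "betaxitocin"))    # high cosine similarity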

How do I incorporate the FastText model inside an LSTM Keras network without reducing the FastText model to just a list of vectors for the in-vocabulary words? Because then I won't handle any OOV words, even though FastText handles them well.

Any idea?


1 Reply


Here is the procedure to incorporate the FastText model inside an LSTM Keras network:

# imports (assuming TensorFlow 2.x and gensim)
import numpy as np
import tensorflow as tf
from gensim.models import FastText
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

# define dummy data and preprocess it

docs = ['Well done',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent',
        'Weak',
        'Poor effort',
        'not good',
        'poor work',
        'Could have done better']

docs = [d.lower().split() for d in docs]

# train FastText with the gensim API
# (note: in gensim >= 4.0 the `size` argument is named `vector_size`)

ft = FastText(size=10, window=2, min_count=1, seed=33)
ft.build_vocab(docs)
ft.train(docs, total_examples=ft.corpus_count, epochs=10)

# prepare text for keras neural network

max_len = 8

tokenizer = tf.keras.preprocessing.text.Tokenizer(lower=True)
tokenizer.fit_on_texts(docs)

sequence_docs = tokenizer.texts_to_sequences(docs)
sequence_docs = tf.keras.preprocessing.sequence.pad_sequences(sequence_docs, maxlen=max_len)

# extract the FastText-learned embeddings and put them in a numpy array;
# rows for words without a FastText vector keep their random initialisation

embedding_matrix_ft = np.random.random((len(tokenizer.word_index) + 1, ft.vector_size))

pas = 0  # count of words for which no FastText vector was found
for word, i in tokenizer.word_index.items():
    try:
        embedding_matrix_ft[i] = ft.wv[word]
    except KeyError:
        pas += 1

# define a keras model and load the pretrained fasttext weights matrix

inp = Input(shape=(max_len,))
emb = Embedding(len(tokenizer.word_index) + 1, ft.vector_size, 
                weights=[embedding_matrix_ft], trainable=False)(inp)
x = LSTM(32)(emb)
out = Dense(1)(x)

model = Model(inp, out)

model.predict(sequence_docs)
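
The call above only runs a forward pass through LSTM/Dense layers whose weights are still randomly initialised. To actually train the network you would compile and fit it with labels; a minimal sketch, assuming made-up binary labels for the ten dummy sentences:

# hypothetical labels (1 = positive phrase, 0 = negative phrase), for illustration only
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# Dense(1) above has no activation, so treat its output as logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
model.fit(sequence_docs, labels, epochs=10, verbose=0)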

How to deal with unseen text:

unseen_docs = ['asdcs work','good nxsqa zajxa']
unseen_docs = [d.lower().split() for d in unseen_docs]

sequence_unseen_docs = tokenizer.texts_to_sequences(unseen_docs)
sequence_unseen_docs = tf.keras.preprocessing.sequence.pad_sequences(sequence_unseen_docs, maxlen=max_len)

model.predict(sequence_unseen_docs)
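
As a sanity check on the OOV behaviour raised in the question, the gensim FastText model itself can still build vectors for those unseen tokens from their character n-grams, even though they never appeared in its training corpus (illustration only):

# 'asdcs' was never seen during FastText training, yet a vector is composed from its n-grams
print(ft.wv["asdcs"].shape)                 # (10,) with the vector size used above
print(ft.wv.similarity("asdcs", "work"))    # cosine similarity to an in-vocabulary word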
