I have a model that I've trained for 40 epochs. I kept checkpoints for each epoch, and I have also saved the model with model.save(). The code for training is:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.callbacks import ModelCheckpoint

n_units = 1000
model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
# define the checkpoint
filepath="word2vec-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(x, y, epochs=40, batch_size=50, callbacks=callbacks_list)
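After training finishes, I also save the full model with model.save() (the filename here is just an example name):

model.save('word2vec_final.h5')  # stores architecture, weights and optimizer state in one file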
However, when I load the model and try training it again, it starts all over as if it hadn't been trained before: the loss doesn't continue from where the last training left off.
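To be explicit, the way I try to resume is essentially this (a minimal sketch; I load one of the checkpoint files written by ModelCheckpoint, which are full models since save_weights_only defaults to False):

from keras.models import load_model

model = load_model('word2vec-39-0.0027.hdf5')  # checkpoint written by ModelCheckpoint
model.fit(x, y, epochs=40, batch_size=50)      # loss restarts near its untrained value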
What confuses me is that when I redefine the model structure and load the weights with load_weights(), model.predict() works well. Thus, I believe the model weights are loaded:
model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
filename = "word2vec-39-0.0027.hdf5"
model.load_weights(filename)
model.compile(loss='mean_squared_error', optimizer='adam')
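As a sanity check that the weights really are loaded, a quick prediction on a few training samples (a sketch, assuming x is the same array used for training) gives sensible outputs:

preds = model.predict(x[:5])
print(preds.shape)  # (5, vec_size): one output vector per input sequence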
However, when I continue training with this, the loss is as high as it was at the initial stage:
filepath="word2vec-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(x, y, epochs=40, batch_size=50, callbacks=callbacks_list)
I searched and found some examples of saving and loading models here and here. However, none of them work.
Update 1
I looked at this question, tried it and it works:
model.save('partly_trained.h5')
del model
model = load_model('partly_trained.h5')
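In the same Python session this really does resume: if I call fit again on the re-loaded model, the loss continues from its previous value (sketch, epoch count arbitrary):

model.fit(x, y, epochs=1, batch_size=50)  # loss picks up where the previous run left off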
But when I close Python, reopen it, and run load_model again, it fails: the loss is as high as it was in the initial state.
Update 2
I tried Yu-Yang's example code and it works. However, when I use my own code again, it still fails.
This is the result from the original training. The second epoch should start with loss = 3.1***:
13700/13846 [============================>.] - ETA: 0s - loss: 3.0519
13750/13846 [============================>.] - ETA: 0s - loss: 3.0511
13800/13846 [============================>.] - ETA: 0s - loss: 3.0512Epoch 00000: loss improved from inf to 3.05101, saving model to LPT-00-3.0510.h5
13846/13846 [==============================] - 81s - loss: 3.0510
Epoch 2/60
50/13846 [..............................] - ETA: 80s - loss: 3.1754
100/13846 [..............................] - ETA: 78s - loss: 3.1174
150/13846 [..............................] - ETA: 78s - loss: 3.0745
I closed Python, reopened it, loaded the model with model = load_model("LPT-00-3.0510.h5"), and then trained with:
filepath="LPT-{epoch:02d}-{loss:.4f}.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(x, y, epochs=60, batch_size=50, callbacks=callbacks_list)
The loss starts at 4.54:
Epoch 1/60
50/13846 [..............................] - ETA: 162s - loss: 4.5451
100/13846 [..............................] - ETA: 113s - loss: 4.3835
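One extra check I can run is to evaluate the loaded model on the same data before calling fit; if the weights really carried over, this should report a loss of roughly 3.05 (a sketch):

print(model.evaluate(x, y, batch_size=50))  # expected ≈ 3.05 if the loaded weights are intact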