With Sequence Padding
There are two issues. You need to use pad_sequences on the text sequences first, and there is no such param as input_shape in SimpleRNN. Try the following code:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

max_features = 20000  # Only consider the top 20k words
maxlen = 200          # Only consider the first 200 words of each movie review
batch_size = 1

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), "Training sequences")
print(len(x_test), "Validation sequences")
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=32))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train, batch_size=batch_size,
epochs=10, validation_split=0.2)
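As a quick sanity check (assuming the standard IMDB split of 25,000 training and 25,000 test reviews), the padded arrays should now have a fixed second dimension:
print(x_train.shape)   # expected: (25000, 200)
print(x_test.shape)    # expected: (25000, 200)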
Here is the official code example; it might help you.
With Sequence Padding and Masking in the Embedding Layer
Based on your comments and the information you provided, it seems it is possible to use a variable-length input sequence; check this and this too. But still, in most cases practitioners prefer to pad the sequences to a uniform length, as it's more convenient. Choosing a non-uniform or variable input sequence length is something of a special case, similar to wanting variable input image sizes for vision models. However, here we will add info on padding and on how we can mask out the padded values at training time, which technically amounts to variable-length input training. Hope that convinces you. Let's first understand what pad_sequences does. In sequence data, it's very common for each training sample to have a different length. Consider the following inputs:
raw_inputs = [
[711, 632, 71],
[73, 8, 3215, 55, 927],
[83, 91, 1, 645, 1253, 927],
]
These 3 training samples have different lengths: 3, 5, and 6 respectively. What we do next is make them all the same length by adding some value (typically 0 or -1), either at the beginning or at the end of each sequence.
tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)
array([[ 0, 0, 0, 711, 632, 71],
[ 0, 73, 8, 3215, 55, 927],
[ 83, 91, 1, 645, 1253, 927]], dtype=int32)
We can set padding="post" to place the pad values at the end of the sequence. Keras recommends using "post" padding when working with RNN layers in order to be able to use the CuDNN implementation of those layers. Also, you may notice we set maxlen=6, which is the longest input sequence length here. It does not have to be the longest length, though, as that may get computationally expensive when the dataset grows. We can set it to 5, assuming our model can learn a feature representation within this length; it's a kind of hyper-parameter. And that brings in another parameter, truncating.
tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=5, dtype="int32", padding="pre", truncating="pre", value=0.0
)
array([[ 0, 0, 711, 632, 71],
[ 73, 8, 3215, 55, 927],
[ 91, 1, 645, 1253, 927]], dtype=int32)
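For comparison, the same call with padding="post" places the pad values at the end of each sequence; the expected result is shown as a comment:
tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, maxlen=6, dtype="int32", padding="post", value=0.0
)
# expected:
# array([[ 711,  632,   71,    0,    0,    0],
#        [  73,    8, 3215,   55,  927,    0],
#        [  83,   91,    1,  645, 1253,  927]], dtype=int32)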
Okay, now we have padded input sequences; all inputs have a uniform length. Now we can mask out those additional padded values at training time, telling the model that some part of the data is padding and should be ignored. That mechanism is masking. It's a way to tell sequence-processing layers that certain timesteps in the input are missing and should therefore be skipped when processing the data. There are three ways to introduce input masks in Keras models:
- Add a keras.layers.Masking layer (a short sketch follows this list).
- Configure a keras.layers.Embedding layer with mask_zero=True.
- Pass a mask argument manually when calling layers that support this argument (e.g. RNN layers).
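Here is a minimal sketch of the first option, the explicit Masking layer; the shapes and values are just illustrative, assuming the inputs are already-embedded float sequences in which padded timesteps are all-zero vectors:
import tensorflow as tf

inputs = tf.concat([tf.zeros((2, 2, 16)),   # two padded (all-zero) timesteps
                    tf.ones((2, 4, 16))],   # four "real" timesteps
                   axis=1)
masked = tf.keras.layers.Masking(mask_value=0.0)(inputs)
print(masked._keras_mask)
# expected: False for the first two timesteps of each sample, True elsewhere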
Here we will only show the approach of configuring the Embedding layer. It has a parameter called mask_zero, which is False by default. If we set it to True, then positions containing 0 in the sequences will be masked out. In the resulting mask, a False entry indicates that the corresponding timestep should be ignored during processing.
padd_input = tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)
print(padd_input)
embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padd_input)
print(masked_output._keras_mask)
[[ 0 0 0 711 632 71]
[ 0 73 8 3215 55 927]
[ 83 91 1 645 1253 927]]
tf.Tensor(
[[False False False True True True]
[False True True True True True]
[ True True True True True True]], shape=(3, 6), dtype=bool)
And here is how the mask is computed in the class Embedding(Layer):
def compute_mask(self, inputs, mask=None):
    if not self.mask_zero:
        return None
    return tf.not_equal(inputs, 0)
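For completeness, the third option from the list above, passing the mask manually, looks roughly like this (a minimal sketch re-using padd_input from above; the layer sizes are illustrative):
embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
rnn = tf.keras.layers.SimpleRNN(units=32)

embedded = embedding(padd_input)
mask = embedding.compute_mask(padd_input)   # same boolean tensor as shown above
output = rnn(embedded, mask=mask)           # padded timesteps are skipped
print(output.shape)                         # (3, 32)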
And here is one catch: if we set mask_zero to True, then, as a consequence, index 0 cannot be used in the vocabulary. According to the doc:
mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).
So we have to use at least max_features + 1 as the Embedding's input_dim. Here is a nice explanation on this.
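A quick way to convince yourself (a sketch, assuming the default imdb.load_data index conventions and re-using the x_train already in memory):
# With num_words=max_features, every word index stays below max_features, so
# reserving index 0 for padding needs input_dim = max_features + 1.
largest_index = max(max(seq) for seq in x_train)
print(largest_index, largest_index < max_features)   # expected: True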
Here is the complete example applying these points to your code.
# get the data
max_features = 20000  # Only consider the top 20k words
(x_train, y_train), (_, _) = imdb.load_data(num_words=max_features)
print(x_train.shape)

# check the highest sequence length
max_list_length = lambda seqs: max(len(i) for i in seqs)
print(max_list_length(x_train))

maxlen = 350      # Only consider the first 350 words out of `max_list_length(x_train)`
batch_size = 512

print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])
# (1) pad with value 0 at the end of the sequence - padding="post", value=0.
# (2) truncate to `maxlen` words
#     out of `max_list_length(x_train)` at the end - maxlen=maxlen, truncating="post"
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train,
maxlen=maxlen, dtype="int32",
padding="post", truncating="post",
value=0.)
print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])
Your model definition should now be:
model = Sequential()
model.add(Embedding(
input_dim=max_features + 1,
output_dim=32,
mask_zero=True))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train,
batch_size=256,
epochs=1, validation_split=0.2)
639ms/step - loss: 0.6774 - acc: 0.5640 - val_loss: 0.5034 - val_acc: 0.8036
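If you want to double-check that the mask actually propagates through the model, an optional sketch:
sample = x_train[:2]
print(model.layers[0](sample)._keras_mask)   # False at the padded (zero) positions
print(model.layers[1].supports_masking)      # True for the built-in Keras RNN layers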