The provided recurrent layers inherit from a base implementation, `keras.layers.Recurrent`, which includes the option `return_sequences`, defaulting to `False`. This means that by default, recurrent layers will consume variable-length inputs and ultimately produce only the layer's output at the final sequential step.
As a result, there is no problem using `None` to specify a variable-length input sequence dimension.
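For example, here is a minimal sketch of a model that accepts variable-length sequences and returns only the final step's output (the per-step feature size of 8 and the layer width of 32 are arbitrary choices for illustration):

```python
from keras.models import Model
from keras.layers import Input, LSTM

# None marks the timestep dimension as variable-length; the per-step
# feature size of 8 is a hypothetical choice.
inputs = Input(shape=(None, 8))

# return_sequences defaults to False, so only the final step's output
# is returned: shape (batch size, 32).
outputs = LSTM(32)(inputs)

model = Model(inputs, outputs)
print(model.output_shape)  # -> (None, 32); the batch dimension is also unknown
```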
However, if you wanted the layer to return the full sequence of outputs, i.e. the tensor of outputs for each step of the input sequence, then you'd have to deal further with the variable size of that output.
You could do this by having the next layer accept a variable-sized input as well, punting on the problem until later in your network, when eventually you either must calculate a loss function from some variable-length tensor, or else calculate some fixed-length representation before continuing on to later layers, depending on your model.
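A minimal sketch of that first option, again assuming a hypothetical per-step feature size of 8: the first LSTM passes a variable-length sequence of outputs to a second, which finally collapses it to a fixed size.

```python
from keras.models import Model
from keras.layers import Input, LSTM

inputs = Input(shape=(None, 8))

# return_sequences=True keeps one 32-d output per step: (batch, None, 32)
x = LSTM(32, return_sequences=True)(inputs)

# the second LSTM consumes the variable-length sequence and collapses
# it to a fixed-length representation: (batch, 16)
x = LSTM(16)(x)

model = Model(inputs, x)
```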
Or you could do it by requiring fixed-length sequences, padding the end of shorter sequences with a special sentinel value that merely marks an empty sequence item, purely to pad out the length.
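Keras ships a utility for that second option, `keras.preprocessing.sequence.pad_sequences` (the token values below are made up for illustration):

```python
from keras.preprocessing.sequence import pad_sequences

sequences = [[34, 27, 5], [12, 9], [7, 41, 2, 88]]

# pad the end of each sequence with the sentinel value 0,
# up to a common length of 4
padded = pad_sequences(sequences, maxlen=4, padding='post', value=0)
# padded is now a rectangular (3, 4) array:
# [[34 27  5  0]
#  [12  9  0  0]
#  [ 7 41  2 88]]
```

Pairing this with `Embedding(..., mask_zero=True)` makes masking-aware layers such as `LSTM` ignore the padded steps.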
Separately, the `Embedding` layer is a very special layer that is built to handle variable-length inputs as well. The output will have a different embedding vector for each token of the input sequence, so the shape will be (batch size, sequence length, embedding dimension). Since the next layer is an LSTM, this is no problem; it will happily consume variable-length sequences as well.
But as mentioned in the documentation on `Embedding`:
> input_length: Length of input sequences, when it is constant.
> This argument is required if you are going to connect
> `Flatten` then `Dense` layers upstream
> (without it, the shape of the dense outputs cannot be computed).
If you want to go directly from `Embedding` to a fixed-length representation, then you must supply the fixed sequence length as part of the layer.
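A minimal sketch of that, with hypothetical sizes (a vocabulary of 10000, 64-d embeddings, sequences fixed at length 10):

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential([
    # input_length=10 fixes the sequence dimension, so Flatten's output
    # size (10 * 64 = 640) can be computed when the model is built
    Embedding(input_dim=10000, output_dim=64, input_length=10),
    Flatten(),
    Dense(1, activation='sigmoid'),
])
model.summary()
```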
Finally, note that when you express the dimensionality of the LSTM layer, such as `LSTM(32)`, you are describing the dimensionality of the output space of that layer.
```
# example sequence of input, e.g. batch size is 1
[
  [34],
  [27],
  ...
]

--> # feed into embedding layer

[
  [64-d representation of token 34 ...],
  [64-d representation of token 27 ...],
  ...
]

--> # feed into LSTM layer

[32-d output vector of the final sequence step of LSTM]
```
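That pipeline, written as an actual model (the vocabulary size of 10000 is a hypothetical choice):

```python
from keras.models import Model
from keras.layers import Input, Embedding, LSTM

inputs = Input(shape=(None,), dtype='int32')  # (batch, sequence length), both variable
x = Embedding(10000, 64)(inputs)              # (batch, sequence length, 64)
x = LSTM(32)(x)                               # (batch, 32): final step only
model = Model(inputs, x)
```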
In order to avoid the inefficiency of a batch size of 1, one tactic is to sort your input training data by the sequence length of each example, and then group examples of a common sequence length into batches, such as with a custom Keras data generator.
This has the advantage of allowing large batch sizes, which helps if your model needs something like batch normalization or involves GPU-intensive training, and even just gives a less noisy estimate of the gradient for batch updates, while still letting you work with a training data set whose examples have different sequence lengths.
More importantly though, it also has the big advantage that you do not have to manage any padding to ensure common sequence lengths in the input.
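A sketch of such a generator, built on `keras.utils.Sequence` (the data layout, a list of 1-D integer arrays with aligned labels, is an assumption for illustration):

```python
import numpy as np
from keras.utils import Sequence

class LengthBucketedGenerator(Sequence):
    """Batches together examples of equal sequence length, so no
    padding is needed within a batch."""

    def __init__(self, sequences, labels, batch_size=32):
        # group example indices by their sequence length
        buckets = {}
        for i, seq in enumerate(sequences):
            buckets.setdefault(len(seq), []).append(i)
        # split each bucket into fixed-size batches
        # (shuffling between epochs is omitted for brevity)
        self.batches = []
        for indices in buckets.values():
            for start in range(0, len(indices), batch_size):
                self.batches.append(indices[start:start + batch_size])
        self.sequences = sequences
        self.labels = labels

    def __len__(self):
        return len(self.batches)

    def __getitem__(self, idx):
        indices = self.batches[idx]
        # all sequences in a batch share one length, so stacking
        # produces a rectangular array with no padding
        x = np.array([self.sequences[i] for i in indices])
        y = np.array([self.labels[i] for i in indices])
        return x, y
```

Each batch then contains only sequences of one common length; pass the generator to `model.fit_generator` (or `model.fit` in newer Keras versions).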