Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
475 views
in Technique[技术] by (71.8m points)

machine learning - TimeDistributed(Dense) vs Dense in Keras - Same number of parameters

I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.

My simplified model is the following:

InputSize = 15
MaxLen = 64
HiddenSize = 16

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)

The summary of the network is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536      
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15)            255       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0         
=================================================================

This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).

However, if I switch to a simple Dense layer:

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)

I still only have 255 parameters:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536      
_________________________________________________________________
dense_1 (Dense)              (None, 64, 15)            255       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0         
=================================================================

I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).

Update Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py it does seem that Dense uses the last dimension only to size itself:

def build(self, input_shape):
    assert len(input_shape) >= 2
    input_dim = input_shape[-1]

    self.kernel = self.add_weight(shape=(input_dim, self.units),

It also uses keras.dot to apply the weights:

def call(self, inputs):
    output = K.dot(inputs, self.kernel)

The docs of keras.dot imply that it works fine on n-dimensional tensors. I wonder if its exact behavior means that Dense() will in effect be called at every time step. If so, the question still remains what TimeDistributed() achieves in this case.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

TimeDistributedDense applies a same dense to every time step during GRU/LSTM Cell unrolling. So the error function will be between predicted label sequence and the actual label sequence. (Which is normally the requirement for sequence to sequence labeling problems).

However, with return_sequences=False, Dense layer is applied only once at the last cell. This is normally the case when RNNs are used for classification problem. If return_sequences=True then Dense layer is applied to every timestep just like TimeDistributedDense.

So for as per your models both are same, but if you change your second model to return_sequences=False, then Dense will be applied only at the last cell. Try changing it and the model will throw as error because then the Y will be of size [Batch_size, InputSize], it is no more a sequence to sequence but a full sequence to label problem.

from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed
from keras.layers.recurrent import GRU
import numpy as np

InputSize = 15
MaxLen = 64
HiddenSize = 16

OutputSize = 8
n_samples = 1000

model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')


model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

X = np.random.random([n_samples,MaxLen,InputSize])
Y1 = np.random.random([n_samples,MaxLen,OutputSize])
Y2 = np.random.random([n_samples, OutputSize])

model1.fit(X, Y1, batch_size=128, nb_epoch=1)
model2.fit(X, Y1, batch_size=128, nb_epoch=1)
model3.fit(X, Y2, batch_size=128, nb_epoch=1)

print(model1.summary())
print(model2.summary())
print(model3.summary())

In the above example architecture of model1 and model2 are sample (sequence to sequence models) and model3 is a full sequence to label model.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...