Gradients can be fetched w.r.t. weights or outputs - we'll be needing the latter. Further, for best results, an architecture-specific treatment is desired. The code & explanations below cover every possible case of a Keras/TF RNN, and should be easily adaptable to future API changes.
Completeness: the code shown is a simplified version - the full version can be found at my repository, See RNN (this post is included there w/ bigger images); included are:
- Greater visual customizability
- Docstrings explaining all functionality
- Support for Eager, Graph, TF1, TF2, and both `from keras` & `from tf.keras` imports
- Activations visualization
- Weights gradients visualization (coming soon)
- Weights visualization (coming soon)
I/O dimensionalities (all RNNs):
- Input: `(batch_size, timesteps, channels)` - or, equivalently, `(samples, timesteps, features)`
- Output: same as Input, except `channels`/`features` is now the # of RNN units, and:
  - `return_sequences=True` --> `timesteps_out = timesteps_in` (output a prediction for each input timestep)
  - `return_sequences=False` --> `timesteps_out = 1` (output a prediction only at the last timestep processed)
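For concreteness, here is a minimal sketch of a toy setup matching the examples below (16 samples, 6 units, layer_idx=1 & 2); the timestep/channel counts and the loss are assumptions, not requirements:
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

batch_size, timesteps, channels = 16, 100, 4   # assumed toy dimensions

ipt = Input(batch_shape=(batch_size, timesteps, channels))
x   = LSTM(6, return_sequences=True)(ipt)      # layer_idx=1 -> output (16, 100, 6)
x   = LSTM(6, return_sequences=False)(x)       # layer_idx=2 -> output (16, 6)
out = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)
model.compile('adam', 'binary_crossentropy')

x = np.random.randn(batch_size, timesteps, channels)
y = np.random.randint(0, 2, (batch_size, 1))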
Visualization methods:
- 1D plot grid: plot gradient vs. timesteps for each of the channels
- 2D heatmap: plot channels vs. timesteps w/ gradient intensity heatmap
- 0D aligned scatter: plot gradient for each channel per sample
- histogram: no good way to represent "vs. timesteps" relations
- One sample: do each of above for a single sample
- Entire batch: do each of above for all samples in a batch; requires careful treatment
# for below examples
grads = get_rnn_gradients(model, x, y, layer_idx=1) # return_sequences=True
grads = get_rnn_gradients(model, x, y, layer_idx=2) # return_sequences=False
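Per the I/O dimensionalities above, a quick shape sanity check (the example numbers follow the assumed toy setup):
print(grads.shape)
# return_sequences=True  (layer_idx=1): (batch_size, timesteps, units), e.g. (16, 100, 6)
# return_sequences=False (layer_idx=2): (batch_size, units),            e.g. (16, 6)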
EX 1: one sample, uni-LSTM, 6 units -- `return_sequences=True`, trained for 20 iterations
show_features_1D(grads[0], n_rows=2)
- Note: gradients are to be read right-to-left, as they're computed (from last timestep to first)
- Rightmost (latest) timesteps consistently have a higher gradient
- Vanishing gradient: ~75% of the leftmost timesteps have a near-zero gradient, indicating poor learning of time dependencies (a quick way to quantify this is sketched below)
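One way to put a number on the vanished fraction rather than eyeballing the plot - a sketch, with an arbitrary 1%-of-peak cutoff as the "zero" threshold:
g = np.abs(grads[0])                   # (timesteps, units), one sample
per_timestep = g.mean(axis=-1)         # mean |gradient| per timestep
threshold = .01 * per_timestep.max()   # "zero" = <1% of peak (assumed cutoff)
print("fraction of near-zero timesteps: {:.0%}".format(
      (per_timestep < threshold).mean()))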
EX 2: all (16) samples, uni-LSTM, 6 units -- `return_sequences=True`, trained for 20 iterations
show_features_1D(grads, n_rows=2)
show_features_2D(grads, n_rows=4, norm=(-.01, .01))
- Each sample shown in a different color (but same color per sample across channels)
- Some samples perform better than one shown above, but not by much
- The heatmap plots channels (y-axis) vs. timesteps (x-axis); blue=-0.01, red=0.01, white=0 (gradient values)
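For intuition on the color mapping, here's a minimal standalone sketch of the same idea (not the full show_features_2D): a diverging colormap clipped at the norm bounds, so blue = -0.01, white = 0, red = +0.01:
import matplotlib.pyplot as plt

sample = grads[0].T   # (units, timesteps): channels on y-axis, timesteps on x-axis
plt.imshow(sample, cmap='bwr', vmin=-.01, vmax=.01, aspect='auto')
plt.xlabel('timesteps'); plt.ylabel('channels')
plt.colorbar(); plt.show()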
EX 3: all (16) samples, uni-LSTM, 6 units -- `return_sequences=True`, trained for 200 iterations
show_features_1D(grads, n_rows=2)
show_features_2D(grads, n_rows=4, norm=(-.01, .01))
- Both plots show the LSTM performing clearly better after 180 additional iterations
- Gradient still vanishes for about half the timesteps
- All LSTM units better capture the time dependencies of one particular sample (blue curve, all plots), which the heatmap reveals to be the first sample. We can plot that sample vs. the other samples to try to understand the difference (see the sketch below)
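A simple way to make that contrast, reusing show_features_1D's color argument (the color choices are arbitrary):
colors = ['blue'] + ['gray'] * (len(grads) - 1)   # highlight sample 0, gray out the rest
show_features_1D(grads, n_rows=2, color=colors)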
EX 4: 2D vs. 1D, uni-LSTM, 256 units -- `return_sequences=True`, trained for 200 iterations
show_features_1D(grads[0])
show_features_2D(grads[:, :, 0], norm=(-.0001, .0001))
- 2D is better suited for comparing many channels across few samples
- 1D is better suited for comparing many samples across a few channels
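Note the slicing that makes the two views possible; shapes per the I/O section (256 units assumed):
print(grads.shape)           # (samples, timesteps, channels), e.g. (16, 100, 256)
print(grads[0].shape)        # (timesteps, 256): one sample, all channels -> show_features_1D
print(grads[:, :, 0].shape)  # (16, timesteps):  all samples, channel 0  -> show_features_2D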
EX 5: bi-GRU, 256 units (512 total) -- `return_sequences=True`, trained for 400 iterations
show_features_2D(grads[0], norm=(-.0001, .0001), reflect_half=True)
- Backward layer's gradients are flipped for consistency w.r.t. the time axis (a sketch of this flip follows the list)
- The plot reveals a lesser-known advantage of Bi-RNNs - information utility: the collective gradient covers about twice the data. However, this isn't a free lunch: each layer is an independent feature extractor, so learning isn't really complemented
- A lower `norm` for more units is expected, as approx. the same loss-derived gradient is being distributed across more parameters (hence the squared numeric average is smaller)
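For reference, the flip itself is simple; a sketch of what the flipping amounts to, assuming the backward layer's units occupy the second half of the channel axis (as with Keras Bidirectional's default concat merge) and need time-reversal:
units = grads.shape[-1] // 2                     # e.g. 256 forward + 256 backward
fwd, bwd = grads[..., :units], grads[..., units:]
bwd_aligned = bwd[:, ::-1, :]                    # flip backward-layer grads along time
grads_aligned = np.concatenate([fwd, bwd_aligned], axis=-1)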
EX 6: 0D, all (16) samples, uni-LSTM, 6 units -- `return_sequences=False`, trained for 200 iterations
show_features_0D(grads)
- `return_sequences=False` utilizes only the last timestep's gradient (which is still derived from all timesteps, unless using truncated BPTT), requiring a new approach
- Plot color-codes each RNN unit consistently across samples for comparison (can use one color instead)
- Evaluating gradient flow here is less direct and more theoretically involved. One simple approach is to compare distributions at the beginning vs. later in training: if the difference isn't significant, the RNN is doing poorly at learning long-term dependencies (a minimal sketch follows)
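A minimal sketch of that before-vs-later comparison; grads_before & grads_after are assumed captures from early and late training, and the summary statistics are one arbitrary choice:
for name, g in [('before', grads_before), ('after', grads_after)]:
    print("{}: mean|grad| = {:.2e}, std = {:.2e}".format(
          name, np.abs(g).mean(), g.std()))
# if the two distributions barely differ, long-term dependencies likely weren't learned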
EX 7: LSTM vs. GRU vs. SimpleRNN, unidir, 256 units -- `return_sequences=True`, trained for 250 iterations
show_features_2D(grads, n_rows=8, norm=(-.0001, .0001), show_xy_ticks=[0,0], show_title=False)
- Note: the comparison isn't very meaningful; each network thrives w/ different hyperparameters, whereas the same ones were used for all. LSTM, for one, bears the most parameters per unit, drowning out SimpleRNN (see the parameter count check below)
- In this setup, LSTM definitively stomps GRU and SimpleRNN
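On the parameters-per-unit point: per unit, LSTM carries 4 gate weight sets, GRU 3, SimpleRNN 1 - easy to confirm via Keras (a quick check, assuming 256 units and the toy input shape from earlier):
from tensorflow.keras.layers import Input, LSTM, GRU, SimpleRNN
from tensorflow.keras.models import Model

ipt = Input(batch_shape=(16, 100, 4))
for cell in (LSTM, GRU, SimpleRNN):
    m = Model(ipt, cell(256, return_sequences=True)(ipt))
    print(cell.__name__, m.count_params())   # ~4x, ~3x, 1x the SimpleRNN count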
Visualization functions:
import numpy as np
import matplotlib.pyplot as plt
from keras import backend as K

def get_rnn_gradients(model, input_data, labels, layer_idx=None, layer_name=None,
                      sample_weights=None):
    layer = _get_layer(model, layer_idx, layer_name)
    grads_fn = _make_grads_fn(model, layer)
    if sample_weights is None:
        sample_weights = np.ones(len(input_data))
    grads = grads_fn([input_data, sample_weights, labels, 1])  # 1 = train mode
    while isinstance(grads, list):  # unwrap K.function's nested list output
        grads = grads[0]
    return grads
def _make_grads_fn(model, layer):
    # gradients of the model loss w.r.t. the layer's *outputs* (not weights)
    grads = model.optimizer.get_gradients(model.total_loss, layer.output)
    return K.function(inputs=[model.inputs[0], model.sample_weights[0],
                              model._feed_targets[0], K.learning_phase()],
                      outputs=grads)
def _get_layer(model, layer_idx=None, layer_name=None):
    if layer_idx is not None:
        return model.layers[layer_idx]
    layers = [layer for layer in model.layers if layer_name in layer.name]
    if len(layers) > 1:
        print("WARNING: multiple matching layer names found; picking earliest")
    return layers[0]
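The listing above targets the TF1/Graph-mode Keras API (optimizer.get_gradients, K.function, _feed_targets); the full repository also covers TF2/Eager. For orientation only, a hypothetical Eager-mode counterpart using tf.GradientTape might look like this (assumes tf.keras, a single-output model, and a standard string/class loss):
import tensorflow as tf

def get_rnn_gradients_eager(model, input_data, labels, layer_idx):
    # expose the target layer's output alongside the model's prediction
    grad_model = tf.keras.Model(model.inputs,
                                [model.layers[layer_idx].output, model.output])
    loss_fn = tf.keras.losses.get(model.loss)  # assumes e.g. 'binary_crossentropy'
    with tf.GradientTape() as tape:
        layer_out, preds = grad_model(input_data, training=True)
        loss = loss_fn(labels, preds)
    return tape.gradient(loss, layer_out).numpy()  # d(loss)/d(layer outputs)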
def show_features_1D(data, n_rows=None, label_channels=True,
                     equate_axes=True, max_timesteps=None, color=None,
                     show_title=True, show_borders=True, show_xy_ticks=[1, 1],
                     title_fontsize=14, channel_axis=-1,
                     scale_width=1, scale_height=1, dpi=76):
    def _get_title(data):
        if len(data.shape) == 3:
            return "((Gradients vs. Timesteps) vs. Samples) vs. Channels"
        else:
            return "(Gradients vs. Timesteps) vs. Channels"

    def _get_feature_outputs(data, subplot_idx):
        # one curve per sample (3D input), or a single curve (2D input)
        if len(data.shape) == 3:
            return [entry[:, subplot_idx][:max_timesteps] for entry in data]
        else:
            return [data[:, subplot_idx][:max_timesteps]]

    if len(data.shape) not in (2, 3):
        raise Exception("`data` must be 2D or 3D")
    if len(data.shape) == 3:
        n_features = data[0].shape[channel_axis]
    else:
        n_features = data.shape[channel_axis]
    n_rows = n_rows or 1
    n_cols = int(n_features / n_rows)

    if color is None:
        n_colors = len(data) if len(data.shape) == 3 else 1
        color = [None] * n_colors

    fig, axes = plt.subplots(n_rows, n_cols, sharey=equate_axes, dpi=dpi)
    axes = np.asarray(axes)

    if show_title:
        plt.suptitle(_get_title(data), weight='bold', fontsize=title_fontsize)
    fig.set_size_inches(12 * scale_width, 8 * scale_height)

    for ax_idx, ax in enumerate(axes.flat):  # one subplot per channel
        feature_outputs = _get_feature_outputs(data, ax_idx)
        for idx, feature_output in enumerate(feature_outputs):
            ax.plot(feature_output, color=color[idx])
        ax.axis(xmin=0, xmax=len(feature_outputs[0]))
        if not show_xy_ticks[0]:
            ax.set_xticks([])
        if not show_xy_ticks[1]:
            ax.set_yticks([])
        if label_channels:
            ax.annotate(str(ax_idx), weight='bold',
                        color='g', xycoords='axes fraction',
                        fontsize=16, xy=(.03, .9))
        if not show_borders:
            ax.set_frame_on(False)

    if equate_axes:  # common symmetric y-limits across all subplots
        y_new = np.max(np.abs([ax.get_ylim() for ax in axes.flat]))
        for ax in axes.flat:
            ax.set_ylim(-y_new, y_new)
    plt.show()
def show_features_2D(data, n_rows=None, norm=