python - How to handle non-determinism when training on a GPU?

Question

Welcome To Ask or Share your Answers For Others

python - How to handle non-determinism when training on a GPU?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - How to handle non-determinism when training on a GPU?

While tuning the hyperparameters to get my model to perform better, I noticed that the score I get (and hence the model that is created) is different every time I run the code despite fixing all the seeds for random operations. This problem does not happen if I run on CPU.

I googled and found out that this is a common issue when using a GPU to train. Here is a very good/detailed example with short code snippets to verify the existence of that problem.

They pinpointed the non-determinism to "tf.reduce_sum" function. However, that is not the case for me. it could be because I'm using different hardware (1080 TI) or a different version of CUDA libraries or Tensorflow. It seems like there are many different parts of the CUDA libraries that are non-deterministic and it doesn't seem easy to figure out exactly which part and how to get rid of it. Also, this must have been by design, so it's likely that there is a sufficient efficiency increase in exchange for non-determinism.

So, my question is:

Since GPUs are popular for training NNs, people in this field must have a way to deal with non-determinism, because I can't see how else you'd be able to reliably tune the hyperparameters. What is the standard way to handle non-determinism when using a GPU?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:34:58+0000

TL;DR

Non-determinism for a priori deterministic operations come from concurrent (multi-threaded) implementations.
Despite constant progress on that front, TensorFlow does not currently guarantee determinism for all of its operations. After a quick search on the internet, it seems that the situation is similar with the other major toolkits.
During training, unless you are debugging an issue, it is OK to have fluctuations between runs. Uncertainty is in the nature of training, and it is wise to measure it and take it into account when comparing results – even when toolkits eventually reach perfect determinism in training.

That, but much longer

When you see neural network operations as mathematical operations, you would expect everything to be deterministic. Convolutions, activations, cross-entropy – everything here are mathematical equations and should be deterministic. Even pseudo-random operations such as shuffling, drop-out, noise and the likes, are entirely determined by a seed.

When you see those operations from their computational implementation, on the other hand, you see them as massively parallelized computations, which can be source of randomness unless you are very careful.

The heart of the problem is that, when you run operations on several parallel threads, you typically do not know which thread will end first. It is not important when threads operate on their own data, so for example, applying an activation function to a tensor should be deterministic. But when those threads need to synchronize, such as when you compute a sum, then the result may depend on the order of the summation, and in turn, on the order of which thread ended first.

From there, you have broadly speaking two options:

Keep non-determinism associated with simpler implementations.
Take extra care in the design of your parallel algorithm to reduce or remove non-determinism in your computation. The added constraint usually results in slower algorithms

Which route takes CuDNN? Well, mostly the deterministic one. In recent releases, deterministic operations are the norm rather than the exception. But it used to offer many non-deterministic operations, and more importantly, it used to not offer some operations such as reduction, that people needed to implement themselves in CUDA with a variable degree of consideration to determinism.

Some libraries such as theano were more ahead of this topic, by exposing early on a deterministic flag that the user could turn on or off – but as you can see from its description, it is far from offering any guarantee.

If more, sometimes we will select some implementation that are more deterministic, but slower. In particular, on the GPU, we will avoid using AtomicAdd. Sometimes we will still use non-deterministic implementaion, e.g. when we do not have a GPU implementation that is deterministic. Also see the dnn.conv.algo* flags to cover more cases.

In TensorFlow, the realization of the need for determinism has been rather late, but it's slow getting there – helped by the advance of CuDNN on that front also. For a long time, reductions have been non-deterministic, but now they seem to be deterministic. The fact that CuDNN introduced deterministic reductions in version 6.0 may have helped of course.

It seems that currently, the main obstacle for TensorFlow towards determinism is the backward pass of the convolution. It is indeed one of the few operations for which CuDNN proposes a non-deterministic algorithm, labeled CUDNN_CONVOLUTION_BWD_FILTER_ALGO_0. This algorithm is still in the list of possible choices for the backward filter in TensorFlow. And since the choice of the filter seems to be based on performance, it could indeed be picked if it is more efficient. (I am not so familiar with TensorFlow's C++ code so take this with a grain of salt.)

Is this important?

If you are debugging an issue, determinism is not merely important: it is mandatory. You need to reproduce the steps that led to a problem. This is currently a real issue with toolkits like TensorFlow. To mitigate this problem, your only option is to debug live, adding checks and breakpoints at the correct locations – not great.

Deployment is another aspect of things, where it is often desirable to have a deterministic behavior, in part for human acceptance. While nobody would reasonably expect a medical diagnosis algorithm to never fail, it would be awkward that a computer could give the same patient a different diagnosis depending on the run. (Although doctors themselves are not immune to this kind of variability.)

Those reasons are rightful motivations to fix non-determinism in neural networks.

For all other aspects, I would say that we need to accept, if not embrace, the non-deterministic nature of neural net training. For all purposes, training is stochastic. We use stochastic gradient descent, shuffle data, use random initialization and dropout – and more importantly, training data is itself but a random sample of data. From that standpoint, the fact that computers can only generate pseudo-random numbers with a seed is an artifact. When you train, your loss is a value that also comes with a confidence interval due to this stochastic nature. Comparing those values to optimize hyper-parameters while ignoring those confidence intervals does not make much sense – therefore it is vain, in my opinion, to spend too much effort fixing non-determinism in that, and many other, cases.

Categories

python - How to handle non-determinism when training on a GPU?

python - How to handle non-determinism when training on a GPU?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags