python - Training PyTorch models on different machines leads to different results

Question

Welcome To Ask or Share your Answers For Others

python - Training PyTorch models on different machines leads to different results

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Training PyTorch models on different machines leads to different results

I am training the same model on two different machines, but the trained models are not identical. I have taken the following measures to ensure reproducibility:

# set random number 
random.seed(0)
torch.cuda.manual_seed(0)
np.random.seed(0)

# set the cudnn
torch.backends.cudnn.benchmark=False
torch.backends.cudnn.deterministic=True

# set data loader work threads to be 0
DataLoader(dataset, num_works=0)

When I train the same model multiple times on the same machine, the trained model is always the same. However, the trained models on two different machines are not the same. Is this normal? Are there any other tricks I can employ?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:08:49+0000

There are a number of areas that could additionally introduce randomness e.g:

PyTorch random number generator

You can use torch.manual_seed() to seed the RNG for all devices (both CPU and CUDA):

CUDA convolution determinism

While disabling CUDA convolution benchmarking (discussed above) ensures that CUDA selects the same algorithm each time an application is run, that algorithm itself may be nondeterministic, unless either torch.use_deterministic_algorithms(True) or torch.backends.cudnn.deterministic = True is set. The latter setting controls only this behavior, unlike torch.use_deterministic_algorithms() which will make other PyTorch operations behave deterministically, too.

CUDA RNN and LSTM

In some versions of CUDA, RNNs and LSTM networks may have non-deterministic behavior. See torch.nn.RNN() and torch.nn.LSTM() for details and workarounds.

DataLoader

DataLoader will reseed workers following Randomness in multi-process data loading algorithm. Use worker_init_fn() to preserve reproducibility:

https://pytorch.org/docs/stable/notes/randomness.html

Categories

python - Training PyTorch models on different machines leads to different results

python - Training PyTorch models on different machines leads to different results

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

PyTorch random number generator

CUDA convolution determinism

CUDA RNN and LSTM

`DataLoader`

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags