I have a TensorFlow/Keras model that I am training on a synthetic classification task.
When I train the model on my laptop, it achieves 99.9% accuracy and loss values around 1e-8.
However, when I train the model on a different machine, the accuracy plateaus at 80% and the loss is stuck at 3e-1. I have reproduced the failure on my own server and Google Colab.
Since the issue appears to be a configuration difference on my laptop, I am trying to pin down what that difference is.
I have made sure that on both machines:
- Python version is 3.7
- Nvidia driver is 460.x.x
- CUDA version is 11.2
- TensorFlow version is 2.4, installed from pip
- Numpy version is 1.19.5
- Scipy version is 1.4.1
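To double-check the versions listed above, I run a small audit script on each machine and diff its output (a generic sketch, not part of my training code; it only reports what is importable):

```python
import importlib
import sys

# Print the interpreter version and the version of each relevant package,
# so the output can be diffed across machines.
print("python:", sys.version.split()[0])
for pkg in ("tensorflow", "numpy", "scipy"):
    try:
        mod = importlib.import_module(pkg)
        print(pkg + ":", mod.__version__)
    except ImportError:
        print(pkg + ": not installed")
```

Both machines produce identical output from this script.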
The laptop has an i7-7700HQ and a NVIDIA GeForce GTX 1050 Mobile.
The server has a Xeon Silver 4116 and several GPUs: TITAN Xp, TITAN V, GeForce RTX 2080 SUPER, TITAN V (I have tried all of them).
The problem happens both on CPU and GPU. Precision is set to float32 in all cases.
The code that is being run is exactly the same.
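For completeness, this is roughly how the obvious sources of randomness can be pinned before training in TF 2.4 (a hypothetical sketch, not my actual code; `TF_DETERMINISTIC_OPS` is TF 2.4's environment flag for requesting deterministic GPU kernels):

```python
import os
import random

# Seed everything before TensorFlow is imported.
os.environ["PYTHONHASHSEED"] = "0"
os.environ["TF_DETERMINISTIC_OPS"] = "1"  # request deterministic GPU kernels (TF 2.x flag)
random.seed(0)

import numpy as np
np.random.seed(0)

try:
    import tensorflow as tf
    tf.random.set_seed(0)
except ImportError:
    pass  # TensorFlow not installed in this sketch environment
```

Even with seeds pinned like this, the two machines still diverge, so I do not think plain RNG seeding is the explanation.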
I cannot share the code, but I can say that it uses tf.math.segment_sum, which is a non-deterministic op on GPU (I don't know whether that is relevant).
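To illustrate what tf.math.segment_sum computes (summing the rows of a tensor that share the same segment id), here is a NumPy sketch of its semantics; the unordered accumulation in np.add.at loosely mirrors why the GPU kernel's floating-point additions can happen in varying order:

```python
import numpy as np

def segment_sum(data, segment_ids):
    """NumPy sketch of tf.math.segment_sum semantics.

    Sums the rows of `data` that share the same segment id.
    Assumes `segment_ids` are sorted and non-negative, as the TF op requires.
    """
    n_segments = segment_ids[-1] + 1
    out = np.zeros((n_segments,) + data.shape[1:], dtype=data.dtype)
    np.add.at(out, segment_ids, data)  # accumulate rows into their segments
    return out

print(segment_sum(np.array([1.0, 2.0, 3.0, 4.0]),
                  np.array([0, 0, 1, 1])))  # [3. 7.]
```

Since non-determinism from this op should only appear on GPU, and my problem also reproduces on CPU, I suspect it is not the root cause.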
I am at a complete loss here. I have looked for every discrepancy I can think of between the two configurations and found none. The fact that the issue also happens on CPU is what really puzzles me.
What could the problem be?
I hope this qualifies as a programming question since it's related to TensorFlow specifically. If not, I apologize in advance and will ask elsewhere.
Thanks
question from:
https://stackoverflow.com/questions/66050281/tensorflow-model-can-only-achieve-good-results-on-one-computer-fails-everywhere