Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
264 views
in Technique[技术] by (71.8m points)

python - Distributed Training with tf.estimator resulting in more training steps

I am experimenting with distributed training options on Cloud ML Engine and I observing some peculiar results. I have basically altered the census custom estimator example to contain a slightly different model and changed my loss function to AdamOptimizer as the only real changes. Based on this other thread, my understanding is that any distributed training should be data-parallel asynchronous training which would suggest "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." In my experiment, I have ~650k training examples and I am running the following experiments for 1 epoch with a batch size of 128. Given 650k training examples and a batch size of 128, I would expect there to be ~5.1k steps in an epoch. Here is the performance that I am seeing for different --scale-tier's

NOT DISTRIBUTED

  • BASIC: 8 steps/sec, 5.1k steps, 11 minute wall time
  • BASIC_GPU: 24 steps/sec, 5.1k steps, 3.5 minute wall time

DISTRIBUTED

  • STANDARD_1: 14.5 steps/sec -- 26k steps (26k*128 = ~3.3M which is way more than the training samples actually in the data), 29 min wall time

  • CUSTOM -- 5 complex_model_m workers, 2 large_model parameter servers: 27 steps/sec, 31k steps (128*31k = ~3.9M which is way more than the 650k training samples actually in the data), wall time 20 minutes

My expectation was that the data-parallel based on the article was that the distributed training would split up the batches amongst all of the workers so if I had 5 workers on ~5k batches, then each worker would perform ~1,000 batches. However, the actual behavior that I am observing is that it seems closer to EACH of the 5 workers performing 1 epoch themselves. When training in a distributed setting, there are 6x as many steps taken in an epoch as there are training examples -- I know that the true definition of a step is each time the gradients are updated, but my understanding of data parallel training is that this would just split up the batches so there should be the same number of gradient updates -- is there any reason why this behavior would be expected? Would it make sense for there to be more train steps needed in a data-parallel asynchronous training distributed environment? Can anybody explain the behavior that I am observing?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)
Waitting for answers

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...