I am experimenting with distributed training options on Cloud ML Engine and I observing some peculiar results. I have basically altered the census custom estimator example to contain a slightly different model and changed my loss function to AdamOptimizer as the only real changes. Based on this other thread, my understanding is that any distributed training should be data-parallel asynchronous training which would suggest "If you distribute 10,000 batches among 10 worker nodes, each node works on roughly 1,000 batches." In my experiment, I have ~650k training examples and I am running the following experiments for 1 epoch with a batch size of 128. Given 650k training examples and a batch size of 128, I would expect there to be ~5.1k steps in an epoch. Here is the performance that I am seeing for different --scale-tier
's
NOT DISTRIBUTED
- BASIC: 8 steps/sec, 5.1k steps, 11 minute wall time
- BASIC_GPU: 24 steps/sec, 5.1k steps, 3.5 minute wall time
DISTRIBUTED
STANDARD_1: 14.5 steps/sec -- 26k steps (26k*128 = ~3.3M which is way more than the training samples actually in the data), 29 min wall time
CUSTOM -- 5 complex_model_m workers, 2 large_model parameter servers: 27 steps/sec, 31k steps (128*31k = ~3.9M which is way more than the 650k training samples actually in the data), wall time 20 minutes
My expectation was that the data-parallel based on the article was that the distributed training would split up the batches amongst all of the workers so if I had 5 workers on ~5k batches, then each worker would perform ~1,000 batches. However, the actual behavior that I am observing is that it seems closer to EACH of the 5 workers performing 1 epoch themselves. When training in a distributed setting, there are 6x as many steps taken in an epoch as there are training examples -- I know that the true definition of a step is each time the gradients are updated, but my understanding of data parallel training is that this would just split up the batches so there should be the same number of gradient updates -- is there any reason why this behavior would be expected? Would it make sense for there to be more train steps needed in a data-parallel asynchronous training distributed environment? Can anybody explain the behavior that I am observing?
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…