TL;DR: TensorFlow doesn't know anything about "parameter servers", but instead it supports running graphs across multiple devices in different processes. Some of these processes have devices whose names start with `"/job:ps"`, and these hold the variables. The workers drive the training process, and when they run the `train_op` they will cause work to happen on the `"/job:ps"` devices, which will update the shared variables.
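For concreteness, here is a minimal sketch of such a cluster using the TF 1.x API; the addresses and task counts are placeholders, not part of the original example:

```python
import tensorflow as tf

# One "ps" task that holds the variables, two "worker" tasks that drive training.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each process starts one server, identified by its job name and task index.
# For example, the first worker process would run:
server = tf.train.Server(cluster, job_name="worker", task_index=0)
```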
The `server.join()` method simply tells TensorFlow to block and listen for requests until the server shuts down (which currently means it blocks forever, or until you kill the process, since clean shutdown isn't currently implemented).
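A typical PS process therefore does nothing but start its server and block; a minimal sketch, reusing the same placeholder cluster as above:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# The PS process just serves variable reads/updates over RPC.
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # Blocks forever (or until the process is killed).
```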
In the example in my previous answer, the PS tasks are passive, and everything is controlled by the worker tasks... in the `# some training code` block. If you split your code across multiple devices, TensorFlow will add the appropriate communication, and this extends to devices in different processes. The `with tf.device(tf.train.replica_device_setter(...)):` block tells TensorFlow to put each variable on a different PS task by setting its device to `"/job:ps/task:{i}"` (for different values of `{i}`, chosen in a round-robin fashion).
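A small sketch of that placement behaviour; the variable names and shapes are made up, and two placeholder PS tasks are used so the round-robin assignment is visible:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222", "localhost:2225"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Variables created inside this block are pinned to the PS tasks in
# round-robin order; other ops default to the worker that builds the graph.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.get_variable("w", shape=[784, 10])  # -> /job:ps/task:0
    b = tf.get_variable("b", shape=[10])       # -> /job:ps/task:1
```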
When you call `sess.run(train_op)`, TensorFlow will run a graph that depends on the variables and includes the operations that update them. That part of the computation will happen on the `"/job:ps"` devices, so those devices will act like a parameter server.
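To make the flow concrete, here is a toy worker-side sketch under the same placeholder cluster; the single variable and trivial loss are stand-ins, and it assumes a PS process is already listening on the PS address:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.get_variable("w", initializer=tf.constant(5.0))  # lives on /job:ps/task:0
    loss = tf.square(w)                                      # computed on the worker
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        # Each step reads w from the PS, computes the gradient on the worker,
        # and runs the update op, which executes on the "/job:ps" device.
        sess.run(train_op)
```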