
python - Tensorflow: Using Parameter Servers in Distributed Training

It's not totally clear how parameter servers know what to do in distributed TensorFlow training.

For example, in this SO question, the following code is used to configure parameter server and worker tasks:

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":
    ##some training code

How does server.join() indicate the given task should be a parameter server? Is parameter serving a kind of default behavior for tasks? Is there anything else you can/should tell a parameter serving task to do?

Edit: This SO question addresses part of my question: "The logic there makes sure that Variable objects are assigned evenly to workers that act as parameter servers." But how does a parameter server know it is a parameter server? Is server.join() enough?


1 Reply


TL;DR: TensorFlow doesn't know anything about "parameter servers", but instead it supports running graphs across multiple devices in different processes. Some of these processes have devices whose names start with "/job:ps", and these hold the variables. The workers drive the training process, and when they run the train_op they will cause work to happen on the "/job:ps" devices, which will update the shared variables.

The server.join() method simply tells TensorFlow to block and listen for requests until the server shuts down (which currently means it blocks forever, or until you kill the process, since clean shutdown isn't currently implemented).
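
As a rough sketch of that setup (the hostnames, ports, and flag defaults below are placeholders I'm assuming for illustration, not something prescribed by TensorFlow), every process builds the same cluster spec, creates a tf.train.Server for its own job and task, and then either blocks (PS) or runs the training code (worker):

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("job_name", "worker", "Either 'ps' or 'worker'.")
flags.DEFINE_integer("task_index", 0, "Index of this task within its job.")
FLAGS = flags.FLAGS

# Every process must agree on the same cluster layout.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# This process hosts the devices for its own job/task, e.g. "/job:ps/task:0".
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)

if FLAGS.job_name == "ps":
    # Nothing else marks this process as a parameter server: it simply hosts
    # the "/job:ps" devices and blocks here, serving requests from the workers.
    server.join()
elif FLAGS.job_name == "worker":
    # The worker builds the graph (placing variables on "/job:ps" devices)
    # and drives the training loop; see the sketches below.
    pass  # some training code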

In the example in my previous answer, the PS tasks are passive, and everything is controlled by the worker tasks in ## some training code. If you split your code across multiple devices, TensorFlow will add the appropriate communication, and this extends to devices in different processes. The with tf.device(tf.train.replica_device_setter(...)): block tells TensorFlow to put each variable on a different PS task by setting its device to "/job:ps/task:{i}" (for different values of {i}, chosen in a round-robin fashion).
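
You can see that round-robin placement without starting any servers at all, because the device setter only manipulates device strings. A small sketch (the three PS tasks and the variable shapes are just assumptions for the example):

import tensorflow as tf

# Pretend there are three PS tasks. Variables created inside this block are
# assigned to "/job:ps/task:0", "/job:ps/task:1", ... in round-robin order,
# while non-variable ops stay on the worker device.
with tf.device(tf.train.replica_device_setter(
        ps_tasks=3, worker_device="/job:worker/task:0")):
    v1 = tf.get_variable("v1", shape=[10])
    v2 = tf.get_variable("v2", shape=[10])
    v3 = tf.get_variable("v3", shape=[10])
    s = v1 + v2 + v3  # an ordinary op, not a variable

print(v1.device)  # "/job:ps/task:0"
print(v2.device)  # "/job:ps/task:1"
print(v3.device)  # "/job:ps/task:2"
print(s.device)   # "/job:worker/task:0"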

When you call sess.run(train_op), TensorFlow will run a graph that depends on the variables and includes the operations that update them. This part of the computation happens on the "/job:ps" devices, so those devices act like a parameter server.
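
Putting it together on the worker side, a minimal sketch of what ## some training code might contain (the toy loss and optimizer are placeholders; cluster and server come from the setup sketched above):

# Runs in the `elif FLAGS.job_name == "worker":` branch.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # These variables are placed on the "/job:ps" devices.
    w = tf.get_variable("w", shape=[10])
    loss = tf.reduce_sum(tf.square(w))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
    init_op = tf.global_variables_initializer()

# Connecting the session to server.target makes this worker the client that
# drives the computation; the variable reads and updates inside train_op
# execute on the "/job:ps" devices.
with tf.Session(server.target) as sess:
    sess.run(init_op)
    for _ in range(100):
        sess.run(train_op)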

