TL;DR: TensorFlow doesn't know anything about "parameter servers", but instead it supports running graphs across multiple devices in different processes. Some of these processes have devices whose names start with `"/job:ps"`, and these hold the variables. The workers drive the training process, and when they run the `train_op` they will cause work to happen on the `"/job:ps"` devices, which will update the shared variables.
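For concreteness, here is a minimal sketch of such a cluster using the TF 1.x API; the addresses and task counts are placeholders, not part of the original example:

```python
import tensorflow as tf

# One "ps" task that holds the variables, two "worker" tasks that drive training.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each process starts one server, identified by its job name and task index.
# For example, the first worker process would run:
server = tf.train.Server(cluster, job_name="worker", task_index=0)
```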
The `server.join()` method simply tells TensorFlow to block and listen for requests until the server shuts down (which currently means it blocks forever, or until you kill the process, since clean shutdown isn't currently implemented).
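A typical PS process therefore does nothing but start its server and block; a minimal sketch, reusing the same placeholder cluster as above:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# The PS process just serves variable reads/updates over RPC.
server = tf.train.Server(cluster, job_name="ps", task_index=0)
server.join()  # Blocks forever (or until the process is killed).
```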
In the example in my previous answer, the PS tasks are passive, and everything is controlled by the worker tasks... in the `# some training code` block. If you split your code across multiple devices, TensorFlow will add the appropriate communication, and this extends to devices in different processes. The `with tf.device(tf.train.replica_device_setter(...)):` block tells TensorFlow to put each variable on a different PS task by setting its device to `"/job:ps/task:{i}"` (for different values of `{i}`, chosen in a round-robin fashion).
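A small sketch of that placement behaviour; the variable names and shapes are made up, and two placeholder PS tasks are used so the round-robin assignment is visible:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222", "localhost:2225"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Variables created inside this block are pinned to the PS tasks in
# round-robin order; other ops default to the worker that builds the graph.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.get_variable("w", shape=[784, 10])  # -> /job:ps/task:0
    b = tf.get_variable("b", shape=[10])       # -> /job:ps/task:1
```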
When you call `sess.run(train_op)`, TensorFlow will run a graph that depends on the variables and includes the operations that update them. That part of the computation will happen on the `"/job:ps"` devices, so those devices will act like a parameter server.
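To make the flow concrete, here is a toy worker-side sketch under the same placeholder cluster; the single variable and trivial loss are stand-ins, and it assumes a PS process is already listening on the PS address:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    w = tf.get_variable("w", initializer=tf.constant(5.0))  # lives on /job:ps/task:0
    loss = tf.square(w)                                      # computed on the worker
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(10):
        # Each step reads w from the PS, computes the gradient on the worker,
        # and runs the update op, which executes on the "/job:ps" device.
        sess.run(train_op)
```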