
python - Tensorflow: Multiple loss functions vs Multiple training ops

I am creating a Tensorflow model which predicts multiple outputs (with different activations). I think there are two ways to do this:

Method 1: Create multiple loss functions (one for each output), merge them (using tf.reduce_mean or tf.reduce_sum), and pass the result to a single training op, like so:

final_loss = tf.reduce_mean(loss1 + loss2)
train_op = tf.train.AdamOptimizer().minimize(final_loss)

Method 2: Create multiple training operations and then group them like so:

train_op1 = tf.train.AdamOptimizer().minimize(loss1)
train_op2 = tf.train.AdamOptimizer().minimize(loss2)
final_train_op = tf.group(train_op1, train_op2)

My question is whether one method is advantageous over the other. Is there a third method I don't know about?

Thanks


1 Reply


I want to make a subtle point that I don't think was made in previous answers.

If you were using something like GradientDescentOptimizer, these would be very similar operations. That's because taking gradients is a linear operation, and the gradient of a sum is the same as the sum of the gradients.

But Adam does something special: regardless of the scale of your loss, it scales the gradients so that they're always on the order of your learning rate. If you multiplied your loss by 1000, it wouldn't affect Adam, because the change would be normalized away.
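As a rough illustration of that point (a sketch only; the scale factor and learning rate are arbitrary, not taken from the question), minimizing a rescaled loss with Adam produces nearly the same parameter updates:

# Sketch: with AdamOptimizer, rescaling the loss barely changes the updates,
# because Adam divides each gradient by a running estimate of its magnitude,
# so the constant factor cancels (only epsilon and the first steps differ).
scaled_loss = 1000.0 * final_loss
train_op_scaled = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(scaled_loss)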

So, if your two losses are roughly the same magnitude, then it shouldn't make a difference. If one is much larger than the other, then keep in mind that summing before the minimization will essentially ignore the small one, while making two ops will spend equal effort minimizing both.

I personally like dividing them up, which gives you more control over how much to focus on one loss or the other. For example, if this were multi-task learning and one task were more important to get right than the other, two ops with different learning rates roughly accomplish this.
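A minimal sketch of that idea, assuming loss1 belongs to the more important task (the learning rates here are illustrative, not tuned values):

# Sketch: weight the tasks implicitly by giving each its own optimizer
# and learning rate, then run both updates in one step via tf.group.
train_op1 = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss1)  # primary task
train_op2 = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss2)  # secondary task
final_train_op = tf.group(train_op1, train_op2)

You would then run final_train_op once per batch in your session, just as with a single training op.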
