Let's say I have a network with the following parameters:
- fully convolutional network for semantic segmentation
- loss = weighted binary cross entropy (but it could be any loss function, doesn't matter)
- 5 classes - inputs are images and ground truths are binary masks
- Batch size = 16
Now, I know that the loss is calculated in the following manner: binary cross entropy is applied to each pixel in the image with respect to each class. So essentially, each pixel will have 5 loss values.
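To make that step concrete, here is a minimal pure-Python sketch of element-wise binary cross entropy for a single pixel with 5 class channels. The function name `bce_per_class`, the `eps` clipping value, and the sample predictions are all made up for illustration; this is not the Keras implementation.

```python
import math

def bce_per_class(y_true, y_pred, eps=1e-7):
    """Element-wise binary cross entropy: one loss value per class channel."""
    losses = []
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0 and 1
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return losses

# One pixel, 5 class channels (hypothetical values).
pixel_true = [0, 0, 1, 0, 0]
pixel_pred = [0.1, 0.2, 0.7, 0.1, 0.4]

pixel_losses = bce_per_class(pixel_true, pixel_pred)
print(len(pixel_losses))  # one loss value per class
```

For a 256x256 image with 5 classes, this step alone would produce 256 * 256 * 5 values per image, which is why the later reduction matters.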
What happens after this step?
When I train my network, it prints only a single loss value for an epoch.
There are many levels of loss accumulation that need to happen to produce a single value, and how it happens is not clear at all in the docs/code.
- What gets combined first: (1) the loss values of the classes (for instance, the 5 values, one per class, are combined per pixel) and then all the pixels in the image, or (2) all the pixels in the image for each individual class, and then all the class losses?
- How exactly are these different pixel combinations happening - where is it being summed / where is it being averaged?
- Keras's binary_crossentropy averages over axis=-1. So is this an average of all the pixels per class, an average of all the classes, or both?
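To illustrate the two orderings, here is a NumPy sketch (not the Keras source) of what a mean over axis=-1 does to a channels-last array of per-pixel, per-class losses. The array `per_element` is random stand-in data, and the image size is shrunk to 8x8 to keep it small. What Keras does with the remaining axes is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for element-wise BCE values: (batch, height, width, classes).
per_element = rng.random((16, 8, 8, 5))

# Mean over the last axis: with channels last, that is the class axis.
per_pixel = per_element.mean(axis=-1)
print(per_pixel.shape)  # (16, 8, 8)

# Option (1): classes first, then pixels, giving one value per image.
classes_then_pixels = per_pixel.mean(axis=(1, 2))

# Option (2): pixels first (per class), then classes.
per_class = per_element.mean(axis=(1, 2))      # shape (16, 5)
pixels_then_classes = per_class.mean(axis=-1)  # shape (16,)

# For plain, unweighted means the two orders agree up to float error.
print(np.allclose(classes_then_pixels, pixels_then_classes))
```

The two orders only coincide because every pixel has the same number of classes and no weights are involved; once per-pixel or per-class weights enter, the order can matter.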
To state it in a different way: how are the losses for different classes combined to produce a single loss value for an image?
This is not explained in the docs at all, and an explanation would be very helpful for people doing multi-class predictions in Keras, regardless of the type of network. Here is the link to the start of the Keras code where one first passes in the loss function.
The closest thing I could find to an explanation is the following, from Keras:
"loss: String (name of objective function) or objective function. See losses. If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses."
So does this mean that the losses for each class in the image are simply summed?
Example code here for someone to try it out. Here's a basic implementation borrowed from Kaggle and modified for multi-label prediction:
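import tensorflow as tf
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Lambda, Conv2D, MaxPooling2D, Conv2DTranspose, concatenate

# Build U-Net model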
num_classes = 5
IMG_DIM = 256
IMG_CHAN = 3
weights = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1000} #chose an extreme value just to check for any reaction
inputs = Input((IMG_DIM, IMG_DIM, IMG_CHAN))
s = Lambda(lambda x: x / 255) (inputs)
c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (s)
c1 = Conv2D(8, (3, 3), activation='relu', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)
c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (p1)
c2 = Conv2D(16, (3, 3), activation='relu', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (p2)
c3 = Conv2D(32, (3, 3), activation='relu', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)
c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (p3)
c4 = Conv2D(64, (3, 3), activation='relu', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)
c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (p4)
c5 = Conv2D(128, (3, 3), activation='relu', padding='same') (c5)
u6 = Conv2DTranspose(64, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (u6)
c6 = Conv2D(64, (3, 3), activation='relu', padding='same') (c6)
u7 = Conv2DTranspose(32, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (u7)
c7 = Conv2D(32, (3, 3), activation='relu', padding='same') (c7)
u8 = Conv2DTranspose(16, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (u8)
c8 = Conv2D(16, (3, 3), activation='relu', padding='same') (c8)
u9 = Conv2DTranspose(8, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (u9)
c9 = Conv2D(8, (3, 3), activation='relu', padding='same') (c9)
outputs = Conv2D(num_classes, (1, 1), activation='sigmoid') (c9)
def weighted_loss(weightsList):
    # BCE_loss and dice_coef come from the BCE-DICE loss referenced below
    def lossFunc(true, pred):
        axis = -1  # if channels last
        # axis = 1  # if channels first
        # Index of the class selected for each pixel
        classSelectors = K.argmax(true, axis=axis)
        # One boolean mask per class, then cast to float
        classSelectors = [K.equal(tf.cast(i, tf.int64), tf.cast(classSelectors, tf.int64)) for i in range(len(weightsList))]
        classSelectors = [K.cast(x, K.floatx()) for x in classSelectors]
        # Look weights up by class index; zipping over a dict would yield its keys, not its values
        weights = [sel * weightsList[i] for i, sel in enumerate(classSelectors)]
        weightMultiplier = weights[0]
        for i in range(1, len(weights)):
            weightMultiplier = weightMultiplier + weights[i]
        loss = BCE_loss(true, pred) - (1+dice_coef(true, pred))
        loss = loss * weightMultiplier
        return loss
    return lossFunc

# weighted_loss has to be defined before it is passed to compile();
# mean_iou is a custom metric defined elsewhere
model = Model(inputs=[inputs], outputs=[outputs])
model.compile(optimizer='adam', loss=weighted_loss(weights), metrics=[mean_iou])
model.summary()
The actual BCE-DICE loss function can be found here.
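As a sanity check of what `weightMultiplier` is meant to contain, here is a NumPy re-creation of the selector logic on a tiny hand-made mask. It is only a sketch: `weights_list` and the `true` array are illustrative, and NumPy's `argmax` (which returns the first maximum on ties) stands in for `K.argmax`, whose tie behaviour may differ.

```python
import numpy as np

weights_list = [1, 1, 1, 1, 1000]  # same values as the weights used above

# Hypothetical ground truth: 1 image, 1 row of 3 pixels, 5 binary masks.
true = np.array([[[[0, 0, 0, 0, 1],     # pixel belongs to class 4 only
                   [1, 0, 0, 0, 1],     # pixel belongs to classes 0 and 4
                   [0, 0, 0, 0, 0]]]],  # pixel belongs to no class
                dtype=np.float32)

# One class index per pixel: shape (1, 1, 3).
class_idx = np.argmax(true, axis=-1)

# Sum of (selector * weight) over classes, as in lossFunc.
weight_multiplier = np.zeros(class_idx.shape, dtype=np.float32)
for i, w in enumerate(weights_list):
    weight_multiplier += (class_idx == i).astype(np.float32) * w

print(weight_multiplier)  # per-pixel weights: 1000, 1 and 1
```

In this sketch each pixel ends up with exactly one weight, taken from its argmax class. A pixel with several masks set, or with none, gets the weight of whichever index argmax returns, so the weight of 1000 only applies where class 4 is the sole or first maximum.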
Motivation for the question: Based on the above code, the total validation loss of the network after 20 epochs is ~1%; however, the mean intersection over union scores for the first 4 classes are above 95% each, while for the last class it's 23%, clearly indicating that the 5th class isn't doing well at all. However, this drop in accuracy isn't reflected in the loss at all. That suggests the individual losses for a sample are being combined in a way that completely masks the large loss we see for the 5th class, so when the per-sample losses are combined over the batch, the result is still really low. I'm not sure how to reconcile this information.
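One way this could happen, sketched with made-up numbers: if the per-pixel losses are simply averaged over pixels and classes, a class that is present in only a small fraction of pixels contributes very little to the mean, even when its foreground pixels are predicted badly. Every value below is hypothetical and chosen only to show the arithmetic; it is not measured from the model above.

```python
# All numbers below are hypothetical, chosen only to illustrate the effect.
num_classes = 5
good_loss = 0.01      # per-pixel BCE where the prediction is good
bad_loss = 2.0        # per-pixel BCE on badly predicted foreground pixels
rare_fraction = 0.01  # share of pixels where the 5th class is present

# Classes 0-3: every pixel is predicted well.
common_class_mean = good_loss

# Class 4: background pixels are fine, the few foreground pixels are not.
rare_class_mean = (1 - rare_fraction) * good_loss + rare_fraction * bad_loss

# Plain mean over the 5 classes.
overall_mean = ((num_classes - 1) * common_class_mean + rare_class_mean) / num_classes

print(round(rare_class_mean, 4))  # 0.0299
print(round(overall_mean, 4))     # 0.014
```

Under these assumptions the overall mean stays close to the "good" per-pixel loss, while an overlap metric such as IoU for the rare class would be poor, which matches the pattern described above.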