As Lucas Ramos already mentioned, when using a DataLoader whose underlying dataset's size is not divisible by the batch size, the default behavior is to have a smaller last batch:
drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
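To see this default in action, here is a minimal sketch, assuming a toy dataset of 10 samples and a batch size of 4 (both values are chosen only for illustration and do not come from the question):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset (assumption): 10 samples, not divisible by a batch size of 4.
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1))

# Default (drop_last=False): the last batch is smaller.
keep_sizes = [batch[0].shape[0] for batch in DataLoader(dataset, batch_size=4)]

# drop_last=True: the incomplete last batch is discarded.
drop_sizes = [
    batch[0].shape[0]
    for batch in DataLoader(dataset, batch_size=4, drop_last=True)
]

print(keep_sizes)  # [4, 4, 2]
print(drop_sizes)  # [4, 4]
```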
Your plan is basically implementing gradient accumulation combined with drop_last=False, that is, having the last batch smaller than all the others.
Therefore, in principle there's nothing wrong with training with varying batch sizes.
However, there is something you need to fix in your code:
The loss is averaged over the mini-batch. So, if you process mini-batches in the usual way, you do not need to worry about it. However, when accumulating gradients you do this averaging explicitly by dividing the loss by iters_to_accumulate:
loss = loss / iters_to_accumulate
In the last mini-batch (with the smaller size) you need to change the value of iters_to_accumulate to reflect this smaller mini-batch size!
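A small numeric sketch of why the divisor matters, using plain Python numbers in place of tensors. The loss values and iters_to_accumulate = 4 are made up for illustration; the final group is assumed to hold only 2 batches:

```python
iters_to_accumulate = 4

# Hypothetical (already averaged) losses of the batches in the final, shorter group.
last_group_losses = [1.0, 0.5]

# Dividing by the nominal count under-weights the final group.
wrong = sum(loss / iters_to_accumulate for loss in last_group_losses)

# Dividing by the actual number of batches in the group gives their mean.
right = sum(loss / len(last_group_losses) for loss in last_group_losses)

print(wrong)  # 0.375
print(right)  # 0.75
```

With the fixed divisor, the accumulated loss (and therefore its gradient) for the final group is scaled down by a factor of 2/4 compared with the true mean.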
I propose the revised code below, which breaks the training loop into two: an outer loop over mini-batches and an inner one that accumulates gradients per mini-batch. Note how using an iter over the DataLoader helps break the training loop into two:
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for epoch in epochs:
    bi = 0  # index of the next batch to process
    # outer loop over mini-batches
    data_iter = iter(data)
    while bi < len(data):
        # determine the range for this mini-batch
        nbi = min(len(data), bi + iters_to_accumulate)
        # inner loop over the items of the mini-batch, accumulating gradients
        for i in range(bi, nbi):
            inputs, targets = next(data_iter)
            with autocast():
                output = model(inputs)
                loss = loss_fn(output, targets)
                loss = loss / (nbi - bi)  # divide by the true batch size
            # Accumulates scaled gradients.
            scaler.scale(loss).backward()
        # done with the mini-batch loop: gradients were accumulated, so we can make an optimization step.
        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        bi = nbi
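
The index bookkeeping of the outer loop can be checked in isolation, without a model or a GPU. The sketch below uses a hypothetical helper, accumulation_ranges (not part of PyTorch), and assumes 10 batches with iters_to_accumulate = 4:

```python
def accumulation_ranges(num_batches, iters_to_accumulate):
    """Yield (start, stop) batch-index ranges, one per optimizer step."""
    bi = 0
    while bi < num_batches:
        nbi = min(num_batches, bi + iters_to_accumulate)
        yield bi, nbi
        bi = nbi


# Assumed values for illustration: 10 batches, accumulate over 4 at a time.
ranges = list(accumulation_ranges(10, 4))
divisors = [stop - start for start, stop in ranges]

print(ranges)    # [(0, 4), (4, 8), (8, 10)]
print(divisors)  # [4, 4, 2]
```

The last divisor is 2 rather than 4, which is exactly the value that nbi - bi takes in the final pass of the training loop.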