Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
python - PyTorch - Convert CIFAR dataset to `TensorDataset`

I train ResNet34 on the CIFAR dataset. For a certain reason, I need to convert the dataset into a TensorDataset. My solution is based on this answer: https://stackoverflow.com/a/44475689/15072863, with some differences (maybe they are critical, but I don't see why). It looks like I'm not doing this correctly.

Train loader:

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))])

train_ds = torchvision.datasets.CIFAR10('/files/', train=True, transform=transform_train, download=True)

xs, ys = [], []
for x, y in train_ds:
  xs.append(x)
  ys.append(y)

# 1) Standard Version
# cifar_train_loader = DataLoader(train_ds, batch_size=batch_size_train, shuffle=True, num_workers=num_workers)

# 2) TensorDataset version, seems to be incorrect
cifar_tensor_ds = TensorDataset(torch.stack(xs), torch.tensor(ys, dtype=torch.long))
cifar_train_loader = DataLoader(cifar_tensor_ds, batch_size=batch_size_train, shuffle=True, num_workers=num_workers)

I don't think it matters, but the test loader is defined as usual:

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

cifar_test_loader = DataLoader(
  torchvision.datasets.CIFAR10('/files/', train=False, transform=transform_test, download=True),
  batch_size=batch_size_test, shuffle=False, num_workers=num_workers)

I know that something is wrong with how I use TensorDataset, since:

  1. With TensorDataset I achieve 100% train accuracy, 80% test accuracy
  2. With the standard Dataset I achieve 99% train accuracy (never 100%), 90% test accuracy.

So, what am I doing wrong?

P.S.: My final goal is to split the dataset into 10 datasets based on class. Is there a better way to do this? Of course, I could define my own subclass of Dataset, but manually splitting it and creating TensorDatasets seemed simpler.

question from:https://stackoverflow.com/questions/65925371/pytorch-convert-cifar-dataset-to-tensordataset


1 Reply


When using the "standard" dataset, a random transform (flip + crop) is applied to each image every time it is loaded. As a consequence, virtually every image of every epoch is unique, seen only once. So you effectively have nb_epochs * len(dataset) different inputs.

With your custom dataset, you first read all the images of the CIFAR dataset (each with one random transform applied, once), store them all, and then use the stored tensor as your training inputs. Thus at each epoch, the network sees exactly the same inputs.

Since the network was already able to achieve great accuracy with the random transformations, removing them makes the training set even easier to fit, which is why train accuracy climbs to 100% while test accuracy drops: the frozen inputs are memorized instead of generalized.
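To see the difference concretely, here is a minimal sketch (names are illustrative, not from the question's code): a map-style dataset that applies a random flip on every access, versus a TensorDataset built from a single pass over it. The stored tensors never change; the map-style dataset produces fresh augmented views.

```python
import torch
from torch.utils.data import Dataset, TensorDataset

class RandomFlipDataset(Dataset):
    """Toy map-style dataset: applies a random horizontal flip on every access."""
    def __init__(self, images):
        self.images = images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        if torch.rand(1).item() < 0.5:
            img = torch.flip(img, dims=[2])  # flip along the width dimension
        return img

images = torch.arange(2 * 3 * 4 * 4, dtype=torch.float32).reshape(2, 3, 4, 4)
live = RandomFlipDataset(images)

# "Frozen" version: augmentations are applied once, then the results are stored.
frozen = TensorDataset(torch.stack([live[i] for i in range(len(live))]))
a = frozen[0][0]
b = frozen[0][0]
print(torch.equal(a, b))  # True: every epoch sees the identical tensor

# Map-style version: repeated accesses yield different augmented views.
views = {tuple(live[0].flatten().tolist()) for _ in range(50)}
print(len(views))  # 2 distinct views (flipped and unflipped), with overwhelming probability
```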

Oh, and you should definitely define your own subclass of Dataset. It's not even complicated, and it will be much easier to work with. You just need to extract the 10 different datasets, either by manually moving the images into per-class folders or by using some reindexing arrays or something like that. Either way, you will only have to do it once, so it's no big deal.
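The "reindexing arrays" idea can be sketched like this, assuming you have a labeled dataset and its label list (for CIFAR10 that would be train_ds and train_ds.targets; the helper name and the toy dataset below are illustrative). torch.utils.data.Subset wraps the original dataset around a list of indices without copying anything, so any per-access random transforms stay random:

```python
import torch
from torch.utils.data import Dataset, Subset

def split_by_class(dataset, labels, num_classes):
    """Return one Subset per class, sharing storage with the original dataset."""
    labels = torch.as_tensor(labels)
    return [Subset(dataset, torch.where(labels == c)[0].tolist())
            for c in range(num_classes)]

# Toy stand-in for a labeled dataset such as CIFAR10.
class ToyDataset(Dataset):
    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys
    def __len__(self):
        return len(self.xs)
    def __getitem__(self, i):
        return self.xs[i], self.ys[i]

ys = [0, 1, 0, 2, 1, 0]
toy = ToyDataset(torch.randn(6, 3), ys)
parts = split_by_class(toy, ys, num_classes=3)
print([len(p) for p in parts])  # [3, 2, 1]
```

Each Subset can then be fed to its own DataLoader, and because indexing is deferred to the wrapped dataset, the augmentation pipeline keeps producing fresh transforms every epoch.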

