Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Handle invalid/corrupted image files in ImageDataGenerator.flow_from_directory in Keras

I am using Python with Keras, feeding my model from ImageDataGenerator via flow_from_directory. Some of my image files are problematic. Can I use the data generator to handle the read errors?

I am getting "not valid jpg file" errors on a small portion of the images and would like to handle them without my code crashing.



1 Reply


One solution is to modify the ImageDataGenerator code and add error handling (i.e. try/except) to it.

Alternatively, you can wrap your generator inside another generator and use try/except there. The disadvantage of this solution is that it throws away the whole generated batch even if only a single image in that batch is corrupted, which means some samples may never be used for training at all:

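# or tensorflow.keras.preprocessing.image, depending on your installation
from keras.preprocessing.image import ImageDataGenerator

data_gen = ImageDataGenerator(...)

train_gen = data_gen.flow_from_directory(...)

def my_gen(gen):
    # Wrap the Keras iterator and skip any batch that fails to load.
    while True:
        try:
            data, labels = next(gen)
        except Exception:
            # Catch Exception instead of using a bare except, so that
            # KeyboardInterrupt and GeneratorExit still propagate.
            continue
        yield data, labels

# ... define your model and compile it

# fit the model (newer Keras versions deprecate fit_generator;
# there, model.fit accepts a generator directly)
model.fit_generator(my_gen(train_gen), ...)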

Another disadvantage of this solution comes from the fact that you need to specify the number of generator steps (i.e. steps_per_epoch). When a batch is thrown away in a step, a new batch is fetched in that same step, so you may end up training on some of the samples more than once in an epoch. How much this matters depends on how many batches include corrupted images; if there are only a few, there is little to worry about.
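The repetition effect described above can be simulated without Keras. The following is only a minimal sketch: FlakyBatches is a hypothetical stand-in for the directory iterator (it cycles over fixed batches in order and raises on one of them, whereas the real iterator may reshuffle), and the batch names are made up for illustration.

```python
from collections import Counter


class FlakyBatches:
    """Hypothetical stand-in for a Keras directory iterator.

    Loops forever over a fixed list of batches and raises an error
    whenever it reaches the batch that contains a corrupted image.
    """

    def __init__(self, batches, bad_index):
        self.batches = batches
        self.bad_index = bad_index  # position of the batch that fails to load
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        idx = self.i % len(self.batches)
        self.i += 1
        if idx == self.bad_index:
            raise OSError("not a valid jpg file")
        return self.batches[idx]


def my_gen(gen):
    # Same wrapper idea as in the answer: skip batches that fail to load.
    while True:
        try:
            batch = next(gen)
        except Exception:
            continue
        yield batch


# Four batches per pass over the data; batch "b2" always fails to load.
safe = my_gen(FlakyBatches(["b0", "b1", "b2", "b3"], bad_index=2))

steps_per_epoch = 4  # the value you would give Keras for four batches
epoch = [next(safe) for _ in range(steps_per_epoch)]
counts = Counter(epoch)

print(epoch)   # batches actually delivered during one "epoch"
print(counts)  # how often each batch was delivered
```

In this simulated epoch the failing batch never appears, and the fourth step is filled by a batch that was already delivered earlier in the same epoch.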

Finally, note that you may want to use the newer Keras data generator, i.e. the Sequence class, to read the images of each batch one by one in the __getitem__ method and discard the corrupted ones. However, the problem of the previous approach, i.e. training on some of the images more than once, can still be present here: you also need to implement the __len__ method, which is essentially equivalent to the steps_per_epoch argument, so if you pull in replacement images to keep every batch full, some images may be used more than once per epoch. Still, in my opinion, subclassing Sequence is superior to the wrapper approach (if you put aside the fact that you need to write more code) and has fewer side effects, since you can discard a single image instead of the whole batch.
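A Sequence subclass along these lines might look like the sketch below. It is not a definitive implementation: SafeImageSequence, load_image, the loader parameter and the 224x224 target size are all hypothetical names and values chosen for illustration, the loader assumes Pillow is installed, and the fallback base class exists only so the snippet can be run without Keras. This version returns a smaller batch when an image is discarded rather than refilling it.

```python
import math

import numpy as np

try:
    # Assumption: TensorFlow's bundled Keras is installed.
    from tensorflow.keras.utils import Sequence
except ImportError:
    # Fallback base class so the sketch itself still runs without Keras.
    Sequence = object


def load_image(path, target_size=(224, 224)):
    """Hypothetical loader (assumes Pillow): decode, resize, scale to [0, 1]."""
    from PIL import Image  # imported lazily so Pillow is only needed here

    with Image.open(path) as img:
        img = img.convert("RGB").resize(target_size)
        return np.asarray(img, dtype="float32") / 255.0


class SafeImageSequence(Sequence):
    """Yields (images, labels) batches, dropping images that fail to load."""

    def __init__(self, paths, labels, batch_size, loader=load_image):
        super().__init__()
        self.paths = list(paths)
        self.labels = list(labels)
        self.batch_size = batch_size
        self.loader = loader

    def __len__(self):
        # Number of batches per epoch; plays the role of steps_per_epoch.
        return math.ceil(len(self.paths) / self.batch_size)

    def __getitem__(self, idx):
        start = idx * self.batch_size
        stop = start + self.batch_size
        images, kept_labels = [], []
        for path, label in zip(self.paths[start:stop], self.labels[start:stop]):
            try:
                images.append(self.loader(path))
            except (OSError, ValueError):
                continue  # discard only this image, not the whole batch
            kept_labels.append(label)
        if not images:
            # Not handled in this sketch: every image in the batch was unreadable.
            raise RuntimeError(f"no readable images in batch {idx}")
        return np.stack(images), np.asarray(kept_labels)
```

An instance built from your own lists of file paths and labels can then be passed to model.fit in place of the generator. Passing the loader as a parameter is a design choice made here so that the decoding step can be swapped out or tested without real image files.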

