I'm trying to use the TensorFlow Dataset API to read an HDF5 file, using the from_generator method. Everything works fine unless the batch size does not evenly divide the number of events. I don't quite see how to make a flexible batch size using the API.
If things don't divide evenly, you get errors like:
2018-08-31 13:47:34.274303: W tensorflow/core/framework/op_kernel.cc:1263] Invalid argument: ValueError: `generator` yielded an element of shape (1, 28, 28, 1) where an element of shape (11, 28, 28, 1) was expected.
Traceback (most recent call last):
  File "/Users/perdue/miniconda3/envs/py3a/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 206, in __call__
    ret = func(*args)
  File "/Users/perdue/miniconda3/envs/py3a/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 452, in generator_py_func
    "of shape %s was expected." % (ret_array.shape, expected_shape))
ValueError: `generator` yielded an element of shape (1, 28, 28, 1) where an element of shape (11, 28, 28, 1) was expected.
I have a script that reproduces the error (with instructions for getting the required data file, Fashion MNIST, which is several MB) here:
https://gist.github.com/gnperdue/b905a9c2dd4c08b53e0539d6aa3d3dc6
The most important code is probably:
def make_fashion_dset(file_name, batch_size, shuffle=False):
    dgen = _make_fashion_generator_fn(file_name, batch_size)
    features_shape = [batch_size, 28, 28, 1]
    labels_shape = [batch_size, 10]
    ds = tf.data.Dataset.from_generator(
        dgen, (tf.float32, tf.uint8),
        (tf.TensorShape(features_shape), tf.TensorShape(labels_shape))
    )
    ...
where dgen is a generator function reading from the HDF5 file:
def _make_fashion_generator_fn(file_name, batch_size):
    reader = FashionHDF5Reader(file_name)
    nevents = reader.openf()
    def example_generator_fn():
        start_idx, stop_idx = 0, batch_size
        while True:
            if start_idx >= nevents:
                reader.closef()
                return
            yield reader.get_examples(start_idx, stop_idx)
            start_idx, stop_idx = start_idx + batch_size, stop_idx + batch_size
    return example_generator_fn
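To make the short final batch concrete, here is a small sketch of the index arithmetic in the loop above, with no TensorFlow or HDF5 involved. The helper name batch_bounds and the numbers (23 events, batch size 11) are made up for illustration, and clipping the stop index to nevents is an assumption about how the reader's slicing behaves at the end of the file.

```python
def batch_bounds(nevents, batch_size):
    """Yield the (start, stop) index pairs the generator loop walks through.

    The stop index is clipped to nevents, which is an assumption about how
    the reader slices when asked for rows past the end of the file.
    """
    start_idx, stop_idx = 0, batch_size
    while start_idx < nevents:
        yield start_idx, min(stop_idx, nevents)
        start_idx, stop_idx = start_idx + batch_size, stop_idx + batch_size


# 23 events with a batch size of 11 leaves a single event for the last batch.
sizes = [stop - start for start, stop in batch_bounds(nevents=23, batch_size=11)]
print(sizes)  # [11, 11, 1]
```

Whenever the batch size does not divide the event count, the last pair covers fewer rows than batch_size, which matches the (1, 28, 28, 1) element in the error.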
The core of the problem is that we have to declare the tensor shapes in from_generator, but we need the flexibility to change that shape down the line while iterating.
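One thing that may be worth trying is leaving the batch dimension undeclared by passing None as the first entry of each tf.TensorShape, so that only the per-example dimensions are fixed. The sketch below is a guess at how that could look, not a tested fix for the script in the gist: it assumes TensorFlow 2.x with eager execution (on 1.x you would pull elements through an iterator and a session instead), and it replaces the HDF5 reader with a hypothetical in-memory generator, make_toy_generator_fn, holding 23 zero-filled events.

```python
import numpy as np
import tensorflow as tf


def make_toy_generator_fn(nevents, batch_size):
    """Hypothetical stand-in for the HDF5-backed generator function.

    Yields (features, labels) batches sliced from in-memory arrays.
    """
    features = np.zeros((nevents, 28, 28, 1), dtype=np.float32)
    labels = np.zeros((nevents, 10), dtype=np.uint8)

    def example_generator_fn():
        for start_idx in range(0, nevents, batch_size):
            stop_idx = start_idx + batch_size
            # Slicing past the end of a NumPy array returns the remainder.
            yield features[start_idx:stop_idx], labels[start_idx:stop_idx]

    return example_generator_fn


dgen = make_toy_generator_fn(nevents=23, batch_size=11)
ds = tf.data.Dataset.from_generator(
    dgen, (tf.float32, tf.uint8),
    # None leaves the batch dimension unconstrained; the rest stay fixed.
    (tf.TensorShape([None, 28, 28, 1]), tf.TensorShape([None, 10]))
)

batch_sizes = [int(f.shape[0]) for f, _ in ds]
print(batch_sizes)  # [11, 11, 1]
```

The trade-off is that the static shape of each batch then has an unknown leading dimension, so any downstream code that reads the batch size from the static shape would need to use tf.shape at run time instead.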
There are some workarounds: drop the last few samples to get an even division, or just use a batch size of 1. But the first is bad if you can't lose any samples, and a batch size of 1 is very slow.
Any ideas or comments? Thanks!