The issue here is that there's a fixed cost overhead to each session.run call, so filling the queue with many tiny examples one at a time will be slow. In particular, each session.run call costs about 100-200 usec, so you can only do about 5k-10k session.run calls per second.
This problem is obvious with Python profiling (python -m cProfile), but hard to see if you start from a timeline profile or a CPU profile.
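As a rough illustration, the minimal sketch below (assuming the TF 1.x Session API; the op and the loop count are just placeholders) measures that fixed per-call overhead by running a trivial op in a tight loop:

import time
import tensorflow as tf

# Trivial op: running it repeatedly measures mostly the per-call session.run overhead.
x = tf.constant(1)

with tf.Session() as sess:
    n = 10000
    start = time.time()
    for _ in range(n):
        sess.run(x)
    elapsed = time.time() - start
    print("%.1f usec per session.run, %.0f calls/sec"
          % (elapsed / n * 1e6, n / elapsed))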
The work-around is to use enqueue_many to add things to your queue in batches. I took your benchmark from https://gist.github.com/ericyue/7705407a88e643f7ab380c6658f641e8 and modified it to enqueue many items per .run call, and that gives a 10x speed-up.
The modification is to change the tf.train.shuffle_batch call as follows:
if enqueue_many:
    # Read several serialized records per session.run and hand them to
    # shuffle_batch as one batch (enqueue_many=True).
    reader = tf.TFRecordReader(options=tf.python_io.TFRecordOptions(
        tf.python_io.TFRecordCompressionType.ZLIB))
    queue_batch = []
    for i in range(enqueue_many_size):
        _, serialized_example = reader.read(filename_queue)
        queue_batch.append(serialized_example)
    batch_serialized_example = tf.train.shuffle_batch(
        [queue_batch],
        batch_size=batch_size,
        num_threads=thread_number,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        enqueue_many=True)
For the complete source, see:
https://github.com/yaroslavvb/stuff/blob/master/ericyue-slowreader/benchmark.py
It's hard to optimize this to go much faster, since most of the time is now spent in queue operations. Looking at a stripped-down version which just adds integers to a queue, you get similar speed, and looking at the timeline, the time is spent in dequeue ops.
Each dequeue op takes about 60 usec, but on average there are 5 running in parallel, so you get about 12 usec per dequeue. That means you'll get fewer than 200k examples per second in the best case.
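A stripped-down benchmark of that kind might look like the sketch below (assuming the TF 1.x queue API; the sizes, thread count, and iteration counts are arbitrary placeholders, not taken from the linked code). It pushes plain integers through the same shuffle_batch machinery, so there is no file reading or parsing and the remaining cost is the queue ops themselves:

import time
import tensorflow as tf

# 1000 dummy integer examples handed to the queue per enqueue step.
examples = tf.zeros([1000], dtype=tf.int32)
batch = tf.train.shuffle_batch([examples], batch_size=100, num_threads=5,
                               capacity=20000, min_after_dequeue=2000,
                               enqueue_many=True)

with tf.Session() as sess:
    tf.train.start_queue_runners(sess=sess)   # enqueue threads run as daemons
    start = time.time()
    for _ in range(1000):                     # 1000 * 100 = 100k examples
        sess.run(batch)
    elapsed = time.time() - start
    print("%.0f examples/sec" % (100000 / elapsed))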