currently I'm working on image retrieval with Python. The keypoints and descriptors extracted from an image in this example are represented as numpy.array
s. The first one of shape (2000, 5) and the latter of shape (2000, 128). Both containing only values of dtype=numpy.float32
.
So, I was wondering which format to use in order to save my extracted keypoints and descriptors. I.e. I'm always saving 2 files: one for the keypoints and one for the descriptors - this counts as one step in my measurements. I compared pickle
, cPickle
(both with protocol 0 and 2) and NumPy's binary format .pny
and the results are really confusing me:
I always thought cPickle
is supposed to be faster than the pickle
module. But especially the load time with protocol 0 really sticks out in the results.
Does anyone have an explanation for this? Is it because I'm only using numeric data? Seems strange...
PS: In my code I'm basically looping 1000 times (number=1000
) over each technique and average the measured time in the end:
timer = time.time
print 'npy save...'
t0 = timer()
for i in range(number):
numpy.save(npy_kp_path, kp)
numpy.save(npy_descr_path, descr)
t1 = timer()
results['npy']['save'] = t1 - t0
print 'npy load...'
t0 = timer()
for i in range(number):
kp = numpy.load(npy_kp_path)
descr = numpy.load(npy_descr_path)
t1 = timer()
results['npy']['load'] = t1 - t0
print 'pickle protocol 0 save...'
t0 = timer()
for i in range(number):
with open(pkl0_descr_path, 'wb') as f:
pickle.dump(descr, f, protocol=0)
with open(pkl0_kp_path, 'wb') as f:
pickle.dump(kp, f, protocol=0)
t1 = timer()
results['pkl0']['save'] = t1 - t0
print 'pickle protocol 0 load...'
t0 = timer()
for i in range(number):
with open(pkl0_descr_path, 'rb') as f:
descr = pickle.load(f)
with open(pkl0_kp_path, 'rb') as f:
kp = pickle.load(f)
t1 = timer()
results['pkl0']['load'] = t1 - t0
print 'cPickle protocol 0 save...'
t0 = timer()
for i in range(number):
with open(cpkl0_descr_path, 'wb') as f:
cPickle.dump(descr, f, protocol=0)
with open(cpkl0_kp_path, 'wb') as f:
cPickle.dump(kp, f, protocol=0)
t1 = timer()
results['cpkl0']['save'] = t1 - t0
print 'cPickle protocol 0 load...'
t0 = timer()
for i in range(number):
with open(cpkl0_descr_path, 'rb') as f:
descr = cPickle.load(f)
with open(cpkl0_kp_path, 'rb') as f:
kp = cPickle.load(f)
t1 = timer()
results['cpkl0']['load'] = t1 - t0
print 'pickle highest protocol (2) save...'
t0 = timer()
for i in range(number):
with open(pkl2_descr_path, 'wb') as f:
pickle.dump(descr, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(pkl2_kp_path, 'wb') as f:
pickle.dump(kp, f, protocol=pickle.HIGHEST_PROTOCOL)
t1 = timer()
results['pkl2']['save'] = t1 - t0
print 'pickle highest protocol (2) load...'
t0 = timer()
for i in range(number):
with open(pkl2_descr_path, 'rb') as f:
descr = pickle.load(f)
with open(pkl2_kp_path, 'rb') as f:
kp = pickle.load(f)
t1 = timer()
results['pkl2']['load'] = t1 - t0
print 'cPickle highest protocol (2) save...'
t0 = timer()
for i in range(number):
with open(cpkl2_descr_path, 'wb') as f:
cPickle.dump(descr, f, protocol=cPickle.HIGHEST_PROTOCOL)
with open(cpkl2_kp_path, 'wb') as f:
cPickle.dump(kp, f, protocol=cPickle.HIGHEST_PROTOCOL)
t1 = timer()
results['cpkl2']['save'] = t1 - t0
print 'cPickle highest protocol (2) load...'
t0 = timer()
for i in range(number):
with open(cpkl2_descr_path, 'rb') as f:
descr = cPickle.load(f)
with open(cpkl2_kp_path, 'rb') as f:
kp = cPickle.load(f)
t1 = timer()
results['cpkl2']['load'] = t1 - t0
See Question&Answers more detail:
os