The built-in .npy file format is perfectly fine for working with small datasets, without relying on any external modules other than numpy.
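For a point of comparison, a full round trip with .npy needs nothing beyond numpy itself (a minimal sketch; the array contents and filename are illustrative):

import numpy as np

a = np.random.rand(100, 200)
np.save('outarray.npy', a)   # writes the entire array to disk in one call
b = np.load('outarray.npy')  # reads the entire array back into memory
assert (a == b).all()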
However, once you start working with large amounts of data, a file format designed to handle such datasets, such as HDF5, is preferable [1].
For instance, below is a solution for saving numpy arrays in HDF5 with PyTables:
Step 1: Create an extendable EArray storage
import tables
import numpy as np

filename = 'outarray.h5'
ROW_SIZE = 100   # number of columns per row (the fixed dimension)
NUM_ROWS = 200   # number of rows to append

f = tables.open_file(filename, mode='w')
atom = tables.Float64Atom()

# The 0 in the shape tuple marks the first dimension as extendable
array_c = f.create_earray(f.root, 'data', atom, (0, ROW_SIZE))

for idx in range(NUM_ROWS):
    x = np.random.rand(1, ROW_SIZE)
    array_c.append(x)  # each append grows the first dimension by one row
f.close()
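To verify that the writes landed, you can reopen the file read-only and inspect the dataset's shape (an illustrative check only):

f = tables.open_file(filename, mode='r')
print(f.root.data.shape)  # (NUM_ROWS, ROW_SIZE), i.e. (200, 100)
f.close()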
Step 2: Append rows to an existing dataset (if needed)
f = tables.open_file(filename, mode='a')
x = np.random.rand(1, ROW_SIZE)  # new row(s); the column count must stay ROW_SIZE
f.root.data.append(x)
f.close()
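Because the first dimension was declared extendable in Step 1, append() keeps working across sessions and never rewrites the rows already on disk.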
Step 3: Read back a subset of the data
f = tables.open_file(filename, mode='r')
print(f.root.data[1:10, 2:20])  # e.g. read only this part of the dataset from disk
f.close()
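Unlike np.load, which pulls the whole array into memory, slicing f.root.data reads only the requested block from disk, so the same pattern scales to datasets larger than RAM.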