Use append=True
in the call to to_hdf
:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
# A B
# 0 0 1
# 1 2 3
# 2 4 5
# 3 6 7
# 4 8 9
# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df # allow df to be garbage collected
# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)
print(pd.read_hdf(filename, 'data'))
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
Note that you need to use format='table'
in the first call to df.to_hdf
to make the table appendable. Otherwise, the format is 'fixed'
by default, which is faster for reading and writing, but creates a table which can not be appended to.
Thus, you can process each CSV one at a time, use append=True
to build the hdf5 file. Then overwrite the DataFrame or use del df
to allow the old DataFrame to be garbage collected.
Alternatively, instead of calling df.to_hdf
, you could append to a HDFStore:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
store = pd.HDFStore(filename)
for i in range(2):
df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
store.append('data', df)
store.close()
store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…