I've been using pandas for research now for about two months to great effect. With large numbers of medium-sized trace event datasets, pandas + PyTables (the HDF5 interface) does a tremendous job of allowing me to process heterogeneous data using all the Python tools I know and love.
Generally speaking, I use the Fixed (formerly "Storer") format in PyTables, as my workflow is write-once, read-many, and many of my datasets are sized such that I can load 50-100 of them into memory at a time with no serious disadvantages. (NB: I do much of my work on Opteron server-class machines with 128GB+ system memory.)
However, for large datasets (500MB and greater), I would like to be able to use the more scalable random-access and query abilities of the PyTables "Tables" format, so that I can perform my queries out-of-memory and then load the much smaller result set into memory for processing. The big hurdle here, however, is the write performance. Yes, as I said, my workflow is write-once, read-many, but the relative times are still unacceptable.
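To make the goal concrete, here is a minimal sketch (with a toy DataFrame, and the string `where` syntax of recent pandas) of the out-of-memory query workflow I'm after:

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real trace data (which has ~29 million rows).
df = pd.DataFrame({'duration': np.arange(1000, dtype='int64'),
                   'node_id': np.zeros(1000, dtype='int64')})

with pd.HDFStore('events_demo.h5', 'w') as store:
    # Table format; data_columns makes 'duration' queryable on disk.
    store.append('events', df, data_columns=['duration'])
    # The where clause is evaluated out-of-core by PyTables/numexpr;
    # only the matching rows are loaded into memory.
    subset = store.select('events', where='duration > 990')
```

The point is that `select` pushes the predicate down to the on-disk table, so the in-memory result set stays small even when the file does not.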
As an example, I recently ran a large Cholesky factorization that took 3 minutes, 8 seconds (188 seconds) on my 48-core machine. This generated a trace file of ~2.2 GB; the trace is generated in parallel with the program, so there is no additional "trace creation time."
The initial conversion of my binary trace file into the pandas/PyTables format takes a decent chunk of time, but largely because the binary format is deliberately out of order, to reduce the performance impact of the trace generator itself. This is also irrelevant to the performance loss in moving from the Storer format to the Table format.
My tests were initially run with pandas 0.12, numpy 1.7.1, PyTables 2.4.0, and numexpr 0.20.1. My 48-core machine runs at 2.8 GHz per core, and I am writing to an ext3 filesystem which is probably (but not certainly) on an SSD.
I can write the entire dataset to a Storer format HDF5 file (resulting file size: 3.3 GB) in 7.1 seconds. The same dataset, written to the Table format (resulting file size is also 3.3 GB), takes 178.7 seconds to write.
The code is as follows:
with Timer() as t:
    store = pd.HDFStore('test_storer.h5', 'w')
    store.put('events', events_dataset, table=False, append=False)
    store.close()
print('Fixed format write took ' + str(t.interval))

with Timer() as t:
    store = pd.HDFStore('test_table.h5', 'w')
    store.put('events', events_dataset, table=True, append=False)
    store.close()
print('Table format write took ' + str(t.interval))
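(Timer here is just a small timing context manager, along these lines; a minimal sketch for completeness:)

```python
import time

class Timer:
    """Minimal timing context manager (assumed implementation)."""
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *exc):
        self.interval = time.time() - self.start

# Usage:
with Timer() as t:
    sum(range(100000))
```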
and the output is simply
Fixed format write took 7.1
Table format write took 178.7
My dataset has 28,880,943 rows, and the columns are basic datatypes:
node_id int64
thread_id int64
handle_id int64
type int64
begin int64
end int64
duration int64
flags int64
unique_id int64
id int64
DSTL_LS_FULL float64
L2_DMISS float64
L3_MISS float64
kernel_type float64
dtype: object
...so I don't think there should be any data-specific issues with the write speed.
I've also tried adding BLOSC compression, to rule out any strange I/O issues that might affect one scenario or the other, but compression seems to decrease the performance of both equally.
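For reference, the compression runs were of this shape (shown here with the `format=`/`complib=` keywords of pandas >= 0.13; under 0.12 the `table=` keyword applies instead):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 4), columns=list('abcd'))

# Fixed format with blosc compression
df.to_hdf('fixed_blosc.h5', 'events', mode='w', format='fixed',
          complevel=9, complib='blosc')

# Table format with blosc compression
df.to_hdf('table_blosc.h5', 'events', mode='w', format='table',
          complevel=9, complib='blosc')

back = pd.read_hdf('table_blosc.h5', 'events')
```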
Now, I realize that the pandas documentation says that the Storer format offers significantly faster writes, and slightly faster reads. (I do experience the faster reads, as a read of the Storer format seems to take around 2.5 seconds, while a read of the Table format takes around 10 seconds.) But it really seems excessive that the Table format write should take 25 times as long as the Storer format write.
Can any of the folks involved with PyTables or pandas explain the architectural (or other) reasons why writing to the queryable format (which clearly requires very little extra data) should take an order of magnitude longer? And is there any hope for improving this in the future? I'd love to jump in and contribute to one project or the other, as my field is high performance computing and I see a significant use case for both projects in this domain. But it would be helpful to get some clarification on the issues involved first, and/or some advice on how to speed things up from those who know how the system is built.
EDIT:
Running the above tests with %prun in IPython gives the following (somewhat reduced for readability) profile output for the Storer/Fixed format:
%prun -l 20 profile.events.to_hdf('test.h5', 'events', table=False, append=False)
3223 function calls (3222 primitive calls) in 7.385 seconds
Ordered by: internal time
List reduced from 208 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
6 7.127 1.188 7.128 1.188 {method '_createArray' of 'tables.hdf5Extension.Array' objects}
1 0.242 0.242 0.242 0.242 {method '_closeFile' of 'tables.hdf5Extension.File' objects}
1 0.003 0.003 0.003 0.003 {method '_g_new' of 'tables.hdf5Extension.File' objects}
46 0.001 0.000 0.001 0.000 {method 'reduce' of 'numpy.ufunc' objects}
and the following for the Tables format:
%prun -l 40 profile.events.to_hdf('test.h5', 'events', table=True, append=False, chunksize=1000000)
499082 function calls (499040 primitive calls) in 188.981 seconds
Ordered by: internal time
List reduced from 526 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
29 92.018 3.173 92.018 3.173 {pandas.lib.create_hdf_rows_2d}
640 20.987 0.033 20.987 0.033 {method '_append' of 'tables.hdf5Extension.Array' objects}
29 19.256 0.664 19.256 0.664 {method '_append_records' of 'tables.tableExtension.Table' objects}
406 19.182 0.047 19.182 0.047 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
14244 10.646 0.001 10.646 0.001 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
472 10.359 0.022 10.359 0.022 {method 'copy' of 'numpy.ndarray' objects}
80 3.409 0.043 3.409 0.043 {tables.indexesExtension.keysort}
2 3.023 1.512 3.023 1.512 common.py:134(_isnull_ndarraylike)
41 2.489 0.061 2.533 0.062 {method '_fillCol' of 'tables.tableExtension.Row' objects}
87 2.401 0.028 2.401 0.028 {method 'astype' of 'numpy.ndarray' objects}
30 1.880 0.063 1.880 0.063 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
282 0.824 0.003 0.824 0.003 {method 'reduce' of 'numpy.ufunc' objects}
41 0.537 0.013 0.668 0.016 index.py:607(final_idx32)
14490 0.385 0.000 0.712 0.000 array.py:342(_interpret_indexing)
39 0.279 0.007 19.635 0.503 index.py:1219(reorder_slice)
2 0.256 0.128 10.063 5.031 index.py:1099(get_neworder)
1 0.090 0.090 119.392 119.392 pytables.py:3016(write_data)
57842 0.087 0.000 0.087 0.000 {numpy.core.multiarray.empty}
28570 0.062 0.000 0.107 0.000 utils.py:42(is_idx)
14164 0.062 0.000 7.181 0.001 array.py:711(_readSlice)
EDIT 2:
Running again with a pre-release copy of pandas 0.13 (pulled Nov 20 2013 at about 11:00 EST), write times for the Tables format improve significantly but still don't compare "reasonably" to the write speeds of the Storer/Fixed format.
%prun -l 40 profile.events.to_hdf('test.h5', 'events', table=True, append=False, chunksize=1000000)
499748 function calls (499720 primitive calls) in 117.187 seconds
Ordered by: internal time
List reduced from 539 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
640 22.010 0.034 22.010 0.034 {method '_append' of 'tables.hdf5Extension.Array' objects}
29 20.782 0.717 20.782 0.717 {method '_append_records' of 'tables.tableExtension.Table' objects}
406 19.248 0.047 19.248 0.047 {method '_g_writeSlice' of 'tables.hdf5Extension.Array' objects}
14244 10.685 0.001 10.685 0.001 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
472 10.439 0.022 10.439 0.022 {method 'copy' of 'numpy.ndarray' objects}
30 7.356 0.245 7.356 0.245 {method '_g_flush' of 'tables.hdf5Extension.Leaf' objects}
29 7.161 0.247 37.609 1.297 pytables.py:3498(write_data_chunk)
2 3.888 1.944 3.888 1.944 common.py:197(_isnull_ndarraylike)
80 3.581 0.045 3.581 0.045 {tables.indexesExtension.keysort}
41 3.248 0.079 3.294 0.080 {method '_fillCol' of 'tables.tableExtension.Row' objects}
34 2.744 0.081 2.744 0.081 {method 'ravel' of 'numpy.ndarray' objects}
115 2.591 0.023 2.591 0.023 {method 'astype' of 'numpy.ndarray' objects}
270 0.875 0.003 0.875 0.003 {method 'reduce' of 'numpy.ufunc' objects}
41 0.560 0.014 0.732 0.018 index.py:607(final_idx32)
14490 0.387 0.000 0.712 0.000 array.py:342(_interpret_indexing)
39 0.303 0.008 19.617 0.503 index.py:1219(reorder_slice)
2 0.288 0.144 10.299 5.149 index.py:1099(get_neworder)
57871 0.087 0.000 0.087 0.000 {numpy.core.multiarray.empty}
1 0.084 0.084 45.266 45.266 pytables.py:3424(write_data)
1 0.080 0.080 55.542 55.542 pytables.py:3385(write)
I noticed while running these tests that there are long periods where writing seems to "pause" (the file on disk is not actively growing), and yet there is also low CPU usage during some of these periods.
I am beginning to suspect that some known ext3 limitations may interact badly with either pandas or PyTables. Ext3 and other non-extent-based filesystems sometimes struggle to unlink large files promptly, and similar behavior (low CPU usage but long wait times) is apparent even during a simple 'rm' of a 1 GB file, for instance.
To clarify, in each test case, I made sure to remove the existing file, if any, before starting the test, so as not to incur any ext3 file removal/overwrite penalty.
However, when re-running this test with index=None, performance improves drastically (~50 s versus ~120 s when indexing). So it would seem either that this process remains CPU-bound (my system has relatively old AMD Opteron Istanbul CPUs running at 2.8 GHz, though it does also have 8 sockets with a 6-core CPU in each, all but one of which, of course, sit idle during the write), or that there is some conflict between the way PyTables or pandas manipulates/reads/analyzes the file when it is already partially or fully on the filesystem, causing pathologically bad I/O behavior while the indexing occurs.
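As a workaround in the meantime, the indexing cost can be deferred: append with index=False, then build the index once at the end with create_table_index. A sketch (the optlevel/kind values here are just examples):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'duration': np.arange(100000, dtype='int64')})

with pd.HDFStore('deferred_index.h5', 'w') as store:
    # Skip per-chunk index maintenance during the write...
    store.append('events', df, index=False, chunksize=20000)
    # ...then build the index once, after all rows are on disk.
    store.create_table_index('events', optlevel=9, kind='full')

back = pd.read_hdf('deferred_index.h5', 'events')
```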
EDIT 3:
@Jeff's suggested tests on a smaller dataset (1.3 GB on disk), after upgrading PyTables from 2.4 to 3.0.0, have gotten me here:
In [7]: %timeit f(df)
1 loops, best of 3: 3.7 s per loop
In [8]: %timeit f2(df)   # chunksize=2000000
1 loops, best of 3: 13.8 s per loop
In [9]: %timeit f3(df)   # chunksize=2000000
1 loops, best of 3: 43.4 s per loop
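For context, my understanding of the three benchmarks (the actual definitions are in @Jeff's answer; this reconstruction is my own assumption): f is a Fixed-format write, f2 a Table append with indexing disabled, and f3 a Table append with the default indexing.

```python
import pandas as pd

def f(df):
    # Fixed-format write (hypothetical reconstruction)
    with pd.HDFStore('test.h5', 'w') as store:
        store.put('df', df)

def f2(df, chunksize=2000000):
    # Table-format append, index maintenance disabled
    with pd.HDFStore('test.h5', 'w') as store:
        store.append('df', df, index=False, chunksize=chunksize)

def f3(df, chunksize=2000000):
    # Table-format append with indexing (the default)
    with pd.HDFStore('test.h5', 'w') as store:
        store.append('df', df, chunksize=chunksize)
```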
In fact, my performance seems to beat his in all scenarios except for when indexing is turned on.