I am working with a large dataset in CSV format. I am trying to process the data column-by-column, then append the data to a frame in an HDF file. All of this is done using Pandas. My motivation is that, while the entire dataset is much bigger than my physical memory, the column size is managable. At a later stage I will be performing feature-wise logistic regression by loading the columns back into memory one by one and operating on them.
I am able to make a new HDF file and make a new frame with the first column:
hdf_file = pandas.HDFStore('train_data.hdf')
feature_column = pandas.read_csv('data.csv', usecols=[0])
hdf_file.append('features', feature_column)
But after that, I get a ValueError when trying to append a new column to the frame:
feature_column = pandas.read_csv('data.csv', usecols=[1])
hdf_file.append('features', feature_column)
Stack trace and error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 658, in append self._write_to_group(key, value, table=True, append=True, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 923, in _write_to_group s.write(obj = value, append=append, complib=complib, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2985, in write **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2675, in create_axes raise ValueError("cannot match existing table structure for [%s] on appending data" % items)
ValueError: cannot match existing table structure for [srch_id] on appending data
I am new to working with large datasets and limited memory, so I am open to suggestions for alternate ways to work with this data.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…