I am trying to do something fairly simple: read a large CSV file into a pandas DataFrame.
data = pandas.read_csv(filepath, header=0, sep=DELIMITER, skiprows=2)
The code either fails with a MemoryError or just never finishes.
Memory usage in the Task Manager stopped at 506 MB, and after 5 minutes of no change and no CPU activity in the process, I stopped it.
I am using pandas version 0.11.0.
I am aware that there used to be a memory problem with the file parser, but according to http://wesmckinney.com/blog/?p=543 this should have been fixed.
The file I am trying to read is 366 MB; the code above works if I cut the file down to something short (25 MB).
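In case it is relevant: the obvious fallback I could try is a chunked read. A minimal sketch of what I mean (filepath and DELIMITER are placeholders for my real values, and I do not know yet whether this dodges whatever allocation fails below):

import pandas

filepath = "bigfile.csv"   # placeholder for my real 366 MB file
DELIMITER = ";"            # placeholder for my real separator

# Parse the file in slices of 100,000 rows instead of in one shot.
chunks = pandas.read_csv(filepath, header=0, sep=DELIMITER, skiprows=2,
                         chunksize=100000)

# Stitch the slices back together. The final frame still has to fit in
# memory, so this only helps if the one-shot parse needs more than that.
data = pandas.concat(list(chunks), ignore_index=True)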
It has also happened that I get a pop-up telling me that it can't write to address 0x1e0baf93...
Stacktrace:

Traceback (most recent call last):
  File "F:\QA ALM\Python\new WIM data\new WIM data\new_WIM_data.py", line 25, in <module>
    wimdata = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 643, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 394, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 525, in _init_dict
    dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 5338, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1820, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1872, in form_blocks
    float_blocks = _multi_blockify(float_items, items)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1930, in _multi_blockify
    block_items, values = _stack_arrays(list(tup_block), ref_items, dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1962, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
Press any key to continue . . .
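If I read the trace right, it dies in np.empty when pandas stacks all the float columns into a single contiguous block. A back-of-envelope check of what that one allocation could cost (the column and row counts are made up, since I have not counted mine):

import numpy as np

# Hypothetical dimensions: 20 float columns and 4.5 million rows.
n_cols, n_rows = 20, 4500000

# _stack_arrays allocates one (n_cols, n_rows) float64 array for the block.
bytes_needed = n_cols * n_rows * np.dtype(np.float64).itemsize
print(bytes_needed / 2.0**20)   # ~686 MB in a single contiguous allocation

If this is 32-bit Python, a single allocation of that size can fail from address-space fragmentation even when plenty of RAM is free.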
A bit of background: I am trying to convince people that Python can do the same things as R. To that end, I am trying to replicate an R script that does
data <- read.table(paste(INPUTDIR, config[i,]$TOEXTRACT, sep=""), HASHEADER, DELIMITER, skip=2, fill=TRUE)
R not only manages to read the above file just fine; it even reads several of these files in a for loop (and then does some stuff with the data). If Python really does have a problem with files of that size, I might be fighting a losing battle...
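For reference, the Python translation I am aiming for is just the straightforward equivalent of that loop. A sketch, with INPUTDIR and the file list standing in for the config frame in the R script:

import os
import pandas

INPUTDIR = "F:/input"            # placeholder, mirrors INPUTDIR in the R script
DELIMITER = ";"                  # placeholder for the real separator
to_extract = ["a.csv", "b.csv"]  # placeholder for the config[i,]$TOEXTRACT values

for name in to_extract:
    data = pandas.read_csv(os.path.join(INPUTDIR, name),
                           header=0, sep=DELIMITER, skiprows=2)
    # ... do some stuff with the data, as the R script does ...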