
python - Memory error when using pandas read_csv

I am trying to do something fairly simple: reading a large CSV file into a pandas DataFrame.

data = pandas.read_csv(filepath, header=0, sep=DELIMITER, skiprows=2)

The code either fails with a MemoryError, or just never finishes.

Memory usage in Task Manager stopped at 506 MB, and after 5 minutes with no change and no CPU activity in the process, I stopped it.

I am using pandas version 0.11.0.

I am aware that there used to be a memory problem with the file parser, but according to http://wesmckinney.com/blog/?p=543 this should have been fixed.

The file I am trying to read is 366 MB; the code above works if I cut the file down to something shorter (25 MB).

It has also happened that I get a pop-up telling me that it can't write to address 0x1e0baf93...

Stacktrace:

Traceback (most recent call last):
  File "F:\QA ALM\Python\new WIM data\new WIM data\new_WIM_data.py", line 25, in <module>
    wimdata = pandas.read_csv(filepath, header = 0, sep = DELIMITER,skiprows = 2)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 401, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 216, in _read
    return parser.read()
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\io\parsers.py", line 643, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 394, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 525, in _init_dict
    dtype=dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\frame.py", line 5338, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1820, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1872, in form_blocks
    float_blocks = _multi_blockify(float_items, items)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1930, in _multi_blockify
    block_items, values = _stack_arrays(list(tup_block), ref_items, dtype)
  File "C:\Program Files\Python\Anaconda\lib\site-packages\pandas\core\internals.py", line 1962, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
Press any key to continue . . .

A bit of background: I am trying to convince people that Python can do the same as R. For this I am trying to replicate an R script that does

data <- read.table(paste(INPUTDIR, config[i,]$TOEXTRACT, sep=""), HASHEADER, DELIMITER, skip=2, fill=TRUE)

R not only manages to read the above file just fine, it even reads several of these files in a for loop (and then does some stuff with the data). If Python does have a problem with files of that size I might be fighting a losing battle...

1 Reply


Windows memory limitation

Memory errors happen a lot with Python when using the 32-bit version on Windows. This is because 32-bit processes only get 2 GB of memory to play with by default.
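
If you are not sure which build you are running, a quick check using only the standard library (a minimal sketch; struct.calcsize("P") reports the pointer size in bytes, and sys.maxsize tops out at 2**31 - 1 on 32-bit builds):

import struct
import sys

# Pointer size in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one.
print(struct.calcsize("P") * 8)

# Equivalent check: only True on 64-bit builds.
print(sys.maxsize > 2**32)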

Tricks for lowering memory usage

If you are not using 32-bit Python on Windows but are looking to improve memory efficiency while reading CSV files, there is a trick.

The pandas.read_csv function takes an option called dtype. This lets pandas know what types exist inside your CSV data.

How this works

By default, pandas will try to guess what dtypes your CSV file contains. This is a very heavy operation, because while it is determining the dtype, it has to keep all raw data as objects (strings) in memory.

Example

Let's say your CSV looks like this:

name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01

This example is of course no problem to read into memory, but it's just an example.

If pandas were to read the above CSV file without any dtype option, the age would be stored as strings in memory until pandas had read enough lines of the CSV file to make a qualified guess.

I think the default in pandas is to read 1,000,000 rows before guessing the dtype.

Solution

Specifying dtype={'age': int} as an option to .read_csv() lets pandas know that age should be interpreted as a number. This saves you lots of memory.
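
A minimal sketch of what that call could look like (the file name people.csv is a placeholder for the sample above):

import pandas as pd

# 'people.csv' stands in for the sample file shown above.
df = pd.read_csv(
    'people.csv',
    dtype={'age': int},       # parse age directly as integers
    skipinitialspace=True,    # strip the space that follows each comma
)
print(df.dtypes)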

Problem with corrupt data

However, if your CSV file is corrupted, like this:

name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz

Then specifying dtype={'age':int} will break the .read_csv() command, because it cannot cast "40+" to int. So sanitize your data carefully!
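
If you cannot sanitize up front, one common workaround (my own sketch, not part of the original answer; it needs a reasonably recent pandas for pd.to_numeric) is to read the suspect column as strings and coerce it afterwards:

import pandas as pd

# Read age as plain strings first, so read_csv cannot fail on bad values.
df = pd.read_csv('people.csv', dtype={'age': str}, skipinitialspace=True)

# errors='coerce' replaces values such as '40+' with NaN instead of raising.
df['age'] = pd.to_numeric(df['age'], errors='coerce')

Rows that failed to parse then show up as NaN and can be inspected or dropped.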

Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:

Try it yourself

import resource  # Unix-only; Windows needs a different memory probe
import numpy as np
import pandas as pd

# Floats stored as strings (object dtype); run each half in a fresh
# interpreter, since ru_maxrss only ever grows within one process.
df = pd.DataFrame(np.random.choice(['1.0', '0.6666667', '150000.1'], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 224544 (~224 MB; ru_maxrss is reported in kilobytes on Linux)

# The same values stored as actual floats
df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 79560 (~79 MB)
