I've created a tuple generator that extracts information from a file, keeping only the records of interest and converting each one to a tuple, which the generator yields.
I've tried to create a DataFrame from it:
import pandas as pd
df = pd.DataFrame.from_records(tuple_generator, columns = tuple_fields_name_list)
but it throws an error:
...
C:\Anaconda\envs\py33\lib\site-packages\pandas\core\frame.py in from_records(cls, data, index, exclude, columns, coerce_float, nrows)
1046 values.append(row)
1047 i += 1
-> 1048 if i >= nrows:
1049 break
1050
TypeError: unorderable types: int() >= NoneType()
I got it to work by consuming the generator into a list first, but that uses twice the memory:
df = pd.DataFrame.from_records(list(tuple_generator), columns = tuple_fields_name_list)
The files I want to load are big, and memory consumption matters. On my last attempt, my computer spent two hours trying to grow virtual memory :(
The question: does anyone know a way to create a DataFrame directly from a record generator, without first converting it to a list?
Note: I'm using Python 3.3 and pandas 0.12 with Anaconda on Windows.
Update:
Reading the file is not the problem; my tuple generator does that well. It scans a compressed text file of intermixed records line by line, converts only the wanted data to the correct types, and yields the fields as tuples.
Some numbers: it scans 2111412 records in a 130MB gzip file (about 6.5GB uncompressed) in about a minute, using little memory.
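To make the setup concrete, here is a minimal sketch of the kind of generator I mean. The record layout (semicolon-separated fields, with the record type in the first field), the function name `iter_wanted_records` and the `wanted_type` parameter are all made up for illustration; my real file format is different.

```python
import gzip


def iter_wanted_records(path, wanted_type="A"):
    """Yield (id, value) tuples for records of one type in a gzip text file.

    Hypothetical layout: each line is "<type>;<id>;<value>".
    """
    # "rt" opens the gzip file in text mode, so lines arrive as str
    # and are decompressed lazily, one line at a time.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split(";")
            if fields[0] != wanted_type:
                continue  # skip the intermixed records we don't need
            # Convert to the final types here so the tuples are ready to load.
            yield int(fields[1]), float(fields[2])
```

Only one line and one tuple are alive at any moment, which is why the scan itself needs so little memory.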
Pandas 0.12 does not accept generators. The dev version does, but it puts the whole generator into a list and then converts that to a frame. That's not efficient, but it's something pandas has to deal with internally. Meanwhile, I have to think about buying some more memory.
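One workaround I can think of is to consume the generator in fixed-size chunks, so that only one chunk of Python tuples exists as a list at a time. This is only a sketch: the helper name `frame_from_records_chunked` and the default `chunk_size` are my own choices, not anything pandas provides, and I have not measured it on the real data.

```python
import itertools

import pandas as pd


def frame_from_records_chunked(records, columns, chunk_size=100000):
    """Build a DataFrame from an iterable of tuples, one chunk at a time."""
    iterator = iter(records)
    frames = []
    while True:
        # Materialize at most chunk_size tuples; the rest stay unread.
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            break
        frames.append(pd.DataFrame.from_records(chunk, columns=columns))
    if not frames:
        # Empty input: return an empty frame with the expected columns.
        return pd.DataFrame(columns=columns)
    # Stitch the per-chunk frames together with a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


df = frame_from_records_chunked(((i, i * 0.5) for i in range(10)),
                                columns=["a", "b"], chunk_size=3)
```

The final `pd.concat` still copies the data, so the peak is not free, but the per-chunk frames hold typed columns rather than millions of Python tuples, which should be considerably smaller for numeric fields.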