Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes · 473 views · in Technique by (71.8m points)

dataframe - Unable to allocate 4.03 GiB for an array with shape (19321328, 28) and data type object

I have CSV files with 28 columns and >20,000 rows each. I have 240 such files, amounting to ~3 GB in total. I need to read all these CSV files and append them into a single DataFrame.

When I was working with fewer files, the code ran successfully. Now that I am working with all the files, I get this error:

time:  130.8604452610016 sec

MemoryError: Unable to allocate 4.03 GiB for an array with shape (19321328, 28) and data type object
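For context on where the 4.03 GiB figure comes from: reading with header=None and mixed column types yields dtype object, and an object-dtype NumPy array stores an 8-byte pointer per cell (on 64-bit Python) for the array itself, before even counting the string payloads:

```python
# Pointer storage alone for an object array of the failing shape.
# (The Python string objects the pointers reference are extra.)
rows, cols = 19321328, 28
gib = rows * cols * 8 / 2**30
print(round(gib, 2))  # 4.03
```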

Also, the system hangs and I have to restart my machine.

This is what I have done so far:

import glob
import os
import time

import numpy as np
import pandas as pd

path = r'C:\Users\Sakshi Sharma\.spyder-py3\filter'
#path = r'H:\T & F Safe Data\filter Data\utm log'  # when trying to read data from an external hard drive
allFiles = glob.glob(os.path.join(path, "*.csv"))

np_array_list = []
start = time.time()
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=None, low_memory=False)
    np_array_list.append(df.to_numpy())
end = time.time()
print("time: ",(end-start),"sec")

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)
big_frame.to_csv(r"C:\Users\SAKSHI SHARMA\.spyder-py3\Test.csv", index=False, header=None)
print(big_frame)

The output I get when running a smaller number of CSV files is as follows:

time:  1.5234270095825195 sec
         0           1       2   ...   25             26     27
0       NTP  1593577703  accept  ...  123  192.168.251.1    784
1       NTP  1593577704  accept  ...  123  192.168.251.1  56370
2       NTP  1593577704  accept  ...  123  192.168.251.1   7081
3       NTP  1593577704  accept  ...  123  192.168.251.1  46782
4       NTP  1593577704  accept  ...  123  192.168.251.1  38699
    ...         ...     ...  ...  ...            ...    ...
251154  NTP  1593602413  accept  ...  123  192.168.251.1  64161
251155  NTP  1593602413  accept  ...  123  192.168.251.1  30659
251156  NTP  1593602413  accept  ...  123  192.168.251.1  49763
251157  NTP  1593602413  accept  ...  123  192.168.251.1  56146
251158  NTP  1593602414  accept  ...  123  192.168.251.1    796

[251159 rows x 28 columns]

Can someone please tell me how I should read such a large set of CSVs? Also, does pointing the path at an external hard drive where I have stored the CSV files cause any problem?

path = r'C:\Users\Sakshi Sharma\.spyder-py3\utm log'
    or
path = r'H:\SAKSHI SHARMA\utm log'

I tried this with fewer files and the time taken was almost the same in both cases. But does it matter when the file size is huge?
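For illustration, here is a memory-leaner variant of the loop above, sketched on tiny stand-in files (the paths and data here are hypothetical): each file is appended to the output CSV as it is read, so only one file's rows are ever in memory at a time, and the object-array stacking step disappears entirely.

```python
import glob
import os
import tempfile

import pandas as pd

# Build two small CSV files in a temp folder to stand in for the
# real 240-file dataset (hypothetical data, for illustration only).
tmp = tempfile.mkdtemp()
for name, n_rows in [("a.csv", 3), ("b.csv", 2)]:
    pd.DataFrame([["NTP", 123]] * n_rows).to_csv(
        os.path.join(tmp, name), index=False, header=False)

out_path = os.path.join(tmp, "combined.csv")
all_files = sorted(glob.glob(os.path.join(tmp, "*.csv")))

# Append each input file to the output as it is read, so only one
# file's rows are held in memory at any time.
for i, file_ in enumerate(all_files):
    df = pd.read_csv(file_, header=None)
    df.to_csv(out_path, mode="w" if i == 0 else "a",
              index=False, header=False)

combined = pd.read_csv(out_path, header=None)
print(combined.shape)  # (5, 2): all rows from both input files
```

If the combined frame is still needed in memory, passing the per-file frames straight to pd.concat (rather than round-tripping through df.to_numpy() and np.vstack) also avoids the extra object-array copy.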



1 Reply

0 votes
by (71.8m points)
Waiting for an expert to reply.
