I have CSV files with 28 columns and >20,000 rows.
I have 240 such files, which amount to ~3 GB in total.
I need to read all these CSV files and append them into a single DataFrame.
When I was working with fewer files, the code ran successfully. Now that I am working with all the files, I get this error -
time: 130.8604452610016 sec
MemoryError: Unable to allocate 4.03 GiB for an array with shape (19321328, 28) and data type object
Also, the system hangs and I have to restart my computer :((
This is what I have done till now -
import glob
import os
import time

import numpy as np
import pandas as pd

path = r'C:\Users\Sakshi Sharma\.spyder-py3\filter'
#path = r'H:\T & F Safe Data\filter Data\utm log'  # this is when I am trying to read data from an external hard drive
allFiles = glob.glob(os.path.join(path, "*.csv"))
np_array_list = []

start = time.time()
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=None, low_memory=False)
    np_array_list.append(df.to_numpy())   # each file becomes an object-dtype array
end = time.time()
print("time: ", (end - start), "sec")

comb_np_array = np.vstack(np_array_list)  # stacks everything into one more full copy
big_frame = pd.DataFrame(comb_np_array)
big_frame.to_csv(r'C:\Users\SAKSHI SHARMA\.spyder-py3\Test.csv', index=False, header=False)
print(big_frame)
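
A lower-memory variant I am considering (only a sketch, not yet verified on all 240 files): keep each file as a DataFrame and let pd.concat do the stacking, so the data never takes a detour through a single object-dtype NumPy array -

import glob
import os
import time

import pandas as pd

path = r'C:\Users\Sakshi Sharma\.spyder-py3\filter'
allFiles = glob.glob(os.path.join(path, "*.csv"))

start = time.time()
# Concatenate the DataFrames directly; ignore_index renumbers the rows,
# and the per-column dtypes inferred by read_csv are preserved.
big_frame = pd.concat((pd.read_csv(f, header=None) for f in allFiles),
                      ignore_index=True)
end = time.time()
print("time: ", (end - start), "sec")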
The output I get when running a smaller number of CSV files is as follows -
time: 1.5234270095825195 sec
0 1 2 ... 25 26 27
0 NTP 1593577703 accept ... 123 192.168.251.1 784
1 NTP 1593577704 accept ... 123 192.168.251.1 56370
2 NTP 1593577704 accept ... 123 192.168.251.1 7081
3 NTP 1593577704 accept ... 123 192.168.251.1 46782
4 NTP 1593577704 accept ... 123 192.168.251.1 38699
... ... ... ... ... ... ...
251154 NTP 1593602413 accept ... 123 192.168.251.1 64161
251155 NTP 1593602413 accept ... 123 192.168.251.1 30659
251156 NTP 1593602413 accept ... 123 192.168.251.1 49763
251157 NTP 1593602413 accept ... 123 192.168.251.1 56146
251158 NTP 1593602414 accept ... 123 192.168.251.1 796
[251159 rows x 28 columns]
Can someone please tell me what I should do to read such a large set of CSVs? Also, does giving the path to an external hard drive where I have stored the CSV files create any problem?
path = r'C:\Users\Sakshi Sharma\.spyder-py3\utm log'
or
path = r'H:\SAKSHI SHARMA\utm log'
I tried this with fewer files and the time taken was almost the same in both cases. But does it matter when the file size is huge?
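
In case it matters for an answer: if the combined frame does not actually have to sit in memory, would appending each file straight to the output CSV as it is read work? (Again only a sketch; I am assuming the mode='a' append behaviour of to_csv from the pandas docs.) Peak memory would then stay at roughly one file -

import glob
import os

import pandas as pd

src = r'C:\Users\Sakshi Sharma\.spyder-py3\filter'    # or the H:\ path above
out = r'C:\Users\SAKSHI SHARMA\.spyder-py3\Test.csv'

first = True
for file_ in glob.glob(os.path.join(src, "*.csv")):
    df = pd.read_csv(file_, index_col=None, header=None)
    # Overwrite on the first file, then append the remaining ones.
    df.to_csv(out, mode='w' if first else 'a', index=False, header=False)
    first = False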