I have about 30 GB of data (in a list of about 900 dataframes) that I am attempting to concatenate. The machine I am working with is a moderately powerful Linux box with about 256 GB of RAM. However, when I try to concatenate my files I quickly run out of available RAM. I have tried all sorts of workarounds (concatenating in smaller batches with for loops, etc.; a sketch of the batched approach follows the questions below) but I still cannot get these dataframes to concatenate. Two questions spring to mind:
1. Has anyone else dealt with this and found an effective workaround? I cannot use a straight append because I need the 'column merging' (for lack of a better word) functionality of the join='outer' argument in pd.concat().

2. Why is pandas concatenation (which, as I understand it, is just calling numpy.concatenate under the hood) so inefficient with its use of memory?
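For reference, the batched version of the concatenation looked roughly like this (a minimal sketch; datalist4 is my list of dataframes and the batch size of 50 is arbitrary):

import pandas as pd

# concatenate 50 frames at a time, then concatenate the partial results;
# this still exhausts memory for me
batch_size = 50
partials = []
for i in range(0, len(datalist4), batch_size):
    batch = datalist4[i:i + batch_size]
    partials.append(pd.concat(batch, join="outer", axis=0, ignore_index=True))
data = pd.concat(partials, join="outer", axis=0, ignore_index=True)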
I should also note that I do not think the problem is an explosion of columns: concatenating 100 of the dataframes together gives about 3,000 columns, whereas the base dataframe has about 1,000.
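A quick way to check the column counts (a sketch; datalist4 is the list of dataframes shown in the Edit below):

# the outer join widens the result to the union of all column names,
# so count the distinct names across the first 100 frames
all_cols = set()
for df in datalist4[:100]:
    all_cols.update(df.columns)
print(len(all_cols))  # about 3000, versus about 1000 in the base dataframe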
Edit:
The data I am working with is financial data, about 1,000 columns wide and about 50,000 rows deep for each of my 900 dataframes. The column types, going across left to right, are:

- date in string format
- string
- np.float
- int

...and so on, repeating. I am concatenating on column name with an outer join, which means that any columns in df2 that are not in df1 will not be discarded but shunted off to the side.
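To illustrate the 'column merging' I mean, here is a toy sketch with made-up frames (not my real data):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# join="outer" keeps the union of the columns; cells with no data become NaN
print(pd.concat([df1, df2], join="outer", axis=0, ignore_index=True))
#      a  b    c
# 0  1.0  3  NaN
# 1  2.0  4  NaN
# 2  NaN  5  7.0
# 3  NaN  6  8.0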
Example (my actual code and data):
# example code
data = pd.concat(datalist4, join="outer", axis=0, ignore_index=True)
# two example dataframes (about 90% of the column names should be in common
# between the two dataframes; the unnamed columns, etc. are not a significant
# number of the columns)
print(datalist4[0].head())
                 800_1     800_2   800_3  800_4                900_1     900_2
0  2014-08-06 09:00:00  BEST_BID  1117.1    103  2014-08-06 09:00:00  BEST_BID
1  2014-08-06 09:00:00  BEST_ASK  1120.0    103  2014-08-06 09:00:00  BEST_ASK
2  2014-08-06 09:00:00  BEST_BID  1106.9     11  2014-08-06 09:00:00  BEST_BID
3  2014-08-06 09:00:00  BEST_ASK  1125.8     62  2014-08-06 09:00:00  BEST_ASK
4  2014-08-06 09:00:00  BEST_BID  1117.1    103  2014-08-06 09:00:00  BEST_BID
    900_3  900_4               1000_1    1000_2  ...  2400_4
0  1017.2    103  2014-08-06 09:00:00  BEST_BID  ...     NaN
1  1020.1    103  2014-08-06 09:00:00  BEST_ASK  ...     NaN
2  1004.3     11  2014-08-06 09:00:00  BEST_BID  ...     NaN
3  1022.9     11  2014-08-06 09:00:00  BEST_ASK  ...     NaN
4  1006.7     10  2014-08-06 09:00:00  BEST_BID  ...     NaN
                      _1   _2   _3   _4                   _1.1  _2.1  _3.1  _4.1
0  #N/A Invalid Security  NaN  NaN  NaN  #N/A Invalid Security   NaN   NaN   NaN
1                    NaN  NaN  NaN  NaN                    NaN   NaN   NaN   NaN
2                    NaN  NaN  NaN  NaN                    NaN   NaN   NaN   NaN
3                    NaN  NaN  NaN  NaN                    NaN   NaN   NaN   NaN
4                    NaN  NaN  NaN  NaN                    NaN   NaN   NaN   NaN
dater
0 2014.8.6
1 2014.8.6
2 2014.8.6
3 2014.8.6
4 2014.8.6
[5 rows x 777 columns]
print(datalist4[1].head())
                 150_1     150_2   150_3  150_4                200_1     200_2
0  2013-12-04 09:00:00  BEST_BID  1639.6     30  2013-12-04 09:00:00  BEST_ASK
1  2013-12-04 09:00:00  BEST_ASK  1641.8    133  2013-12-04 09:00:08  BEST_BID
2  2013-12-04 09:00:01  BEST_BID  1639.5     30  2013-12-04 09:00:08  BEST_ASK
3  2013-12-04 09:00:05  BEST_BID  1639.4     30  2013-12-04 09:00:08  BEST_ASK
4  2013-12-04 09:00:08  BEST_BID  1639.3    133  2013-12-04 09:00:08  BEST_BID
    200_3  200_4                250_1     250_2  ...               2500_1
0  1591.9    133  2013-12-04 09:00:00  BEST_BID  ...  2013-12-04 10:29:41
1  1589.4     30  2013-12-04 09:00:00  BEST_ASK  ...  2013-12-04 11:59:22
2  1591.6    103  2013-12-04 09:00:01  BEST_BID  ...  2013-12-04 11:59:23
3  1591.6    133  2013-12-04 09:00:04  BEST_BID  ...  2013-12-04 11:59:26
4  1589.4    133  2013-12-04 09:00:07  BEST_BID  ...  2013-12-04 11:59:29
     2500_2  2500_3  2500_4         Unnamed: 844_1  Unnamed: 844_2
0  BEST_ASK    0.35      50  #N/A Invalid Security             NaN
1  BEST_ASK    0.35      11                    NaN             NaN
2  BEST_ASK    0.40      11                    NaN             NaN
3  BEST_ASK    0.45      11                    NaN             NaN
4  BEST_ASK    0.50      21                    NaN             NaN
  Unnamed: 844_3  Unnamed: 844_4         Unnamed: 848_1      dater
0            NaN             NaN  #N/A Invalid Security  2013.12.4
1            NaN             NaN                    NaN  2013.12.4
2            NaN             NaN                    NaN  2013.12.4
3            NaN             NaN                    NaN  2013.12.4
4            NaN             NaN                    NaN  2013.12.4
[5 rows x 850 columns]