Update:
The pandas df was created like this:
df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])
Creating a dask df from this df looks like this:
df = dd.from_pandas(encoded, npartitions=50)
Performing the operation with dask shows no visible progress (checked with dask's diagnostics):
result = df.groupby('journal_entry').max().reset_index().compute()
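For reference, here is the whole pipeline as a single sketch, with dask's ProgressBar attached as one way to watch for progress (query and engine are the SQL setup from above; the name ddf is mine, used only to avoid reusing df):
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# Load and one-hot encode in pandas (query and engine come from the existing SQL setup)
df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])

# Convert to a dask dataframe with 50 partitions, as above
ddf = dd.from_pandas(encoded, npartitions=50)

# Run the groupby-max under a progress bar; this is where no progress is visible
with ProgressBar():
    result = ddf.groupby('journal_entry').max().reset_index().compute()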
Original:
I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8 and hold only values of 1 or 0, so the uint8 block alone is roughly 2.7M × 4,000 bytes ≈ 10.8 GB. I am attempting to perform this operation on the df:
result = df.groupby('id').max().reset_index()
Predictably, this operation immediately raises a memory error. My initial thought was to chunk the df both horizontally and vertically (the row-wise part of the idea is sketched below). However, this gets messy, since the .max() needs to be taken across all of the uint8 columns, not just a pair of them, and chunking the df this way is still extremely slow. I have 32 GB of RAM on my machine.
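To make the idea concrete, here is a rough sketch of the row-wise chunking I had in mind (chunked_groupby_max and the chunk size of 100,000 rows are placeholder names and numbers of my own, not existing code):
import pandas as pd

def chunked_groupby_max(df, key, chunk_size=100_000):
    # Take the per-group max in row chunks, then max-reduce the partial results.
    # This is valid because max is associative: a group split across chunks can
    # be reduced again after the chunks are combined.
    partials = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        partials.append(chunk.groupby(key).max())
    combined = pd.concat(partials)
    return combined.groupby(key).max().reset_index()

# result = chunked_groupby_max(df, 'id')
Column-wise chunking could be layered on top by running this over subsets of the uint8 columns and concatenating the results, but that is exactly the messy bookkeeping described above.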
What strategy could mitigate the memory issue?