Update:
The pandas df was created like this:
df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])
Creating a dask df from this df looks like this:
df = dd.from_pandas(encoded, npartitions=50)
Performing the operation with dask shows no visible progress (checked with dask's diagnostics):
result = df.groupby('journal_entry').max().reset_index().compute()
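For reference, here is the whole pipeline as a single sketch, with dask's ProgressBar attached as one way to watch for progress (query and engine are the SQL setup from above; the name ddf is mine, used only to avoid reusing df):
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

# Load and one-hot encode in pandas (query and engine come from the existing SQL setup)
df = pd.read_sql(query, engine)
encoded = pd.get_dummies(df, columns=['account'])

# Convert to a dask dataframe with 50 partitions, as above
ddf = dd.from_pandas(encoded, npartitions=50)

# Run the groupby-max under a progress bar; this is where no progress is visible
with ProgressBar():
    result = ddf.groupby('journal_entry').max().reset_index().compute()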
Original:
I have a large pandas df with 2.7M rows and 4,000 columns. All but four of the columns are of dtype uint8 and hold only values of 1 or 0, so the uint8 block alone is roughly 2.7M × 4,000 bytes ≈ 10.8 GB. I am attempting to perform this operation on the df:
result = df.groupby('id').max().reset_index()
Predictably, this operation immediately raises a memory error. My initial thought was to chunk the df both horizontally and vertically (the row-wise part of the idea is sketched below). However, this gets messy, since the .max() needs to be taken across all of the uint8 columns, not just a pair of them, and chunking the df this way is still extremely slow. I have 32 GB of RAM on my machine.
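To make the idea concrete, here is a rough sketch of the row-wise chunking I had in mind (chunked_groupby_max and the chunk size of 100,000 rows are placeholder names and numbers of my own, not existing code):
import pandas as pd

def chunked_groupby_max(df, key, chunk_size=100_000):
    # Take the per-group max in row chunks, then max-reduce the partial results.
    # This is valid because max is associative: a group split across chunks can
    # be reduced again after the chunks are combined.
    partials = []
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        partials.append(chunk.groupby(key).max())
    combined = pd.concat(partials)
    return combined.groupby(key).max().reset_index()

# result = chunked_groupby_max(df, 'id')
Column-wise chunking could be layered on top by running this over subsets of the uint8 columns and concatenating the results, but that is exactly the messy bookkeeping described above.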
What strategy could mitigate the memory issue?