python 3.x - Dask apply with custom function

Posted Oct 24, 2021 in Technique by 深蓝

I am experimenting with Dask, but I encountered a problem while using apply after grouping.

I have a Dask DataFrame with a large number of rows. Consider, for example, the following:

import numpy as np
import pandas as pd
import dask.dataframe as dd
N = 10000
df = pd.DataFrame({'col_1': np.random.random(N), 'col_2': np.random.random(N)})
ddf = dd.from_pandas(df, npartitions=8)

I want to bin the values of col_1, and I followed the solution from here:

bins = np.linspace(0, 1, 11)
labels = list(range(len(bins) - 1))
ddf2 = ddf.map_partitions(test_f, 'col_1', bins, labels)

where

def test_f(df, col, bins, labels):
    return df.assign(bin_num=pd.cut(df[col], bins, labels=labels))

and this works as I expect it to.

Now I want to take the median value in each bin (taken from here)

median = ddf2.groupby('bin_num')['col_1'].apply(pd.Series.median).compute()

With 10 bins, I expect median to have 10 rows, but it actually has 80. The DataFrame has 8 partitions, so I guess that apply is somehow operating on each partition individually.

However, if I want the mean and use mean() instead:

mean = ddf2.groupby('bin_num')['col_1'].mean().compute()

it works and the output has 10 rows.

The question is then: what am I doing wrong that prevents apply from behaving like mean?

1 Reply

深蓝 · Answer 1 · 2021-10-23T20:07:18+0000

This warning from the Dask documentation for SeriesGroupBy.apply may be the key:

Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
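
To make the 8 x 10 = 80 row count concrete, here is a minimal pure-Python sketch of the behaviour the warning describes. It does not use Dask or pandas: the round-robin split into `partitions`, the helpers `bin_of` and `group_by_bin`, and the half-open bin edges are all assumptions made for illustration only. It also hints at why a mean can be combined across partitions (from per-partition sums and counts), while a median needs all values of a group in one place.

```python
import random
import statistics
from collections import defaultdict

random.seed(0)

N_PARTITIONS = 8   # same partition count as in the question
N_BINS = 10        # ten equal-width bins on [0, 1)

values = [random.random() for _ in range(10_000)]

# Round-robin split: a stand-in for Dask partitions (illustrative assumption).
partitions = [values[i::N_PARTITIONS] for i in range(N_PARTITIONS)]


def bin_of(x):
    """Index of the equal-width bin containing x (half-open bins, unlike pd.cut)."""
    return min(int(x * N_BINS), N_BINS - 1)


def group_by_bin(rows):
    """Map bin index -> list of values that fall into that bin."""
    groups = defaultdict(list)
    for x in rows:
        groups[bin_of(x)].append(x)
    return groups


# A reduction applied once per partition-group pair gives one result per pair.
per_pair_medians = {
    (p, b): statistics.median(xs)
    for p, part in enumerate(partitions)
    for b, xs in group_by_bin(part).items()
}
print(len(per_pair_medians))  # 80: every bin shows up in every partition

# The mean decomposes into three stages:
#   chunk:    per partition, sum and count for each bin
#   agg:      add the partial sums and counts per bin across partitions
#   finalize: divide total sum by total count
totals = defaultdict(lambda: [0.0, 0])
for part in partitions:
    for b, xs in group_by_bin(part).items():
        totals[b][0] += sum(xs)
        totals[b][1] += len(xs)
means = {b: s / n for b, (s, n) in totals.items()}
print(len(means))  # 10

# A median does not decompose this way: gather each bin's values first.
global_medians = {b: statistics.median(xs) for b, xs in group_by_bin(values).items()}
print(len(global_medians))  # 10
```

The three stages used for the mean correspond to the chunk, agg and finalize callables that dask.dataframe.Aggregation accepts, which is presumably why the built-in mean() returns 10 rows while a reduction passed to apply does not. Whether a given reduction can be split this way is something to check case by case.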
