Another alternative is to precompute the values you will need before using groupby/agg:
import numpy as np
import pandas as pd

N = 1000
df = pd.DataFrame({'A': np.random.choice(['foo', 'bar'], replace=True, size=(N,)),
                   'B': np.random.randint(-10, 10, size=(N,)),
                   'C': np.random.randint(-10, 10, size=(N,))})

def using_precomputation(df):
    # Zero out B where C < 0 and C where B < 0, so a plain builtin 'sum' per group
    # does the rest. Note this adds the helper columns B2 and C2 to df in place.
    df['B2'] = df['B'] * (df['C'] >= 0).astype(int)
    df['C2'] = df['C'] * (df['B'] >= 0).astype(int)
    result = df.groupby('A').agg({'B2': 'sum', 'C2': 'sum'})
    return result.rename(columns={'B2': 'B', 'C2': 'C'})
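For reference, here is what the computation means (a hypothetical call; the exact numbers depend on the random data generated above):

result = using_precomputation(df)
# result is indexed by the values of A and has columns B and C;
# e.g. result.loc['foo', 'B'] is the sum of B over rows where A == 'foo' and C >= 0.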
Let's compare using_precomputation with using_index and using_apply:
def using_index(df):
    # For each group, use the group's index to pull the other column out of df,
    # mask it, and sum it -- two indexing operations per group.
    result = df.groupby('A').agg({'B': lambda x: df.loc[x.index, 'C'][x >= 0].sum(),
                                  'C': lambda x: df.loc[x.index, 'B'][x >= 0].sum()})
    return result.rename(columns={'B': 'C', 'C': 'B'})

def my_func(group):
    # group is the sub-DataFrame for one value of A
    b = group[group.C >= 0].B.sum()
    c = group[group.B >= 0].C.sum()
    return pd.Series({'B': b, 'C': c})

def using_apply(df):
    return df.groupby('A').apply(my_func)
First, let's check that they all return the same result:
def is_equal(df, func1, func2):
    result1 = func1(df).sort_index(axis=1)
    result2 = func2(df).sort_index(axis=1)
    assert result1.equals(result2)

is_equal(df, using_precomputation, using_index)
is_equal(df, using_precomputation, using_apply)
Using the 1000-row DataFrame above:
In [83]: %timeit using_precomputation(df)
100 loops, best of 3: 2.45 ms per loop
In [84]: %timeit using_index(df)
100 loops, best of 3: 4.2 ms per loop
In [85]: %timeit using_apply(df)
100 loops, best of 3: 6.84 ms per loop
Why is using_precomputation faster?
Precomputation allows us to take advantage of fast vectorized arithmetic on entire columns and allows the aggregation function to be the simple builtin sum. Builtin aggregators tend to be faster than custom aggregation functions such as the ones used here (based on jezrael's solution):
def using_index(df):
    result = df.groupby('A').agg({'B': lambda x: df.loc[x.index, 'C'][x >= 0].sum(),
                                  'C': lambda x: df.loc[x.index, 'B'][x >= 0].sum()})
    return result.rename(columns={'B': 'C', 'C': 'B'})
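The builtin-versus-Python-level difference is easy to see in isolation. Here's a minimal sketch (not part of the original timings; numbers will vary by machine) comparing the builtin 'sum' with an equivalent lambda on the same precomputed column:

df['B2'] = df['B'] * (df['C'] >= 0).astype(int)
grouped = df.groupby('A')['B2']

# %timeit grouped.sum()                    # builtin aggregator, dispatched to fast compiled code
# %timeit grouped.agg(lambda s: s.sum())   # custom function, called once per group in Python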
Moreover, the less work you have to do on each little group, the better off you are performance-wise. Having to index into df twice for every group, as using_index does, hurts performance.
Another performance killer is using groupby/apply(func) where func returns a Series. That builds one Series for each row of the result and then forces Pandas to align and concatenate all of those Series. Since the Series tend to be short and there tend to be many of them, concatenating all these little Series is slow. Again, you get the best performance out of Pandas/NumPy when performing vectorized operations on big arrays; looping through lots of tiny results kills performance.
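To see that overhead in isolation, here is a hedged sketch (not from the original benchmark): both calls below compute plain per-group sums, so any timing gap comes from building and concatenating one small Series per group inside apply.

def trivial(group):
    # same arithmetic as the builtin aggregation below, but returns one Series per group
    return pd.Series({'B': group['B'].sum(), 'C': group['C'].sum()})

# %timeit df.groupby('A').apply(trivial)      # one small Series per group, then align/concat
# %timeit df.groupby('A')[['B', 'C']].sum()   # single vectorized pass per column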