There are two versions of agg (short for aggregate) and apply: The first is defined on groupby objects and the second one is defined on DataFrames.
If you consider groupby.agg and groupby.apply, the main difference is that apply is flexible (docs):
Some operations on the grouped data might not fit into either the
aggregate or transform categories. Or, you may simply want GroupBy to
infer how to combine the results. For these, use the apply function,
which can be substituted for both aggregate and transform in many
standard use cases.
Note: apply can act as a reducer, transformer, or filter function,
depending on exactly what is passed to apply. So depending on the path
taken, and exactly what you are grouping. Thus the grouped columns(s)
may be included in the output as well as set the indices.
See Python Pandas : How to return grouped lists in a column as a dict, for example, for an illustration of how the return type is changed automatically.
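As a minimal sketch of that flexibility, the snippet below passes three different functions to groupby.apply on a tiny made-up frame (the column names key and val are hypothetical, not from the question). It selects a single column first so that each function receives a Series; the shape of the result depends only on what the function returns.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
grouped = df.groupby("key")["val"]

# Reducer: one scalar per group, so the result has one entry per key.
reduced = grouped.apply(lambda s: s.sum())

# Each group collapsed into a Python list: still one entry per key,
# but the values are now list objects.
as_lists = grouped.apply(list)

# Transformer: one value per input row, so the result is as long as df.
demeaned = grouped.apply(lambda s: s - s.mean())

print(reduced)
print(as_lists)
print(demeaned)
```

The index of the transformed result can differ between pandas versions (group keys may or may not be added), so it is safer to rely on the values than on the index layout.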
groupby.agg, on the other hand, is very good for applying cython optimized functions (i.e. being able to calculate 'sum', 'mean', 'std' etc. very fast). It also allows calculating multiple (different) functions on different columns. For example,
df.groupby('some_column').agg({'first_column': ['mean', 'std'],
                               'second_column': ['sum', 'sem']})
calculates the mean and the standard deviation on the first column and sum and standard error of the mean on the second column. See dplyr summarize equivalent in pandas for more examples.
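A runnable version of that call, as a sketch on made-up numbers (the data values are assumptions; the column names follow the snippet above), shows that the result has one row per group and a two-level column index of (column, function):

```python
import pandas as pd

# Hypothetical data using the column names from the example above.
df = pd.DataFrame({
    "some_column": ["x", "x", "y", "y"],
    "first_column": [1.0, 3.0, 5.0, 9.0],
    "second_column": [10, 20, 30, 50],
})

# Different functions for different columns, given as a dict of lists.
summary = df.groupby("some_column").agg(
    {"first_column": ["mean", "std"], "second_column": ["sum", "sem"]}
)

print(summary)

# Individual cells are addressed with a (column, function) tuple.
print(summary.loc["x", ("first_column", "mean")])
```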
These differences are also summarized in What is the difference between pandas agg and apply function? But that one focuses on the differences between groupby.agg and groupby.apply.
DataFrame.agg is new in version 0.20. Earlier, we weren't able to apply multiple different functions to different columns because it was only possible with groupby objects. Now, you can summarize a DataFrame by calculating multiple different functions on its columns. Example from Is there a pandas equivalent of dplyr::summarise?:
iris.agg({'sepal_width': 'min', 'petal_width': 'max'})
petal_width 2.5
sepal_width 2.0
dtype: float64
iris.agg({'sepal_width': ['min', 'median'], 'sepal_length': ['min', 'mean']})
        sepal_length  sepal_width
mean        5.843333          NaN
median           NaN          3.0
min         4.300000          2.0
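For readers without the iris data at hand, here is a small self-contained sketch of the same two patterns. The three rows are invented (only the column names are borrowed from iris), so the numbers will not match the output above.

```python
import pandas as pd

# Invented values; only the column names are borrowed from iris.
df = pd.DataFrame({
    "sepal_width": [2.0, 3.0, 4.4],
    "petal_width": [0.1, 1.3, 2.5],
})

# One function per column: the result is a Series indexed by column name.
one_each = df.agg({"sepal_width": "min", "petal_width": "max"})

# Lists of functions per column: the result is a DataFrame whose rows are
# the function names; cells for functions not requested on a column are NaN.
several = df.agg({
    "sepal_width": ["min", "median"],
    "petal_width": ["min", "max"],
})

print(one_each)
print(several)
```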
This is not possible with DataFrame.apply. It either goes column by column or row by row and executes the same function on that column/row. For a single function like lambda x: x**2, DataFrame.agg and DataFrame.apply produce the same results, but their intended usage is very different.
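A quick way to check that last claim is the sketch below, on a hypothetical two-column frame. It relies on DataFrame.agg accepting a plain callable that returns a same-shaped result, which is the behaviour I would expect in current pandas but is worth verifying on your own version.

```python
import pandas as pd

# Hypothetical frame, only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# The same single function applied column by column through both methods.
squared_apply = df.apply(lambda x: x**2)
squared_agg = df.agg(lambda x: x**2)

# True when both results have identical values, labels and dtypes.
print(squared_apply.equals(squared_agg))
```

Even when the outputs match, agg is meant for summarizing (often with named functions such as 'min' or 'mean'), while apply is the general tool for running one arbitrary function along an axis.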