The most common special cases of the operation you describe are available directly as cummax, cummin, cumprod and cumsum (the last one being f(x) = x + f(x-1)).
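For instance, cumsum implements exactly that recursion; a quick sanity check (nothing here beyond standard pandas):
import pandas as pd
s = pd.Series([3, 1, 4, 1, 5])
print(s.cumsum().tolist())   # [3, 4, 8, 9, 14] -- built-in f(x) = x + f(x-1)
# the same recursion written out explicitly
out, prev = [], 0
for v in s:
    prev = v + prev
    out.append(prev)
print(out)                   # [3, 4, 8, 9, 14]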
More functionality can be found in expanding objects: mean, standard deviation, variance, kurtosis, skewness, correlation, etc.
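For illustration, a few of those built-in expanding aggregations (a minimal sketch; the NaN behaviour for short windows follows pandas' defaults):
s = pd.Series([3., 1., 4., 1., 5.])
s.expanding().mean()   # mean of everything seen so far
s.expanding().std()    # running sample standard deviation
s.expanding().var()    # running variance
s.expanding().skew()   # running skewness (needs at least 3 observations)
s.expanding().kurt()   # running kurtosis (needs at least 4 observations)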
And for the most general case, you can use expanding().apply() with a custom function. For example,
from functools import reduce  # for Python 3.x, where reduce is no longer a builtin
ser.expanding().apply(lambda r: reduce(lambda prev, value: prev + 2*value, r))
is equivalent to f(x) = 2x + f(x-1).
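To make the equivalence concrete, here is the same recursion written out by hand on a tiny Series (note that reduce seeds the recursion with the first element itself, so f at the first position is simply the first value):
tiny = pd.Series([1.0, 2.0, 3.0])
via_apply = tiny.expanding().apply(lambda r: reduce(lambda prev, value: prev + 2*value, r))
prev = tiny.iloc[0]           # seed: f(first) = first value
explicit = [prev]
for x in tiny.iloc[1:]:
    prev = 2*x + prev         # f(x) = 2x + f(x-1)
    explicit.append(prev)
print(via_apply.tolist())     # [1.0, 5.0, 11.0]
print(explicit)               # [1.0, 5.0, 11.0]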
The methods I listed are optimized and run quite fast, but with a custom function the performance gets much worse. For exponential smoothing, pandas starts to outperform a plain loop for Series of length around 1000, but expanding().apply() with reduce is far slower than either:
np.random.seed(0)
ser = pd.Series(70 + 5*np.random.randn(10**4))
ser.tail()
Out:
9995 60.953592
9996 70.211794
9997 72.584361
9998 69.835397
9999 76.490557
dtype: float64
ser.ewm(alpha=0.1, adjust=False).mean().tail()
Out:
9995 69.871614
9996 69.905632
9997 70.173505
9998 70.139694
9999 70.774781
dtype: float64
%timeit ser.ewm(alpha=0.1, adjust=False).mean()
1000 loops, best of 3: 779 μs per loop
With loops:
def exp_smoothing(ser, alpha=0.1):
    prev = ser[0]                            # seed with the first value
    res = [prev]
    for cur in ser[1:]:
        prev = alpha*cur + (1-alpha)*prev    # EWMA recursion: f(x) = a*x + (1-a)*f(x-1)
        res.append(prev)
    return pd.Series(res, index=ser.index)
exp_smoothing(ser).tail()
Out:
9995 69.871614
9996 69.905632
9997 70.173505
9998 70.139694
9999 70.774781
dtype: float64
%timeit exp_smoothing(ser)
100 loops, best of 3: 3.54 ms per loop
Total time is still in milliseconds, but with expanding().apply():
ser.expanding().apply(lambda r: reduce(lambda p, v: 0.9*p+0.1*v, r)).tail()
Out:
9995 69.871614
9996 69.905632
9997 70.173505
9998 70.139694
9999 70.774781
dtype: float64
%timeit ser.expanding().apply(lambda r: reduce(lambda p, v: 0.9*p+0.1*v, r))
1 loop, best of 3: 13 s per loop
Methods like cummin and cumsum are optimized and only require x's current value and the function's previous value. With a custom function, however, the complexity is O(n**2). This is because, in general, the function's previous value and x's current value are not enough to compute the function's current value, so the custom function is re-applied to the whole prefix every time. For cumsum you can take the previous cumulative sum and add the current value; you cannot do that for, say, a geometric mean. That's why expanding becomes unusable for even moderately sized Series.
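Conceptually, expanding().apply() with a custom function ends up doing something like this (a simplified sketch, not pandas' actual implementation):
def expanding_apply_sketch(values, func):
    # func sees the whole prefix every time, so the total work is
    # 1 + 2 + ... + n elements, i.e. O(n**2)
    return [func(values[:i + 1]) for i in range(len(values))]
# e.g. expanding_apply_sketch(list(ser), lambda r: reduce(lambda p, v: 0.9*p + 0.1*v, r))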
In general, iterating over a Series is not a very expensive operation. Iterating over a DataFrame is much slower, because it has to build a copy of each row, but that is not the case for a Series. Of course you should use vectorized methods when they are available, but when they aren't, using a for loop for a task like a recursive calculation is fine.
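If the Python-level indexing inside the loop ever becomes a concern, one small tweak (my suggestion, not something required here) is to iterate over the underlying NumPy array instead of the Series, which keeps the same O(n) recursion but avoids pandas' per-element indexing overhead:
def exp_smoothing_np(ser, alpha=0.1):
    values = ser.values                   # plain ndarray: cheap element access
    prev = values[0]
    res = [prev]
    for cur in values[1:]:
        prev = alpha*cur + (1-alpha)*prev
        res.append(prev)
    return pd.Series(res, index=ser.index)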