I suggest you have a look at the source code in order to get into the nitty gritty of what rolling does. In particular I suggest you have a look at the rolling functions in generic.py and window.py. From there you can have a look at the Window class, which is used if you specify a window type, or the default Rolling class. The last one inherits from _Rolling_and_Expanding and ultimately _Rolling and _Window.
That said, I'll give my two cents: pandas' whole rolling mechanism relies on the numpy function apply_along_axis, which is used in conjunction with the windows.pyx cython module. In goes your series, out comes the aggregated rolling window. Typical aggregation functions are handled for you efficiently, but custom ones (using apply()) go through roll_generic() in windows.pyx.
The rolling function in pandas operates on DataFrame columns independently. It is not a Python iterator and is lazily evaluated, meaning nothing is computed until you apply an aggregation function to it. The functions that actually slide the window over the data aren't called until right before an aggregation is done.
A source of confusion might be that you're thinking of the rolling object as a dataframe. (You have named the rolling object df in your last code snippet.) It really isn't. It is an object which can produce dataframes by applying aggregations over the window logic it houses.
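A small sketch of that distinction, assuming a recent pandas version (the variable names here are my own, not from your code): the object returned by rolling() is not a DataFrame, and only the aggregation call produces one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5), columns=['a'])

# Creating the rolling object does no window computation yet.
rol = df.rolling(3)
is_frame_before = isinstance(rol, pd.DataFrame)

# Applying an aggregation is what actually produces a new DataFrame.
result = rol.sum()
is_frame_after = isinstance(result, pd.DataFrame)

print(type(rol).__name__, is_frame_before, is_frame_after)
```

So it is usually clearer to give the rolling object its own name (rol, say) rather than reusing df.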
The lambda you are supplying is applied for each cell of your new dataframe. It takes a window backwards (along each column) in your old dataframe, and it aggregates it to one single cell in the new dataframe. The aggregation can be things like sum, mean, something custom you've made, etc., over some window size, say 3. Here are some examples:
import numpy as np
import pandas as pd
a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
df.rolling(3).sum().dropna()
... which can also be done by:
df.rolling(3).apply(np.sum).dropna()
... and produces:
a
2 3.0
3 6.0
4 9.0
(The first column is the index value and can be ignored here, and for the next examples.)
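To make the "window backwards" idea concrete, here is a rough sketch that reproduces a rolling sum of size 3 by slicing the values by hand. The hand-written loop is only an illustration of the semantics, not how pandas implements it internally.

```python
import numpy as np
import pandas as pd

values = np.arange(5)
n = 3

# For each position i with a full window, look back over the last n values.
manual = [int(values[i - n + 1:i + 1].sum()) for i in range(n - 1, len(values))]

# The same thing via pandas; incomplete windows are NaN and get dropped.
rolled = pd.Series(values).rolling(n).sum().dropna()

print(manual)
print(rolled.tolist())
```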
Notice how we supplied an existing numpy aggregation function. That's the idea. We're supposed to be able to supply anything we want as long as it conforms to what aggregation functions do, i.e., take a vector of values and produce a single value from it. Here is another one where we create a custom aggregation function, in this case the L2 norm of the window:
df.rolling(3).apply(lambda x: np.sqrt(x.dot(x))).dropna()
If you're not familiar with lambda functions, this is the same as:
def euclidean_dist(x):
    return np.sqrt(x.dot(x))
df.rolling(3).apply(euclidean_dist).dropna()
... yielding:
a
2 2.236068
3 3.741657
4 5.385165
Just to make sure, we can manually check that np.sqrt(0**2 + 1**2 + 2**2) is indeed 2.236068.
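The same check can be written out in code. This sketch also uses np.linalg.norm as a ready-made L2 norm and passes raw=True so that each window arrives as a NumPy array; both are my choices for the illustration and assume a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# Check the first window by hand in two equivalent ways.
window = np.array([0.0, 1.0, 2.0])
by_dot = np.sqrt(window.dot(window))
by_norm = np.linalg.norm(window)

# Any function mapping a vector to a scalar can be handed to apply().
norms = pd.Series(np.arange(5.0)).rolling(3).apply(np.linalg.norm, raw=True).dropna()

print(by_dot, by_norm)
print(norms.tolist())
```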
[In your original edit, in the] last code snippet, your code is probably failing earlier than you expect. It is failing before the invocation of df.apply(...): you are trying to add a rolling object named df to the number 2 before it is passed to df.apply(...). The rolling object isn't something you do operations on. The function you have supplied also doesn't conform to what an aggregation function looks like in general. The a is a list with the values of a window, and b would be a constant extra parameter you pass in. It can be a rolling object if you want, but that wouldn't typically be something you would like to do. To make it more clear, here is something which is similar to what you were doing in your original edit but works:
a = np.arange(8)
df = pd.DataFrame(a, columns=['a'])
n = 4
rol = df.rolling(n)
def prod(window_list, constant_rol):
    return window_list.dot(constant_rol.sum().dropna().head(n))
rol.apply(prod, args=(rol,)).dropna()
# [92.0, 140.0, 188.0, 236.0, 284.0]
It is a contrived example, but I'm showing it to make the point that you can pass in whatever you want as a constant, even the rolling object you are using itself. The dynamic part is the first argument, a in your case or window_list in my case. All defined windows, in the form of individual lists, are passed into that function one by one.
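A less contrived sketch of the same mechanism, with a fixed weight vector as the constant. The weighted_sum function and the weights are made up for illustration, and raw=True is an assumption about running a recent pandas, where it makes each window arrive as a NumPy array.

```python
import numpy as np
import pandas as pd

def weighted_sum(window, weights):
    # window changes on every call; weights is the same object each time.
    return float(np.dot(window, weights))

weights = np.array([0.2, 0.3, 0.5])
s = pd.Series(np.arange(5.0))

out = s.rolling(3).apply(weighted_sum, raw=True, args=(weights,)).dropna()
print(out)
```

The first full window is [0, 1, 2], so the first value is 0*0.2 + 1*0.3 + 2*0.5 = 1.3.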
Based on your followup comments this might be what you're looking for:
import numpy as np
import pandas as pd
n = 3
a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
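def keep(window, windows):
    # Stash a copy of each window as a side effect; the returned value is not used.
    windows.append(window.copy())
    return window[-1]
windows = list()
# raw=True hands each window to keep() as a NumPy array (needed on recent pandas versions)
df['a'].rolling(n).apply(keep, raw=True, args=(windows,))
# One window is stored per complete block, so keep exactly that many trailing rows
df = df.tail(len(windows)).copy()
df['a_window'] = windows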
which adds arrays/vectors to each rolling block thus producing:
a a_window
2 2 [0.0, 1.0, 2.0]
3 3 [1.0, 2.0, 3.0]
4 4 [2.0, 3.0, 4.0]
Note that it only works if you do it on a column at a time. If you want to do some math on the window before you store it away in keep, that is fine too.
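If all you need is the windows themselves, a different route that skips rolling entirely is numpy's sliding_window_view (available in NumPy 1.20 and later). This is a swapped-in technique and not what pandas does internally; treat it as a sketch for a single numeric column.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

n = 3
a = np.arange(5)

# Each row of `windows` is one complete window of length n (a read-only view).
windows = sliding_window_view(a, n)

# Align each window with the index label of its last element.
df = pd.DataFrame(
    {'a': a[n - 1:], 'a_window': list(windows)},
    index=range(n - 1, len(a)),
)
print(df)
```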
That said, without more input on exactly what you are trying to achieve it is hard to construct an example which suits your needs.
If your ultimate goal is to create a dataframe of lagging variables then I'd go for using real columns using shift():
import numpy as np
import pandas as pd
a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
for i in range(1, 3):
    df['a-%s' % i] = df['a'].shift(i)
df.dropna()
... giving:
a a-1 a-2
2 2 1.0 0.0
3 3 2.0 1.0
4 4 3.0 2.0
(There might be some more beautiful way of doing it, but it gets the job done.)
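One possible variant, sketched here with pd.concat and a dict comprehension, builds all lag columns in one go instead of assigning them in a loop. The column naming follows the example above; whether it is more beautiful is a matter of taste.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5), columns=['a'])

# Build every lagged column at once; the dict keys become the column names.
lags = pd.concat({f'a-{i}': df['a'].shift(i) for i in range(1, 3)}, axis=1)

# Join the lags onto the original frame and drop the incomplete leading rows.
result = df.join(lags).dropna()
print(result)
```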
Regarding your variable b in your first code snippet, remember that DataFrames in pandas aren't typically handled as tensors of arbitrary dimensions/objects. You can probably stuff whatever you want into them, but ultimately strings, time objects, ints and floats are what is expected. That might be the reason the designers of pandas haven't bothered with allowing rolling aggregation to non-scalar values. It doesn't even seem like a simple string is allowed as the output of the aggregation function.
Anyway, I hope this answers some of your questions. If not, let me know, and I'll try to help you out in the comments or in an update.
Final note on the _create_blocks() function of rolling objects.
The _create_blocks() function handles the reindexing and binning when you use the freq argument of rolling.
If you use freq with, say, weeks such that freq='W':
import numpy as np
import pandas as pd
a = np.arange(50)
df = pd.DataFrame(a, columns=['a'])
df.index = pd.to_datetime('2016-01-01') + pd.to_timedelta(df['a'], 'D')
blocks, obj, index = df.rolling(4, freq='W')._create_blocks(how=None)
for b in blocks:
    print(b)
... then we get the binned (not rolling) original data week-by-week:
a
a
2016-01-03 2.0
2016-01-10 9.0
2016-01-17 16.0
2016-01-24 23.0
2016-01-31 30.0
2016-02-07 37.0
2016-02-14 44.0
2016-02-21 NaN
Notice that this isn't the output of the aggregated rolling. This is simply the new blocks it works on. After this, we do an aggregation like sum and get:
a
a
2016-01-03 NaN
2016-01-10 NaN
2016-01-17 NaN
2016-01-24 50.0
2016-01-31 78.0
2016-02-07 106.0
2016-02-14 134.0
2016-02-21 NaN
... which checks out with a test summation: 50 = 2 + 9 + 16 + 23.
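Since freq and the private _create_blocks() may not be available in the pandas version you are running, here is a rough sketch that mimics the two steps with public calls only. Using reindex on an explicit weekly date_range is my assumption about how to reproduce the binned values shown above, not a description of what pandas did internally.

```python
import numpy as np
import pandas as pd

a = np.arange(50)
index = pd.to_datetime('2016-01-01') + pd.to_timedelta(a, unit='D')
df = pd.DataFrame({'a': a}, index=index)

# Step 1: conform the data to weekly (Sunday) timestamps; dates without data become NaN.
weekly_index = pd.date_range('2016-01-03', periods=8, freq='W')
binned = df.reindex(weekly_index)

# Step 2: run the ordinary rolling aggregation on the conformed data.
rolled = binned.rolling(4).sum()

print(binned)
print(rolled)
```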
If you don't use freq as an argument it simply returns the original data structure:
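import numpy as np
import pandas as pd
a = np.arange(5)
df = pd.DataFrame(a, columns=['a'])
# Same date index as in the previous example, so the printed output matches
df.index = pd.to_datetime('2016-01-01') + pd.to_timedelta(df['a'], 'D')
blocks, obj, index = df.rolling(3)._create_blocks(how=None)
for b in blocks:
    print(b)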
... which produces ...
a
a
2016-01-01 0
2016-01-02 1
2016-01-03 2
2016-01-04 3
2016-01-05 4
... and is used for rolling window aggregation.