I have been following a similar answer here, but I have some questions about using sklearn with pandas rolling apply. I am trying to compute z-scores and run PCA with rolling apply, but I keep getting the error 'only length-1 arrays can be converted to Python scalars'.
Following the previous example, I create a dataframe:
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
sc=StandardScaler()
tmp=pd.DataFrame(np.random.randn(2000,2)/10000,index=pd.date_range('2001-01-01',periods=2000),columns=['A','B'])
If I use the rolling command:
tmp.rolling(window=5,center=False).apply(lambda x: sc.fit_transform(x))
TypeError: only length-1 arrays can be converted to Python scalars
I get the error above. However, functions that return a single value, such as a mean or a standard deviation, work with no problem.
def test(df):
    return np.mean(df)
tmp.rolling(window=5,center=False).apply(lambda x: test(x))
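A minimal sketch of why the mean version works, assuming pandas hands the function one column's window at a time as a 1-D array (which is what `raw=True` requests) and expects a single number back. The frame `demo` and the helper `record_and_mean` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame, only for illustration
demo = pd.DataFrame(np.arange(12, dtype=float).reshape(6, 2), columns=['A', 'B'])

seen_shapes = []

def record_and_mean(window):
    # With raw=True, `window` is a 1-D NumPy array holding one column's window
    seen_shapes.append(window.shape)
    return window.mean()  # a single float, which is what rolling apply can store

rolled = demo.rolling(window=3).apply(record_and_mean, raw=True)
print(set(seen_shapes))  # every call saw a 1-D array, never the 2-D frame
print(rolled)
```

Because each call only ever sees one column, anything that needs both columns together (such as PCA) cannot be expressed this way.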
I believe the error occurs when I try to subtract the mean from the current values for the z-score.
def test2(df):
    return df-np.mean(df)
tmp.rolling(window=5,center=False).apply(lambda x: test2(x))
TypeError: only length-1 arrays can be converted to Python scalars
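The function above returns a whole array per window, while rolling apply can only store one value per window, which would explain the error. One possible workaround, sketched here under the assumption that a z-score of the latest value against its own trailing window is what is wanted, is to skip apply and combine the built-in rolling mean and standard deviation. The helper name `rolling_zscore` is hypothetical, and `ddof=0` is chosen to match the population standard deviation that `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tmp = pd.DataFrame(rng.standard_normal((2000, 2)) / 10000,
                   index=pd.date_range('2001-01-01', periods=2000),
                   columns=['A', 'B'])

def rolling_zscore(df, window):
    # Hypothetical helper: z-score of each row relative to its trailing window
    roll = df.rolling(window=window)
    # ddof=0 gives the population standard deviation, as StandardScaler does
    return (df - roll.mean()) / roll.std(ddof=0)

z = rolling_zscore(tmp, 5)
print(z.tail())
```

The first `window - 1` rows come out as NaN because no full window is available there.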
How can I create custom rolling functions with sklearn to first standardize and then run PCA?
EDIT:
I realize my question was not exactly clear so I shall try again. I want to standardize my values and then run PCA to get the amount of variance explained by each factor. Doing this without rolling is fairly straightforward.
from sklearn import decomposition
testing=sc.fit_transform(tmp)
pca=decomposition.PCA() #run pca
pca.fit(testing)
pca.explained_variance_ratio_
array([ 0.50967441, 0.49032559])
I cannot use this same procedure when rolling. The rolling z-score function from @piRSquared gives me the z-scores, but PCA from sklearn seems to be incompatible with a custom rolling apply function. (In fact, I think this is the case with most sklearn modules.) I am just trying to get the explained variance, which is a one-dimensional item, but the code below returns a bunch of NaNs.
def test3(df):
    pca.fit(df)
    return pca.explained_variance_ratio_
tmp.rolling(window=5,center=False).apply(lambda x: test3(x))
I can also write my own explained variance function, but this does not work either.
def test4(df):
    cov_mat=np.cov(df.T) #need covariance of features, not observations
    eigen_vals,eigen_vecs=np.linalg.eig(cov_mat)
    tot=sum(eigen_vals)
    var_exp=[(i/tot) for i in sorted(eigen_vals,reverse=True)]
    return var_exp
tmp.rolling(window=5,center=False).apply(lambda x: test4(x))
I get this error: 0-dimensional array given. Array must be at least two-dimensional.
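A small sketch that seems to reproduce this message outside of pandas, assuming the function receives a single column's window as a 1-D array. The values in `window` are made up.

```python
import numpy as np

# One column's window, roughly as rolling apply would pass it (made-up values)
window = np.array([0.1, -0.2, 0.05, 0.3, -0.1])

cov_1d = np.cov(window.T)  # transposing a 1-D array changes nothing
print(cov_1d.shape)        # shape is (), i.e. a 0-dimensional array

message = None
try:
    np.linalg.eig(cov_1d)  # eig needs a square 2-D matrix
except np.linalg.LinAlgError as err:
    message = str(err)
print(message)
```

With only one variable in view, the "covariance matrix" collapses to a single variance, so there is nothing for the eigendecomposition to work on.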
To recap, I would like to run rolling z-scores and then a rolling PCA that outputs the explained variance at each roll. I have the rolling z-scores down, but not the explained variance.
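One way this could be approached, since rolling apply works one column at a time, is an explicit loop over 2-D windows. This is only a sketch under assumptions: the helper name `rolling_explained_variance` and the `PC1`/`PC2` column labels are hypothetical, each window is standardized on its own, and the window length is assumed to be at least the number of columns so that PCA keeps one component per column.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
tmp = pd.DataFrame(rng.standard_normal((2000, 2)) / 10000,
                   index=pd.date_range('2001-01-01', periods=2000),
                   columns=['A', 'B'])

def rolling_explained_variance(df, window):
    # Hypothetical helper; assumes window >= number of columns
    values = df.to_numpy()
    out = np.full(values.shape, np.nan)  # rows without a full window stay NaN
    for end in range(window, len(df) + 1):
        block = values[end - window:end]                # 2-D: window rows, all columns
        scaled = StandardScaler().fit_transform(block)  # z-scores within this window
        out[end - 1] = PCA().fit(scaled).explained_variance_ratio_
    names = [f'PC{i + 1}' for i in range(df.shape[1])]
    return pd.DataFrame(out, index=df.index, columns=names)

explained = rolling_explained_variance(tmp, window=5)
print(explained.tail())
```

Each row holds the explained variance ratios for the window ending on that date, largest component first. The loop fits one PCA per window, so it is slower than built-in rolling methods but keeps both columns together.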