Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Why are sklearn TruncatedSVD's explained variance ratios not in descending order?

Why are the explained variance ratios from sklearn.decomposition.TruncatedSVD not ordered like the singular values?

My code is below:

import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
           [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
           [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
           [0,0,0,0,0,0,0,0,0,0,1,1,1,1]])
svd = TruncatedSVD(n_components=4)
svd.fit(X)  # fit on the array defined above
print(svd.explained_variance_ratio_)
print(svd.singular_values_)

and the results:

[0.17693405 0.46600983 0.21738089 0.13967523]
[3.1918354  2.39740372 1.83127499 1.30808033]

I heard that a singular value means how much the component can explain data, so I expected the explained variance ratios to follow the same order as the singular values. But the ratios are not in descending order.

Can someone explain why this happens?


1 Reply


I heard that a singular value means how much the component can explain data

This holds for PCA, but it is not exactly true for (truncated) SVD. Quoting from a relevant GitHub thread from 2014, back when an explained_variance_ratio_ attribute was not even available for TruncatedSVD (emphasis mine):

preserving the variance is not the exact objective function of truncated SVD without centering

So the singular values themselves are indeed sorted in descending order, but the same does not necessarily hold for the corresponding explained variance ratios if the data are not centered.
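
To see where the mismatch comes from, here is a minimal NumPy sketch. It assumes that a component's explained variance is the variance of the data projected onto it, divided by the total variance of the columns of X; that is my understanding of how TruncatedSVD fills the attribute, so treat the exact formula as an assumption and check it against your scikit-learn version. It uses np.linalg.svd rather than TruncatedSVD itself.

```python
import numpy as np

# X is the array from the question
X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, with s sorted in descending order
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Coordinates of each sample along each component
proj = U * s

# Assumed definition: variance of the projections over total column variance
total_var = X.var(axis=0).sum()
ratio = proj.var(axis=0) / total_var

# For uncentered data, var(proj_i) = s_i**2 / n - mean(proj_i)**2,
# so a component whose projections have a large mean keeps little variance
n = X.shape[0]
mean_sq = proj.mean(axis=0) ** 2
ratio_decomposed = (s**2 / n - mean_sq) / total_var

print(s)                 # singular values, descending
print(ratio)             # variance ratios, not necessarily descending
print(ratio_decomposed)  # same numbers, computed from s and the means
```

On the question's X, the first component has the largest singular value, but most of s_1**2 / n is taken up by the squared mean of its projections rather than by their variance, which is why its ratio ends up smaller than the second one.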

But if we center the data first, the explained variance ratios do come out sorted in descending order, in correspondence with the singular values. The example below uses StandardScaler, which centers each feature and also scales it to unit variance:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

sc = StandardScaler()
Xs = sc.fit_transform(X) # X data from the question here

svd = TruncatedSVD(n_components=4)
svd.fit(Xs)

print(svd.explained_variance_ratio_)
print(svd.singular_values_)

Result:

[4.60479851e-01 3.77856541e-01 1.61663608e-01 8.13905807e-66]
[5.07807756e+00 4.59999633e+00 3.00884730e+00 8.21430014e-17]
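
A minimal NumPy sketch of why centering restores the ordering, assuming explained variance is measured as the variance of the projected data, and standardizing by hand (subtract the column mean, divide by the column standard deviation) in place of StandardScaler. Once every column has zero mean, the projections have zero mean too, so each ratio reduces to s_i**2 / sum(s**2) and inherits the ordering of the singular values.

```python
import numpy as np

# X is the array from the question
X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]], dtype=float)

# Standardize by hand: zero mean and unit variance per column
# (population standard deviation, ddof=0)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
proj = U * s

# Two routes to the same ratios once the data are centered
ratio_from_variance = proj.var(axis=0) / Xs.var(axis=0).sum()
ratio_from_singular_values = s**2 / np.sum(s**2)

print(s)
print(ratio_from_variance)
print(ratio_from_singular_values)
```

The last singular value and ratio are effectively zero because centering four samples leaves a matrix of rank at most three.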

For the mathematical and computational differences between centered and non-centered data in PCA and SVD calculations, see "How does centering make a difference in PCA (for SVD and eigen decomposition)?"


Regarding the use of TruncatedSVD itself, here is user ogrisel again (a scikit-learn contributor), in a relevant answer to "Difference between scikit-learn implementations of PCA and TruncatedSVD":

In practice TruncatedSVD is useful on large sparse datasets which cannot be centered without making the memory usage explode.

So it is not clear why you chose TruncatedSVD here. Unless your dataset is so large that centering it causes memory issues, you should probably use PCA instead.
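
A short sketch of the PCA alternative, assuming the dense X from the question fits in memory. PCA centers the data internally, so no separate scaler is needed, and its explained variance ratios come out in descending order.

```python
import numpy as np
from sklearn.decomposition import PCA

# X is the array from the question
X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]], dtype=float)

# PCA subtracts the column means before decomposing; it does not rescale
pca = PCA(n_components=4)
pca.fit(X)

print(pca.explained_variance_ratio_)
print(pca.singular_values_)
```

Because PCA only centers and does not scale to unit variance, the numbers will differ from the StandardScaler example above, but the ordering follows the singular values in the same way.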

