Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
561 views
in Technique[技术] by (71.8m points)

scikit learn - how does sklearn do Linear regression when p >n?

it's known that when the number of variables (p) is larger than the number of samples (n) the least square estimator is not defined.

In sklearn I receive this values:

In [30]: lm = LinearRegression().fit(xx,y_train)

In [31]: lm.coef_
Out[31]: 
array([[ 0.20092363, -0.14378298, -0.33504391, ..., -0.40695124,
         0.08619906, -0.08108713]])

In [32]: xx.shape
Out[32]: (1097, 3419)

Call [30] should return an error. How does sklearn work when p>n like in this case?

EDIT: It seems that the matrix is filled with some values

if n > m:
        # need to extend b matrix as it will be filled with
        # a larger solution matrix
        if len(b1.shape) == 2:
            b2 = np.zeros((n, nrhs), dtype=gelss.dtype)
            b2[:m,:] = b1
        else:
            b2 = np.zeros(n, dtype=gelss.dtype)
            b2[:m] = b1
        b1 = b2
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

When the linear system is underdetermined, then the sklearn.linear_model.LinearRegression finds the minimum L2 norm solution, i.e.

argmin_w l2_norm(w) subject to Xw = y

This is always well defined and obtainable by applying the pseudoinverse of X to y, i.e.

w = np.linalg.pinv(X).dot(y)

The specific implementation of scipy.linalg.lstsq, which is used by LinearRegression uses get_lapack_funcs(('gelss',), ... which is precisely a solver that finds the minimum norm solution via singular value decomposition (provided by LAPACK).

Check out this example

import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)

from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
coef1 = lr.fit(X, y).coef_
coef2 = np.linalg.pinv(X).dot(y)

print(coef1)
print(coef2)

And you will see that coef1 == coef2. (Note that fit_intercept=False is specified in the constructor of the sklearn estimator, because otherwise it would subtract the mean of each feature before fitting the model, yielding different coefficients)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...