Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


numpy - python - how to compute correlation-matrix with nans in data-matrix

I couldn't find a function that computes a matrix of correlation coefficients for arrays containing observations for more than two variables when there are NaNs in the data. There are functions that do this for pairs of variables (or you can simply mask the arrays using ~np.isnan()). But looping over a large number of variables with these functions and computing the correlation for each pair can be very time consuming.
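For reference, the pair-by-pair approach described above could look like the following minimal sketch. The names corr_pair and nancorr_loop are made up for illustration, and each row is assumed to be a variable. It is easy to get right, but the double Python loop is what makes it slow for many variables.

```python
import numpy as np

def corr_pair(x, y):
    """Pearson correlation of two 1-D arrays, using only the
    positions where neither array is NaN."""
    keep = ~np.isnan(x) & ~np.isnan(y)
    if keep.sum() < 2:
        return np.nan          # not enough shared observations
    return np.corrcoef(x[keep], y[keep])[0, 1]

def nancorr_loop(X):
    """Correlation matrix of the rows of X, one pair at a time."""
    k = X.shape[0]
    out = np.full((k, k), np.nan)
    for i in range(k):
        for j in range(i, k):
            out[i, j] = out[j, i] = corr_pair(X[i], X[j])
    return out

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 30))
X[0, :5] = np.nan              # first variable misses five observations
print(nancorr_loop(X))
```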

So I tried it on my own and soon realized that the difficulty lies in the proper normalization of the covariance. I would be very interested in your opinions on how to do it.

Here is the code:

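import numpy as np

def nancorr(X, nanfact=False):
    # Each row of X is a variable, each column an observation.
    # Demean each row by its NaN-aware mean (this also copies X).
    X = X - np.nanmean(X, axis=1, keepdims=True)

    if nanfact:
        # Per pair: observations where not both variables are NaN, minus 1.
        mask = np.isnan(X).astype(int)
        fact = X.shape[1] - np.dot(mask, mask.T) - 1

    # NaN entries contribute zero to every sum of products.
    X[np.isnan(X)] = 0
    if nanfact:
        cov = np.dot(X, X.T) / fact
    else:
        cov = np.dot(X, X.T)

    # Normalize by the square roots of the diagonal (variance) terms.
    d = np.diag(cov)
    return cov / np.sqrt(np.multiply.outer(d, d))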

The function assumes that each row is a variable. It is basically an adapted version of numpy's corrcoef(). I believe there are three ways of doing this:

(1) For each pair of variables, you take only those observations for which neither variable is NaN. This is arguably the most accurate option, but also the most difficult one to program if you want to do the computation for more than one pair simultaneously, and it is not covered by the code above. Why, however, throw away information on the mean and variance of each variable just because the corresponding entry of another variable is NaN? Hence the two other options.

(2) We demean each variable by its nanmean, and the variance of each variable is its nanvariance. For the covariance, each observation where one variable is NaN but the other is not counts as an observation of no covariation and is therefore set to zero. The covariance is then scaled by 1/n, where n is the number of observations in which not both variables are NaN, minus 1. The two variances in the denominator of the correlation coefficient are scaled by 1/n1 and 1/n2, where n1 and n2 are the numbers of non-NaN observations of the respective variables, minus 1. This is achieved by setting nanfact=True in the function above.

(3) One may wish for the covariance and the variances to share the same factor, as they do for the correlation coefficient without NaNs. The only meaningful way to do this here (if option (1) is not feasible) is to simply drop the factor (1/n)/sqrt(1/(n1*n2)) = sqrt(n1*n2)/n. Since this number is at most one, the estimated correlation coefficients will be at least as large (in absolute value) as in (2), but will remain between -1 and 1. This is achieved by setting nanfact=False.
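The relation between options (2) and (3) can be checked numerically. The sketch below is not part of the original code: nancorr_both is a hypothetical helper that repeats the steps of nancorr, returns both variants together with the factor sqrt(n1*n2)/n, and is run on made-up random data with roughly 20% of the entries set to NaN.

```python
import numpy as np

def nancorr_both(X):
    """Return (option (2) matrix, option (3) matrix, sqrt(n1*n2)/n)."""
    X = X - np.nanmean(X, axis=1, keepdims=True)
    nan = np.isnan(X).astype(int)
    # n[i, j]: observations where not both i and j are NaN, minus 1.
    # Its diagonal holds n1, n2, ...: non-NaN observations per variable, minus 1.
    n = X.shape[1] - nan @ nan.T - 1
    X[np.isnan(X)] = 0
    S = X @ X.T                                # sums of products, NaNs as zero

    d = np.diag(S)
    corr3 = S / np.sqrt(np.outer(d, d))        # option (3), nanfact=False

    cov = S / n
    v = np.diag(cov)
    corr2 = cov / np.sqrt(np.outer(v, v))      # option (2), nanfact=True

    nd = np.diag(n)
    ratio = np.sqrt(np.outer(nd, nd)) / n      # sqrt(n1*n2)/n for every pair
    return corr2, corr3, ratio

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 50))
X[rng.random(X.shape) < 0.2] = np.nan          # illustrative missing data

corr2, corr3, ratio = nancorr_both(X)
print(np.allclose(corr2, corr3 * ratio))       # (2) equals (3) times the factor
print(ratio)
```

Because n is never smaller than n1 or n2, every entry of ratio is at most one, so option (2) shrinks the coefficients of option (3) toward zero whenever the two variables have NaNs in different places.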

I'd be very interested in your opinions on approaches (2) and (3) and especially, I would very much like to see a solution to (1) without the use of loops.


1 Reply


I think the method you are looking for is corr() from pandas, which excludes NaN values pair by pair. Note that it treats each column as a variable. An example dataframe is shown below. You can also refer to this question: "How to efficiently get the correlation matrix (with p-values) of a data frame with NaN values?"

import pandas as pd
df = pd.DataFrame({'A': [2, None, 1, -4, None, None, 3],
                   'B': [None, 1, None, None, 1, 3, None],
                   'C': [2, 1, None, 2, 2.1, 1, 0],
                   'D': [-2, 1.1, 3.2, 2, None, 1, None]})

df
    A       B       C       D
0   2       NaN     2       -2
1   NaN     1       1       1.1
2   1       NaN     NaN     3.2
3   -4      NaN     2       2
4   NaN     1       2.1     NaN
5   NaN     3       1       1
6   3       NaN     0       NaN

rho = df.corr()
rho
            A      B          C          D
A    1.000000    NaN  -0.609994  -0.441784
B         NaN    1.0  -0.500000  -1.000000
C   -0.609994   -0.5   1.000000  -0.347928
D   -0.441784   -1.0  -0.347928   1.000000
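If you would rather stay in NumPy, option (1) from the question can also be written without loops by replacing the per-pair masking with a few matrix products. The following is only a sketch under the question's convention that each row is a variable; nancorr_pairwise is a made-up name, and the one-pass formula it uses (sum of products minus product of sums over n) can lose precision when the means are large relative to the spread.

```python
import numpy as np

def nancorr_pairwise(X):
    """Correlation matrix of the rows of X using, for each pair of rows,
    only the columns where both rows are non-NaN."""
    X = np.asarray(X, dtype=float)
    valid = ~np.isnan(X)
    V = valid.astype(float)
    X0 = np.where(valid, X, 0.0)    # NaNs replaced by zero

    n = V @ V.T                     # n[i, j]: observations shared by rows i and j
    sx = X0 @ V.T                   # sx[i, j]: sum of row i over the shared observations
    sxx = (X0 ** 2) @ V.T           # sxx[i, j]: sum of squares of row i over the same
    sxy = X0 @ X0.T                 # sum of products over the shared observations

    # Pairs with fewer than two shared observations end up as NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        cov = sxy - sx * sx.T / n
        var_i = sxx - sx ** 2 / n   # variance term of row i restricted to pair (i, j)
        var_j = var_i.T             # variance term of row j restricted to pair (i, j)
        return cov / np.sqrt(var_i * var_j)

nan = np.nan
# Same values as the dataframe above, transposed so that rows are variables.
X = np.array([[2, nan, 1, -4, nan, nan, 3],       # A
              [nan, 1, nan, nan, 1, 3, nan],      # B
              [2, 1, nan, 2, 2.1, 1, 0],          # C
              [-2, 1.1, 3.2, 2, nan, 1, nan]])    # D
rho = nancorr_pairwise(X)
print(np.round(rho, 6))
```

The 1/(n-1) factors of the covariance and of the two variances cancel for each pair, which is why they do not appear. Keep in mind that a pair such as B and D shares only two observations, so its coefficient of -1 carries little information; pandas' corr() has a min_periods argument to turn such entries into NaN.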
