Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
871 views
in Technique[技术] by (71.8m points)

python - numpy corrcoef - compute correlation matrix while ignoring missing data

I am trying to compute a correlation matrix of several values. These values include some 'nan' values. I'm using numpy.corrcoef. For element(i,j) of the output correlation matrix I'd like to have the correlation calculated using all values that exist for both variable i and variable j.

This is what I have now:

In[20]: df_counties = pd.read_sql("SELECT Median_Age, Rpercent_2008, overall_LS, population_density FROM countyVotingSM2", db_eng)
In[21]: np.corrcoef(df_counties, rowvar = False)
Out[21]: 
array([[ 1.        ,         nan,         nan, -0.10998411],
       [        nan,         nan,         nan,         nan],
       [        nan,         nan,         nan,         nan],
       [-0.10998411,         nan,         nan,  1.        ]])

Too many nan's :(

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

One of the main features of pandas is being NaN friendly. To calculate correlation matrix, simply call df_counties.corr(). Below is an example to demonstrate df.corr() is NaN tolerant whereas np.corrcoef is not.

import pandas as pd
import numpy as np

# data
# ==============================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100,5), columns=list('ABCDE'))
df[df < 0] = np.nan
df

         A       B       C       D       E
0   1.7641  0.4002  0.9787  2.2409  1.8676
1      NaN  0.9501     NaN     NaN  0.4106
2   0.1440  1.4543  0.7610  0.1217  0.4439
3   0.3337  1.4941     NaN  0.3131     NaN
4      NaN  0.6536  0.8644     NaN  2.2698
5      NaN  0.0458     NaN  1.5328  1.4694
6   0.1549  0.3782     NaN     NaN     NaN
7   0.1563  1.2303  1.2024     NaN     NaN
8      NaN     NaN     NaN  1.9508     NaN
9      NaN     NaN  0.7775     NaN     NaN
..     ...     ...     ...     ...     ...
90     NaN  0.8202  0.4631  0.2791  0.3389
91  2.0210     NaN     NaN  0.1993     NaN
92     NaN     NaN     NaN  0.1813     NaN
93  2.4125     NaN     NaN     NaN  0.2515
94     NaN     NaN     NaN     NaN  1.7389
95  0.9944  1.3191     NaN  1.1286  0.4960
96  0.7714  1.0294     NaN     NaN  0.8626
97     NaN  1.5133  0.5531     NaN  0.2205
98     NaN     NaN  1.1003  1.2980  2.6962
99     NaN     NaN     NaN     NaN     NaN

[100 rows x 5 columns]

# calculations
# ================================
df.corr()

        A       B       C       D       E
A  1.0000  0.2718  0.2678  0.2822  0.1016
B  0.2718  1.0000 -0.0692  0.1736 -0.1432
C  0.2678 -0.0692  1.0000 -0.3392  0.0012
D  0.2822  0.1736 -0.3392  1.0000  0.1562
E  0.1016 -0.1432  0.0012  0.1562  1.0000


np.corrcoef(df, rowvar=False)

array([[ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan]])

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...