I have a data frame with 100+ columns. cor() returns remarkably quickly, but tells me far too much, especially as most columns are not correlated. I'd like it to list just the column pairs and their correlations, ideally ordered by strength.
In case that doesn't make sense, here is an artificial example:
df = data.frame(a=1:10,b=20:11*20:11,c=runif(10),d=runif(10),e=runif(10)*1:10)
z = cor(df)
z looks like this:
           a          b           c           d          e
a  1.0000000 -0.9966867 -0.38925240 -0.35142452  0.2594220
b -0.9966867  1.0000000  0.40266637  0.35896626 -0.2859906
c -0.3892524  0.4026664  1.00000000  0.03958307  0.1781210
d -0.3514245  0.3589663  0.03958307  1.00000000 -0.3901608
e  0.2594220 -0.2859906  0.17812098 -0.39016080  1.0000000
What I'm looking for is a function that will instead tell me:
a:b -0.9966867
b:c 0.4026664
d:e -0.39016080
a:c -0.3892524
b:d 0.3589663
a:d -0.3514245
b:e -0.2859906
a:e 0.2594220
c:e 0.17812098
c:d 0.03958307
I have a crude way to get rid of some of the noise:
z[abs(z)<0.5]=0
and then scan the matrix by eye for non-zero values. But that is far inferior to the desired output above.
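For comparison, the thresholding idea can be made to list the surviving pairs directly instead of leaving them in the matrix. This is only a rough sketch: the 0.5 cutoff is the one used above, while the names idx and strong and the set.seed() call are illustrative assumptions, not part of the original example.

```r
set.seed(1)  # assumption: fixed seed so the random columns are reproducible
df <- data.frame(a = 1:10, b = 20:11 * 20:11, c = runif(10),
                 d = runif(10), e = runif(10) * 1:10)
z <- cor(df)

# Positions in the upper triangle whose absolute correlation is at least 0.5;
# arr.ind = TRUE returns a matrix with "row" and "col" columns
idx <- which(abs(z) >= 0.5 & upper.tri(z), arr.ind = TRUE)

# One row per surviving pair, labelled like "a:b"
strong <- data.frame(
  pair = paste(rownames(z)[idx[, "row"]], colnames(z)[idx[, "col"]], sep = ":"),
  cor  = z[idx]
)
print(strong)
```

This still discards every pair below the cutoff and does not sort the result, so it only partly addresses the problem.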
UPDATE:
Based on the answers received, and some trial and error, here is the solution I went with:
z[lower.tri(z,diag=TRUE)]=NA #Prepare to drop duplicates and meaningless information
z=as.data.frame(as.table(z)) #Turn into a 3-column table
z=na.omit(z) #Get rid of the junk we flagged above
z=z[order(-abs(z$Freq)),] #Sort by highest correlation (whether +ve or -ve)
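The four steps above could be wrapped in a small helper that also builds the "a:b" style labels from the desired output. This is a minimal sketch, not a definitive implementation: top_correlations() and its min_abs argument are hypothetical names introduced here, and the set.seed() call is only there to make the example reproducible.

```r
# Hypothetical helper (not part of base R): returns one row per column pair,
# sorted by absolute correlation, strongest first.
top_correlations <- function(df, min_abs = 0) {
  z <- cor(df)
  z[lower.tri(z, diag = TRUE)] <- NA           # flag duplicates and the diagonal
  out <- na.omit(as.data.frame(as.table(z)))   # long format: Var1, Var2, Freq
  out <- out[abs(out$Freq) >= min_abs, ]       # optional cutoff (assumed parameter)
  out <- out[order(-abs(out$Freq)), ]          # strongest correlations first
  data.frame(pair = paste(out$Var1, out$Var2, sep = ":"),
             cor  = out$Freq)
}

set.seed(1)  # assumption: fixed seed for reproducibility
df <- data.frame(a = 1:10, b = 20:11 * 20:11, c = runif(10),
                 d = runif(10), e = runif(10) * 1:10)
res <- top_correlations(df)
print(res)
```

Calling top_correlations(df, min_abs = 0.5) would keep only the stronger pairs, which may be more practical with 100+ columns.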