Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

dataframe - Extract p values and r values for all pairwise variables

I have multiple variables for multiple countries over multiple years. I would like to generate a dataframe containing both an r value (correlation coefficient) and a p value for each pair of variables. I'm somewhat close: I have a minimum working example and an idea of what the end product should look like, but I'm having difficulty implementing it. If anyone could help, that would be most appreciated.

Please note, I would like to do this more manually rather than with packages like Hmisc, as that has created a number of other issues. I had a look around for similar solutions as well, but haven't had much luck.

# Code to generate minimum working example (country year pairs).  

library(tidyverse)  # also attaches dplyr
 
# Function to generate minimum working example data 

simulateCountryData = function(N=200, NEACH = 20, SEED=100){
        # Seed the RNG so the simulated data are reproducible
        set.seed(SEED)

        variableOne<-rnorm(N,sample(1:100, NEACH),0.5)
        variableOne[variableOne<0]<-0

        variableTwo<-rnorm(N,sample(1:100, NEACH),0.5)
        variableTwo[variableTwo<0]<-0
        
        variableThree<-rnorm(N,sample(1:100, NEACH),0.5)
        variableThree[variableThree<0]<-0
        
        geocodeNum<-factor(rep(seq(1,N/NEACH),each=NEACH))
        
        year<-rep(seq(2000,2000+NEACH-1,1),N/NEACH)
        
        # Putting it all together
        AllData<-data.frame(geocodeNum,
                            year,
                            variableOne,
                            variableTwo,
                            variableThree)
        
        return(AllData)
}

 
# This runs the function and generates the data 
mySimData = simulateCountryData()

I have a reasonable idea of how to get correlations (both p values and r values) between 2 manually selected variables, but am having some trouble implementing it on the entire dataset and on a country level (rather than all at once).

# Example p value
corrP = cor.test(mySimData$variableOne, mySimData$variableTwo)$p.value
# Example r value
corrEst = cor(mySimData$variableOne, mySimData$variableTwo)

Finally, the end result should look something like this:

myVariables = colnames(mySimData[3:ncol(mySimData)])
myMatrix = expand.grid(myVariables, myVariables)

# I'm having trouble getting the actual r values and p values into the dataframe;
# the random numbers below are only placeholders
myMatrix = as.data.frame(myMatrix)
myMatrix$Pval = runif(9,0.01,1)
myMatrix$Rval = runif(9,0.2,1)
myMatrix

Thanks again :)



1 Reply


This will compute r and p for all the unique pairs.

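# names of the variables to correlate (every column after geocodeNum and year)
myVariables <- colnames(mySimData)[3:ncol(mySimData)]
# matrix of unique pairs coded as numeric
mx_combos <- combn(seq_along(myVariables), 2)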
# list of unique pairs coded as numeric
ls_combos <- split(mx_combos, rep(1:ncol(mx_combos), each = nrow(mx_combos)))
# for each pair in the list, create a 1 x 4 dataframe
ls_rows <- lapply(ls_combos, function(p) {
  # lookup names of variables
  v1 <- myVariables[p[1]]
  v2 <- myVariables[p[2]]
  # perform the cor.test()
  htest <- cor.test(mySimData[[v1]], mySimData[[v2]])
  # record pertinent info in a dataframe
  data.frame(Var1 = v1, 
             Var2 = v2, 
             Pval = htest$p.value, 
             Rval = unname(htest$estimate))
  })
# row bind the list of dataframes
dplyr::bind_rows(ls_rows)
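
The question also asks for the correlations at country level, which the code above does not cover (it pools all countries together). One way this might be done, as a rough sketch only: wrap the pairwise step in a helper, split the data by country, and apply the helper to each piece. The names pairwise_cor and toy are made up for this illustration, and toy is a small stand-in dataset with the same column layout as mySimData, not the original data.

```r
# Hypothetical helper: r and p for every unique pair of columns in `vars`
pairwise_cor <- function(df, vars) {
  # list of length-2 character vectors, one per unique pair
  combos <- combn(vars, 2, simplify = FALSE)
  rows <- lapply(combos, function(p) {
    htest <- cor.test(df[[p[1]]], df[[p[2]]])
    data.frame(Var1 = p[1],
               Var2 = p[2],
               Pval = htest$p.value,
               Rval = unname(htest$estimate))
  })
  do.call(rbind, rows)
}

# Stand-in data (assumption): 3 countries x 20 years, 3 variables
set.seed(100)
toy <- data.frame(
  geocodeNum    = factor(rep(1:3, each = 20)),
  year          = rep(2000:2019, times = 3),
  variableOne   = rnorm(60),
  variableTwo   = rnorm(60),
  variableThree = rnorm(60)
)
vars <- c("variableOne", "variableTwo", "variableThree")

# One data frame per country, then the helper on each of them
groups <- split(toy, toy$geocodeNum)
by_country <- lapply(names(groups), function(g) {
  data.frame(geocodeNum = g, pairwise_cor(groups[[g]], vars))
})

# Stack into one long table: one row per country and variable pair
result <- do.call(rbind, by_country)
result
```

With the real data, replacing toy with mySimData should give one row per country and variable pair. Note that cor.test() needs at least a few complete observations per country, and that running many tests like this may call for a multiple-comparison adjustment such as p.adjust().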
