r - Correlation between NA columns


Posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have to write a function that takes a directory of data files and a threshold for complete cases, and calculates the correlation between sulfate and nitrate (two columns) for each file where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no files meet the threshold requirement, the function should return a numeric vector of length 0.

My code looks like this:

corr <- function(directory,threshold=0){
    a<-list.files("specdata")
    for (i in a) {
        data <- read.csv(paste(directory, "/", i, sep =""))
        x<-complete.cases(data)
        j<-sum(as.numeric(x))
        sulfate<-data[,2]
        nitrate<-data[,3]
        b<-cor(sulfate,nitrate)
    }  
    if (j>threshold) 
        return(b) 
    else
        numeric()
}

There's no error message.

If I type:

z<-corr("specdata")
head(z)
[1] NA

I don't know what the problem is, or whether the NA values in the columns have something to do with it. I think something is missing in my code. I suspect read.csv ends up giving me a single data frame when I need one data frame per file, but I don't see why the result is NA in this case (with the default threshold of 0).

However, if I introduce a bigger threshold (1000):

z<-corr("specdata",1000)
head(z)
numeric(0)

The expected output I need is

cr <- corr("specdata", 150) 
head(cr) 
[1] -0.01895754 -0.14051254 -0.04389737 -0.06815956 -0.12350667 -0.07588814


1 Reply

深蓝 · Answer 1 · 2021-10-23T18:42:30+0000

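Here is a working solution you can refer to. corr() relies on the helper complete() defined right after it, and both assume the files are named 001.csv through 332.csv: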

corr <- function(directory, threshold = 0) {
  ## 'directory' is a character vector of length 1 indicating the location of
  ## the CSV files

  ## 'threshold' is a numeric vector of length 1 indicating the number of
  ## completely observed observations (on all variables) required to compute
  ## the correlation between nitrate and sulfate; the default is 0

  ## Return a numeric vector of correlations
  df = complete(directory)
  ids = df[df["nobs"] > threshold, ]$id
  corrr = numeric()
  for (i in ids) {

    newRead = read.csv(paste(directory, "/", formatC(i, width = 3, flag = "0"), 
                             ".csv", sep = ""))
    dff = newRead[complete.cases(newRead), ]
    corrr = c(corrr, cor(dff$sulfate, dff$nitrate))
  }
  return(corrr)
}
complete <- function(directory, id = 1:332) {
  f <- function(i) {
    data = read.csv(paste(directory, "/", formatC(i, width = 3, flag = "0"), 
                          ".csv", sep = ""))
    sum(complete.cases(data))
  }
  nobs = sapply(id, f)
  return(data.frame(id, nobs))
}
cr <- corr("specdata", 150)
head(cr)
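
As for why the code in the question returns NA: cor() uses use = "everything" by default, so a single NA in either column makes the result NA. The loop also overwrites b and j on every pass, so only the last file's values survive, and list.files("specdata") ignores the directory argument. The sketch below is one possible variant, not the assignment's reference solution. It lists the files with list.files() instead of assuming IDs 1 to 332. The name corr_files and the toy data are made up for illustration.

```r
# Hypothetical variant of corr(); corr_files is an illustrative name,
# not part of the original assignment.
corr_files <- function(directory, threshold = 0) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  out <- numeric(0)
  for (f in files) {
    dat <- read.csv(f)
    ok <- complete.cases(dat)          # rows with no NA in any column
    if (sum(ok) > threshold) {
      # append instead of overwriting, so every qualifying file is kept
      out <- c(out, cor(dat$sulfate[ok], dat$nitrate[ok]))
    }
  }
  out
}

# cor() returns NA by default as soon as one value is missing
x <- c(1, 2, NA, 4)
y <- c(2, 4, 6, 9)
na_result <- cor(x, y)                        # NA
ok_result <- cor(x, y, use = "complete.obs")  # uses the 3 complete pairs

# Toy data in a temporary directory (made up, not the specdata files)
tmp <- file.path(tempdir(), "toy_specdata")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(Date = 1:5,
                     sulfate = c(1, 2, NA, 4, 5),
                     nitrate = c(2, 4, 6, NA, 10),
                     ID = 1),
          file.path(tmp, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = 1:4,
                     sulfate = c(NA, 1, 2, 3),
                     nitrate = c(1, 3, NA, 1),
                     ID = 2),
          file.path(tmp, "002.csv"), row.names = FALSE)

res0 <- corr_files(tmp)        # both files have more than 0 complete cases
res2 <- corr_files(tmp, 2)     # only 001.csv has more than 2 complete cases
res5 <- corr_files(tmp, 5)     # no file qualifies
```

With a threshold that no file meets, out is never appended to, so the function returns numeric(0), as the assignment requires.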
