I quite often come across data that is structured something like this:
employees <- list(
  list(id = 1,
       dept = "IT",
       age = 29,
       sportsteam = "softball"),
  list(id = 2,
       dept = "IT",
       age = 30,
       sportsteam = NULL),
  list(id = 3,
       dept = "IT",
       age = 29,
       sportsteam = "hockey"),
  list(id = 4,
       dept = NULL,
       age = 29,
       sportsteam = "softball"))
In many cases such lists can be tens of millions of items long, so memory use and efficiency are always concerns.
I would like to turn the list into a dataframe but if I run:
library(data.table)
employee.df <- rbindlist(employees)
I get errors because of the NULL values. My normal strategy is to use a function like:
nullToNA <- function(x) {
  x[sapply(x, is.null)] <- NA
  return(x)
}
and then:
employees <- lapply(employees, nullToNA)
employee.df <- rbindlist(employees)
which returns
id dept age sportsteam
1: 1 IT 29 softball
2: 2 IT 30 NA
3: 3 IT 29 hockey
4: 4 NA 29 softball
However, the nullToNA function is very slow when applied to 10 million cases, so a more efficient approach would be welcome.
One thing that seems to slow the process down is that is.null can only be applied to one item at a time (unlike is.na, which can scan a full list in one go).
Any advice on how to do this operation efficiently on a large dataset?
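One possible direction, offered as a sketch rather than a benchmarked answer: base R's lengths() is computed in C and returns 0 for NULL elements, so a single vectorised logical index can replace the per-element is.null() calls. The function name nullToNA2 is my own, not from the original code:

```r
# Vectorised NULL -> NA replacement: lengths() returns 0 for NULL
# elements, so one logical index replaces the element-by-element
# sapply(x, is.null) scan.
nullToNA2 <- function(x) {
  x[lengths(x) == 0] <- NA
  x
}

rec <- list(id = 2, dept = "IT", age = 30, sportsteam = NULL)
rec <- nullToNA2(rec)
# rec$sportsteam is now NA instead of NULL
```

This drops into the same pipeline as before, i.e. lapply(employees, nullToNA2) followed by rbindlist. One caveat: lengths(x) == 0 also matches zero-length vectors such as character(0), not only NULL, which may or may not be what you want.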