r - Spreading a two column data frame with tidyr

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I have a data frame that looks like this:

  a b
1 x 8
2 x 6
3 y 3
4 y 4
5 z 5
6 z 6

and I want to turn it into this:

  x y z
1 8 3 5
2 6 4 6

But calling

library(tidyr)
df <- data.frame(
    a = c("x", "x", "y", "y", "z", "z"),
    b = c(8, 6, 3, 4, 5, 6)
)
df %>% spread(a, b)

returns

   x  y  z
1  8 NA NA
2  6 NA NA
3 NA  3 NA
4 NA  4 NA
5 NA NA  5
6 NA NA  6

What am I doing wrong?

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:11:51+0000

While I'm aware you're after a tidyr solution, base R has one for this case:

unstack(df, b~a)

It's also a little bit faster:

Unit: microseconds

                expr     min      lq     mean  median       uq      max neval
 df %>% spread(a, b) 657.699 679.508 717.7725 690.484 724.9795 1648.381   100
  unstack(df, b ~ a) 309.891 335.264 349.4812 341.9635 351.6565 639.738   100
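
As for the tidyr call itself: spread() has no way to tell which values of b belong on the same output row, because df has no column that identifies a row. Depending on the tidyr version, that gives either the NA-padded result shown in the question or, as far as I know in newer releases, an error about duplicate row identifiers. A minimal sketch of one possible fix, assuming a helper column (called id here, a name chosen only for illustration) that numbers the rows within each level of a:

```r
library(tidyr)

df <- data.frame(
    a = c("x", "x", "y", "y", "z", "z"),
    b = c(8, 6, 3, 4, 5, 6)
)

# Hypothetical helper column: position of each row within its level of a
df$id <- ave(seq_along(df$a), df$a, FUN = seq_along)

# With id present, spread() can line up the first x, y and z values, and so on
wide <- spread(df, a, b)
wide$id <- NULL  # drop the helper column again

print(wide)
```

In current tidyr, spread() is superseded by pivot_wider(); with the same id column in place, the equivalent call would be pivot_wider(df, names_from = a, values_from = b).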

By popular demand, here is the same comparison on something bigger.

I haven't included the data.table solution because I'm not sure whether its modify-by-reference behaviour would be a problem for microbenchmark.

library(microbenchmark)
library(tidyr)
library(magrittr)

nlevels <- 3
# Ensure that all levels have the same number of elements
nrow <- 1e6 - 1e6 %% nlevels
df <- data.frame(a = sample(rep(c("x", "y", "z"), length.out = nrow)),
                 b = sample.int(9, nrow, replace = TRUE))

microbenchmark(
    df %>% spread(a, b),
    unstack(df, b ~ a),
    data.frame(split(df$b, df$a)),
    do.call(cbind, split(df$b, df$a))
)

Even on 1 million rows, unstack is faster than spread. Notably, the two split-based solutions are faster still.

Unit: milliseconds
                              expr       min        lq      mean    median       uq       max neval
               df %>% spread(a, b) 366.24426 414.46913 450.78504 453.75258 486.1113 542.03722   100
                unstack(df, b ~ a)  47.07663  51.17663  61.24411  53.05315  56.1114 102.71562   100
     data.frame(split(df$b, df$a))  19.44173  19.74379  22.28060  20.18726  22.1372  67.53844   100
 do.call(cbind, split(df$b, df$a))  26.99798  27.41594  31.27944  27.93225  31.2565  79.93624   100
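
To make the two split-based expressions from the benchmark concrete, here is a small sketch of what they return on the six-row example from the question. The variable names groups, wide_df and wide_mat are only illustrative. Both forms assume that every level of a has the same number of values, which the benchmark setup above ensures.

```r
df <- data.frame(
    a = c("x", "x", "y", "y", "z", "z"),
    b = c(8, 6, 3, 4, 5, 6)
)

# split() groups the values of b by the levels of a into a named list
groups <- split(df$b, df$a)

# Each list element becomes one column of a data frame
wide_df <- data.frame(groups)

# cbind() binds the same vectors column-wise into a numeric matrix
wide_mat <- do.call(cbind, groups)

print(wide_df)
print(wide_mat)
```

Note that the do.call(cbind, ...) variant yields a matrix rather than a data frame, so it is only a drop-in replacement when a matrix is acceptable downstream.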
