Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
152 views
in Technique[技术] by (71.8m points)

r - Proper/fastest way to reshape a data.table

I have a data table in R:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
      x y  v
 [1,] 1 A 12
 [2,] 1 B 62
 [3,] 1 A 60
 [4,] 1 B 61
 [5,] 2 A 83
 [6,] 2 B 97
 [7,] 2 A  1
 [8,] 2 B 22
 [9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49

I can easily sum the variable v by the groups in the data.table:

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
     x  y SUM
[1,] 1 A  72
[2,] 1 B 123
[3,] 2 A  84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B  96

However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape:

out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
     x SUM.A SUM.B
[1,] 1    72   123
[2,] 2    84   119
[3,] 3   162    96

Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The data.table package implements faster melt/dcast functions (in C). It also has additional features by allowing to melt and cast multiple columns. Please see the new Efficient reshaping using data.tables on Github.

melt/dcast functions for data.table have been available since v1.9.0 and the features include:

  • There is no need to load reshape2 package prior to casting. But if you want it loaded for other operations, please load it before loading data.table.

  • dcast is also a S3 generic. No more dcast.data.table(). Just use dcast().

  • melt:

    • is capable of melting on columns of type 'list'.

    • gains variable.factor and value.factor which by default are TRUE and FALSE respectively for compatibility with reshape2. This allows for directly controlling the output type of variable and value columns (as factors or not).

    • melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.

    • NEW: melt can accept a list for measure.vars and columns specified in each element of the list will be combined together. This is faciliated further through the use of patterns(). See vignette or ?melt.

  • dcast:

    • accepts multiple fun.aggregate and multiple value.var. See vignette or ?dcast.

    • use rowid() function directly in formula to generate an id-column, which is sometimes required to identify the rows uniquely. See ?dcast.

  • Old benchmarks:

    • melt : 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
    • dcast : 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.

Reminder of Cologne (Dec 2013) presentation slide 32 : Why not submit a dcast pull request to reshape2?


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...