Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Why is rbindlist "better" than rbind?

I am going through the documentation of data.table, and I also noticed from some of the conversations here on SO that rbindlist is supposed to be better than rbind.

I would like to know why rbindlist is better than rbind, and in which scenarios rbindlist really excels over rbind.

Is there any advantage in terms of memory utilization?



1 Reply


rbindlist is an optimized version of do.call(rbind, list(...)), which is known for being slow when it dispatches to rbind.data.frame.


Where does it really excel

Some questions that show where rbindlist shines are:

Fast vectorized merge of list of data.frames by row

Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

These have benchmarks that show how fast it can be.
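A rough way to see the difference for yourself is a small timing comparison. This is only a sketch: the list `pieces` and its contents are made up for illustration, and the timings depend entirely on your machine and package versions, so no numbers are claimed here.

```r
library(data.table)

# Hypothetical test data: 1,000 small data.frames with identical columns.
set.seed(1)
pieces <- lapply(1:1000, function(i) data.frame(id = i, x = rnorm(5)))

# Base R: do.call(rbind, ...) dispatches to rbind.data.frame.
t_base <- system.time(res_base <- do.call(rbind, pieces))

# data.table: a single call on the whole list.
t_dt <- system.time(res_dt <- rbindlist(pieces))

# Both approaches stack the same rows in the same order.
stopifnot(nrow(res_base) == nrow(res_dt),
          isTRUE(all.equal(res_base$x, res_dt$x)))

# Elapsed seconds for each approach.
print(c(base = t_base[["elapsed"]], data.table = t_dt[["elapsed"]]))
```

Increasing the number of list elements should make the gap easier to see than increasing the size of each piece.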


rbind.data.frame is slow, for a reason

rbind.data.frame does a lot of checking and matches columns by name, i.e. it accounts for the fact that columns may be in different orders and lines them up by name. rbindlist doesn't do this kind of checking by default and joins by position.

For example:

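# rbind.data.frame matches the columns by name
do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
##   a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2

library(data.table)

# rbindlist binds by position, so the second data.frame's b column lands under a
# (newer data.table versions may warn that the column names are in a different order)
rbindlist(list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
##    a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3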
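If you do want name matching with rbindlist, data.table versions that provide the `use.names` argument let you ask for it explicitly. A minimal sketch, assuming such a version is installed; `d1`, `d2` and `by_name` are just illustrative names:

```r
library(data.table)

d1 <- data.frame(a = 1:2, b = 2:3)
d2 <- data.frame(b = 1:2, a = 2:3)

# use.names = TRUE asks rbindlist to match columns by name, not by position.
by_name <- rbindlist(list(d1, d2), use.names = TRUE)

by_name$a  # 1 2 2 3
by_name$b  # 2 3 1 2
```

This gives the same column contents as the do.call(rbind, ...) result, at the cost of the extra name matching.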

Some other limitations of rbindlist

It used to struggle to deal with factors, due to a bug that has since been fixed:

rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)

It has problems with duplicate column names:

See Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)


rbind.data.frame rownames can be frustrating

rbindlist can handle lists, data.frames and data.tables, and will return a data.table without rownames.

You can get in a muddle of rownames using do.call(rbind, list(...)); see

How to avoid renaming of rows when using rbind inside do.call?
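The rowname behaviour is easy to reproduce with a named list. The sketch below uses a made-up list `pieces`; the `idcol` argument is only available in data.table versions that support it, and the column name "source" is an arbitrary choice.

```r
library(data.table)

# Hypothetical named list of data.frames.
pieces <- list(first = data.frame(x = 1:2), second = data.frame(x = 3:4))

# rbind.data.frame builds row names from the list names.
res_base <- do.call(rbind, pieces)
rownames(res_base)  # "first.1" "first.2" "second.1" "second.2"

# rbindlist returns a data.table, which only has the default row names.
res_dt <- rbindlist(pieces)
rownames(res_dt)    # "1" "2" "3" "4"

# If the list names matter, idcol keeps them as an ordinary column instead.
res_id <- rbindlist(pieces, idcol = "source")
res_id$source       # "first" "first" "second" "second"
```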


Memory efficiency

In terms of memory, rbindlist is implemented in C, so it is memory efficient; it uses setattr to set attributes by reference.

rbind.data.frame is implemented in R; it does lots of assigning and uses attr<- (and class<- and rownames<-), all of which will (internally) create copies of the created data.frame.
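To illustrate what "by reference" means here, the sketch below sets an attribute with data.table's setattr and checks that the object's memory address is unchanged. The attribute name "note" is made up. Whether the base replacement functions copy in a given situation depends on the R version and on how many references the object has, so the sketch only demonstrates the setattr side.

```r
library(data.table)

x <- c(1, 2, 3)
before <- address(x)       # memory address of the vector

# setattr() modifies the object in place and returns it invisibly.
setattr(x, "note", "set by reference")

after <- address(x)
identical(before, after)   # TRUE: the vector was not copied
attr(x, "note")            # "set by reference"
```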

