Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
237 views
in Technique[技术] by (71.8m points)

r - Matching Data Tables by five columns to change a value in another column

I have Ubuntu 16.04 and run R in terminal. I am working with big data tables, one has 75 millions rows and 11 columns (dt1) and another has 7 million rows and 7 columns (dt2). All values are numeric. Both tables have 'id' column'. I need to find all rows in the first one which have the same values for five columns as five columns in the second one, and change in the first data table 'id' value for these rows to the one in the second data table. In both data tables the compared columns have the same name, let us say that they are V1, V2, V3, V4 and V5. I've converted second data table to data frame format, so I can use its 'id' as index . I've tried it for 1000 first rows and it took 40 minutes.

for (i in 1:1000) {
    dt1[(V1==dt2[i,V1] & V2==dt2[i,V2] &
         V3==dt2[i,V3] & V4==dt2[i,V4] &
         V5==dt2[i,V5]), id:=i]
}

I'm going to parallelize it, but due to memory constrictions I can use only 2 or 3 cores. Clearly it won't be sufficient. Are there quick and efficient ways to do it on my home comp? If to do it on AWS, what kind of tricks are useful there? In particular, how many cores may I use there simultaneously?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

In R it is always preferable to avoid loops wherever possible, as they are usually much slower than alternative vectorized solutions.

This operation can be done with a data.table join. Basically, when you run

dt1[dt2];

you are performing a right-join between the two data.tables. The preset key columns of dt1 determine which columns to join on. If dt1 has no preset key, the operation fails. But you can specify the on argument to manually select the key columns on-the-fly:

key <- paste0('V',1:5);
dt1[dt2,on=key];

(The alternative of course is to preset a key, using either setkey() or setkeyv().)

The above operation will actually just return a merged table containing data from both dt1 and dt2, which is not what you want. But we can make use of the j argument of the data.table indexing function and the := in-place assignment syntax to assign the id column of dt2 to the id column of dt1. Because we have a name conflict, we must use i.id to reference the id column of dt2, while the unmodified name id still refers to the id column of dt1. This is simply the mechanism provided by data.table for disambiguating conflicting names. Hence, you're looking for:

dt1[dt2,on=key,id:=i.id];

Here's an example that uses only two key columns and just a few rows of data (for simplicity). I also generated the keys to include some non-matching rows, just to demonstrate that the non-matching rows will have their ids left untouched by the operation.

set.seed(1L);
dt1 <- data.table(id=1:12,expand.grid(V1=1:3,V2=1:4),blah1=rnorm(12L));
dt2 <- data.table(id=13:18,expand.grid(V1=1:2,V2=1:3),blah2=rnorm(6L));
dt1;
##     id V1 V2      blah1
##  1:  1  1  1 -0.6264538
##  2:  2  2  1  0.1836433
##  3:  3  3  1 -0.8356286
##  4:  4  1  2  1.5952808
##  5:  5  2  2  0.3295078
##  6:  6  3  2 -0.8204684
##  7:  7  1  3  0.4874291
##  8:  8  2  3  0.7383247
##  9:  9  3  3  0.5757814
## 10: 10  1  4 -0.3053884
## 11: 11  2  4  1.5117812
## 12: 12  3  4  0.3898432
dt2;
##    id V1 V2       blah2
## 1: 13  1  1 -0.62124058
## 2: 14  2  1 -2.21469989
## 3: 15  1  2  1.12493092
## 4: 16  2  2 -0.04493361
## 5: 17  1  3 -0.01619026
## 6: 18  2  3  0.94383621
key <- paste0('V',1:2);
dt1[dt2,on=key,id:=i.id];
dt1;
##     id V1 V2      blah1
##  1: 13  1  1 -0.6264538
##  2: 14  2  1  0.1836433
##  3:  3  3  1 -0.8356286
##  4: 15  1  2  1.5952808
##  5: 16  2  2  0.3295078
##  6:  6  3  2 -0.8204684
##  7: 17  1  3  0.4874291
##  8: 18  2  3  0.7383247
##  9:  9  3  3  0.5757814
## 10: 10  1  4 -0.3053884
## 11: 11  2  4  1.5117812
## 12: 12  3  4  0.3898432

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...