Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Grouping factor levels in a data.table

I'm trying to combine factor levels in a data.table and am wondering whether there's a data.table-y way to do so.

Example:

DT = data.table(id = 1:20, ind = as.factor(sample(8, 20, replace = TRUE)))

I want to say types 1,3,8 are in group A; 2 and 4 are in group B; and 5,6,7 are in group C.

Here's what I've been doing, which has been quite slow in the full version of the problem:

DT[ind %in% c(1, 3, 8), grp := as.factor("A")]
DT[ind %in% c(2, 4), grp := as.factor("B")]
DT[ind %in% c(5, 6, 7), grp := as.factor("C")]

Another approach, suggested by this related question, would I guess translate like so:

DT[ , grp := ind]
levels(DT$grp) = c("A", "B", "A", "B", "C", "C", "C", "A")

Or perhaps (given I've got 65 underlying groups and 18 aggregated groups, this feels a little neater)

DT[ , grp := ind]
lev <- letters[1:8]  # letters is a character vector, so index with [ ]
lev[c(1, 3, 8)] <- "A"
lev[c(2, 4)] <- "B"
lev[5:7] <- "C"
levels(DT$grp) <- lev
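The relabelling above can also be done in a single `:=` call, which avoids assigning through `DT$grp`. The following is only a sketch, not the asker's code: it assumes data.table is installed, uses an arbitrary seed, and declares all eight levels up front, because a sample of 20 may miss a value and the replacement vector must not be shorter than the number of levels.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Declare every level explicitly so the factor always has 8 levels,
# even if some values happen not to be sampled.
DT <- data.table(id = 1:20,
                 ind = factor(sample(8, 20, replace = TRUE), levels = 1:8))

# Build the new labels, one per existing level ("1" ... "8").
lev <- levels(DT$ind)
lev[c(1, 3, 8)] <- "A"
lev[c(2, 4)]    <- "B"
lev[5:7]        <- "C"

# `levels<-` called as an ordinary function returns the relabelled factor;
# := then adds it to DT by reference as the new column grp.
DT[, grp := `levels<-`(ind, lev)]
```

Duplicated labels in `lev` are merged, so `grp` ends up with the three levels A, B and C while `ind` is left untouched.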

Both of these seem unwieldy; does this seem like the appropriate way to do this in data.table?

For reference, I timed a scaled-up version of this with 10,000,000 observations and some more subgroup/supergroup levels. My original approach is the slowest (running all those logical tests is costly), the second is the fastest, and the third is a close second. Of the two, I prefer the third for its readability.

(Keying DT before searching speeds things up, but it only halves the gap vis-a-vis the latter two methods)


1 Reply


Update:

I recently learned of a much simpler way to re-associate factor levels from this question and a closer reading of ?levels. No merges, correspondence table, etc. necessary, just pass a named list to levels:

levels(DT$ind) = list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)
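As a small illustration of the named-list form, here is a base R sketch on a hand-written factor (the values are made up for the example). It copies the factor first so the original codes are kept, whereas the one-liner above overwrites `ind` in place.

```r
# A small factor with all eight levels declared explicitly.
ind <- factor(c(1, 2, 3, 4, 5, 6, 7, 8, 3, 5), levels = 1:8)

# Work on a copy so the original codes stay available.
grp <- ind

# Each list name becomes a new level; each element lists the old
# levels that are merged into it.
levels(grp) <- list(A = c(1, 3, 8), B = c(2, 4), C = 5:7)

print(table(ind, grp))  # cross-tabulate old codes against new groups
```

The new levels appear in the order the names are given in the list, here A, B, C.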

Original Answer:

As suggested by @Arun we have the option of creating the correspondence as a separate data.table, then joining it to the original:

match_dt = data.table(ind = as.factor(1:12),
                      grp = as.factor(c("A", "B", "A", "B", "C", "C",
                                        "C", "A", "D", "E", "F", "D")))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]

We can also do this in (what I consider to be) the more readable fashion like so (with marginal speed costs):

# Named lev rather than levels to avoid masking base::levels()
lev <- letters[1:12]
lev[c(1, 3, 8)] <- "A"
lev[c(2, 4)] <- "B"
lev[5:7] <- "C"
lev[c(9, 12)] <- "D"
lev[10] <- "E"
lev[11] <- "F"
match_dt <- data.table(ind = as.factor(1:12),
                       grp = as.factor(lev))
setkey(DT, ind)
setkey(match_dt, ind)
DT = match_dt[DT]
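A variation on the join above, offered only as a sketch and not as part of the original answer: an update join with `on =` adds `grp` to `DT` by reference, without setting keys or reordering rows. It assumes data.table is installed and uses a small hand-written `DT` for illustration.

```r
library(data.table)

# Small illustrative table; the values are made up for the example.
DT <- data.table(id = 1:6,
                 ind = factor(c(8, 2, 5, 1, 4, 7), levels = 1:8))

# Correspondence table from the answer: one row per underlying level.
match_dt <- data.table(ind = factor(1:12),
                       grp = factor(c("A", "B", "A", "B", "C", "C",
                                      "C", "A", "D", "E", "F", "D")))

# Update join: look up each row's ind in match_dt and copy its grp
# into DT by reference. Row order and existing columns are unchanged.
DT[match_dt, on = "ind", grp := i.grp]
```

Unlike `DT = match_dt[DT]`, this leaves `DT` in its original row order and does not create a new table.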
