Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Creating an "other" field

Right now, I have the following data.frame, which was created by original.df %>% group_by(Category) %>% tally() %>% arrange(desc(n)).

DF <- structure(list(Category = c("E", "K", "M", "L", "I", "A", 
"S", "G", "N", "Q"), n = c(163051, 127133, 106680, 64868, 49701, 
47387, 47096, 45601, 40056, 36882)), .Names = c("Category", 
"n"), row.names = c(NA, 10L), class = c("tbl_df", "tbl", "data.frame"
))

         Category      n
1               E 163051
2               K 127133
3               M 106680
4               L  64868
5               I  49701
6               A  47387
7               S  47096
8               G  45601
9               N  40056
10              Q  36882

I want to collapse the bottom-ranked Categories (by n) into a single "Other" row, i.e.:

        Category      n
1              E 163051
2              K 127133
3              M 106680
4              L  64868
5              I  49701
6          Other 217022

Right now, I am doing

rbind(filter(DF, rank(rev(n)) <= 5), 
  summarise(filter(DF, rank(rev(n)) > 5), Category = "Other", n = sum(n)))

which collapses all categories not in the top 5 into the Other category.
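Note that rank(rev(n)) gives the intended result here only because DF is already sorted by descending n. A minimal sketch of an order-independent variant follows, assuming dplyr's min_rank() and bind_rows() are acceptable substitutes (neither appears in the code above); the shuffled copy of the data is only there to illustrate the point.

```r
library(dplyr)

# Same data as above, as a plain data.frame.
DF <- data.frame(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701, 47387, 47096, 45601, 40056, 36882),
  stringsAsFactors = FALSE
)

# Deliberately shuffled copy, to show that row order no longer matters.
shuffled <- DF[c(3, 9, 1, 7, 5, 10, 2, 8, 4, 6), ]

# min_rank(desc(n)) gives rank 1 to the largest n, whatever the row order.
top  <- shuffled %>% filter(min_rank(desc(n)) <= 5) %>% arrange(desc(n))
rest <- shuffled %>% filter(min_rank(desc(n)) > 5)

# Append the collapsed remainder as a single "Other" row.
result <- bind_rows(top, summarise(rest, Category = "Other", n = sum(n)))
```

If several categories tie at the cutoff, min_rank() assigns them the same rank, so more than five rows may be kept.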

But I'm curious whether there's a better way in dplyr or some other existing package. By "better" I mean more succinct/readable. I'm also interested in methods with cleverer or more flexible ways to choose Other.


1 Reply


This is another approach, assuming that each category (of the top 5 at least) only occurs once:

DF %>%
  arrange(desc(n)) %>%       # optional here: the input is already sorted by descending n, per the question
  mutate(Category = ifelse(1:n() > 5, "Other", Category)) %>%
  group_by(Category) %>%
  summarize(n = sum(n))

#  Category      n
#1        E 163051
#2        I  49701
#3        K 127133
#4        L  64868
#5        M 106680
#6    Other 217022

Edit:

I just noticed that my output is no longer ordered by decreasing n. After running the code again, I found that the order is kept up to and including group_by(Category), but once summarize runs, the order is gone (or rather, the result seems to be ordered by Category). Is that supposed to happen?
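As far as I can tell, this is expected: summarize() returns one row per group, sorted by the grouping variable rather than by the incoming row order. One possible workaround, sketched here under the assumption that turning Category into a factor is acceptable, is to encode the desired order in the factor levels (the names sorted, lvls and out are only illustrative):

```r
library(dplyr)

# Same data as in the question, as a plain data.frame.
DF <- data.frame(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701, 47387, 47096, 45601, 40056, 36882),
  stringsAsFactors = FALSE
)

sorted <- arrange(DF, desc(n))

# Desired row order: the top 5 categories by n, then "Other" last.
lvls <- c(head(sorted$Category, 5), "Other")

out <- sorted %>%
  mutate(Category = factor(ifelse(row_number() > 5, "Other", Category),
                           levels = lvls)) %>%
  group_by(Category) %>%      # groups follow the factor levels, not the alphabet
  summarise(n = sum(n))
```

Because the groups follow the level order, no helper index column or final arrange() is needed; the trade-off is that Category comes back as a factor rather than a character column.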

Here are three more ways:

m <- 5        # number of top results to show in the final table (excl. "Other")
n <- m + 1    # index of the first row that goes into "Other"

# preserves the order (or rather: re-establishes it via an index)
df <- DF %>%
  arrange(desc(n)) %>%    # this could be skipped if the data is already ordered
  mutate(idx = 1:n(), Category = ifelse(idx > m, "Other", Category)) %>%
  group_by(Category) %>%
  summarize(n = sum(n), idx = first(idx)) %>%
  arrange(idx) %>%
  select(-idx)

# doesn't preserve the order (same result as in the first dplyr solution, ordered by Category)
df <- DF[order(DF$n, decreasing = TRUE), ]     # if the data is already ordered, df <- DF is enough
df[n:nrow(df), 1] <- "Other"
df <- aggregate(n ~ Category, data = df, FUN = sum)

# preserves the order (without an extra index)
df <- DF[order(DF$n, decreasing = TRUE), ]     # if the data is already ordered, df <- DF is enough
df[n:nrow(df), 1] <- "Other"
df[n, 2] <- sum(df$n[df$Category == "Other"])  # put the "Other" total into row n
df <- df[1:n, ]                                # drop the remaining "Other" rows
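
Since the question also asks for something more flexible, the last approach could be wrapped in a small function. This is only a sketch: lump_other is a hypothetical helper name (not from any package), and it assumes the input has a character column Category and a numeric column n, as in the question.

```r
# Same data as in the question, as a plain data.frame.
DF <- data.frame(
  Category = c("E", "K", "M", "L", "I", "A", "S", "G", "N", "Q"),
  n = c(163051, 127133, 106680, 64868, 49701, 47387, 47096, 45601, 40056, 36882),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the m largest categories and sum the rest
# into a single row labelled `other`, appended at the bottom.
lump_other <- function(data, m = 5, other = "Other") {
  data <- data[order(data$n, decreasing = TRUE), ]
  if (nrow(data) <= m) return(data)            # nothing to collapse
  is_top <- seq_len(nrow(data)) <= m
  rest_total <- sum(data$n[!is_top])
  rbind(
    data[is_top, ],
    data.frame(Category = other, n = rest_total, stringsAsFactors = FALSE)
  )
}

top5 <- lump_other(DF, m = 5)
top3 <- lump_other(DF, m = 3, other = "Everything else")
```

Sorting inside the function means the caller does not have to pre-sort, and m and the label can be changed per call.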
