Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
399 views
in Technique[技术] by (71.8m points)

r - combine duplicates, do not publish blanks, dplyr::distinct

I trying to using dplyr distinct to combine rows, delete duplicates, and delete blanks as well. Here is my data frame:

unique_id   school  subject  grade  sex
    1       great   Math      88    
    1       great   English   78    
    1       great   History   98    male
    2       spring  Math      65    
    2       spring  English   72    female
    2       spring  History   84    

When I run (thank you Akrun):

(r2 <- df %>%
  group_by(unique_id) %>% 
  summarise_each(funs(toString(unique(.)))))

I get:

unique_id   school  subject                     grade       sex
    1       great   Math, English, History      88,78,98     , male 
    2       spring  English, English, History   65,72,84     , female

I don't want blanks to be included or commas in the last variable, sex. Instead, I'd like it to look as follows:

unique_id   school  subject                     grade       sex
    1       great   Math, English, History      88,78,98     male   
    2       spring  English, English, History   65,72,84     female

Any tried adding NA on the import, then removing it after condensing and that didn't work. Any ideas how to condense rows, but only keep the value in the row and ignore blanks? Thank you.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Perhaps the reason that you are having problems is that you are using empty strings when you should be using NAs. This is what I would assume is the idiomatic code.

df <- data.frame(unique_id = c(rep(1,3),rep(2,3)),
                school = c(rep('great',3),rep('spring',3)),
                           subject = rep(c("Math", "English", "History"),2),
                           grade = c(88,78,98,65,72,84),
                           sex = c(NA,NA, "male", NA, "female", NA))

r2 <- df %>%
  group_by(unique_id) %>% 
  summarise_each(funs(toString(unique(.))))

which returns

# A tibble: 2 x 5
  unique_id school                subject      grade        sex
      <dbl>  <chr>                  <chr>      <chr>      <chr>
1         1  great Math, English, History 88, 78, 98   NA, male
2         2 spring Math, English, History 65, 72, 84 NA, female

You can always

 r2$sex <- sapply(stringr::str_split(r2$sex, ", "),"[",2)

afterwards if you really want to remove those NAs, but I see them as informative.

You can write your own function to supply to summarize_each, which will allow you to take care of NAs in any column. Note, that you only need to do this because unique, rightfully so, does not have an na.rm argument.

rm_na_unique <- function(vec){
  unique(vec[!is.na(vec)])
}

r2 <- df %>%
       group_by(unique_id) %>% 
       summarise_each(funs(toString(rm_na_unique(.))))

Gives you the same result

# A tibble: 2 x 5
  unique_id school                subject      grade    sex
      <dbl>  <chr>                  <chr>      <chr>  <chr>
1         1  great Math, English, History 88, 78, 98   male
2         2 spring Math, English, History 65, 72, 84 female

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...