Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Comparing string matches between groups by category

I have a large dataset in the following form with text strings extracted from a corpus:

Category      Group         Text_Strings 
1             A             c(string1, string2, string3)
1             A             c(string1, string3)
1             B             character(0)
1             B             c(string1)
1             B             c(string3)

2             A             character(0)
2             A             character(0)
2             B             c(string1, string3)

3             A             c(string1, string2, string3)
3             A             character(0)
3             A             c(string1)
3             B             character(0)
3             B             c(string1, string2, string3)

...where A and B have string1 and string3 in common in Category 1; none in common in Category 2; and all three in common in Category 3.

I'd like to obtain a count of the number of strings that match between Groups A and B for each Category. String matches can be in any order; e.g. c(string1, string2) evaluated against c(string2, string1) should be counted as two matches. Also, only the unique strings within each category should be matched; e.g. c(string1, string2), c(string1) vs. c(string2, string1) should still give only two matches. For example:

Category      Group         Text_Strings 
4             A             c(string1, string2, string3)
4             A             c(string1)
4             B             c(string1)
4             B             c(string1)

... would yield only one match even though string1 is repeated.

My final output should look like this:

Category     Matches
1            2
2            0 
3            3
4            1

I did quite a bit of research but I wasn't able to figure out an answer on my own. It occurs to me that I might be able to subset the dataframe by Group, somehow aggregate/concatenate strings over Categories, then use lapply() and intersect()... something like

for(i in 1:nrow(data)[1]) {
    data$matches[i] <- sum(intersect(subset(data, Group=="A")$Text_Strings[i], 
                                     subset(data, Group=="B")$Text_Strings[i])) 
}

Of course this is missing steps and doesn't work, but am I on the right track? Thanks for any help!
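For what it's worth, the idea sketched above (subset by Group, pool the strings within each Category, then intersect()) can be written in base R roughly as follows. This is only a minimal sketch: it assumes Text_Strings is a list-column of character vectors, and the toy data frame and the helper name count_matches are made up here for illustration.

```r
# Toy data mirroring Categories 1-4 from the question (hypothetical reconstruction).
data <- data.frame(
  Category = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4),
  Group    = c("A", "A", "B", "B", "B", "A", "A", "B",
               "A", "A", "A", "B", "B", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)
# Text_Strings is stored as a list-column, one character vector per row.
data$Text_Strings <- list(
  c("string1", "string2", "string3"), c("string1", "string3"),
  character(0), "string1", "string3",
  character(0), character(0), c("string1", "string3"),
  c("string1", "string2", "string3"), character(0), "string1",
  character(0), c("string1", "string2", "string3"),
  c("string1", "string2", "string3"), "string1", "string1", "string1"
)

# Hypothetical helper: pool the unique strings of each group within one
# Category, then count how many appear in both groups.
count_matches <- function(d) {
  a <- unique(unlist(d$Text_Strings[d$Group == "A"]))
  b <- unique(unlist(d$Text_Strings[d$Group == "B"]))
  length(intersect(a, b))
}

# Apply the helper to each Category's rows.
matches <- vapply(split(data, data$Category), count_matches, integer(1))
result <- data.frame(Category = as.numeric(names(matches)),
                     Matches  = unname(matches))
print(result)
```

Because the strings are pooled and de-duplicated per group before intersect() is applied, neither the order of the strings nor repeats across rows affect the count.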

UPDATE: jeremycg's solution was extremely helpful, but my data was so messy that parse() would not accept it. Thanks to another user in a different thread, I got around this problem by splitting rows on comma separators rather than trying to unnest directly:

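library(tidyverse)
x %>%
     separate_rows(Text_Strings, sep = ",") %>% # split on commas
     # strip the leading c(, the trailing ) and any double quotes, then trim spaces;
     # dmap_at() has since moved from purrr to purrrlyr, so mutate() is used here
     mutate(Text_Strings = trimws(gsub('^c\\(|\\)$|"', "", Text_Strings)))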

This produced the same unnested data but was much cleaner.


1 Reply


You can use dplyr and tidyr:

library(dplyr)
library(tidyr)
x %>% unnest(Text_Strings) %>% #spread out the nested column (current tidyr wants it named explicitly)
      distinct() %>% #remove dupes
      group_by(Category) %>% #by Category
      summarise(out = sum(Text_Strings[Group == 'A'] %in% Text_Strings[Group == 'B'])) #count A strings also found in B

giving:

Source: local data frame [3 x 2]

  Category   out
     (int) (int)
1        1     2
2        2     0
3        3     3
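The output above stops at Category 3, so here is a small reproducible check of the same pipeline on Categories 2 and 4 from the question. It is a sketch under assumptions: x is taken to be a tibble with a list-column Text_Strings, the name res is made up here, and the column is passed to unnest() explicitly.

```r
library(dplyr)
library(tidyr)

# Hypothetical reconstruction of Categories 2 and 4 as a tibble with a list-column.
x <- tibble(
  Category = c(2, 2, 2, 4, 4, 4, 4),
  Group    = c("A", "A", "B", "A", "A", "B", "B"),
  Text_Strings = list(
    character(0), character(0), c("string1", "string3"),
    c("string1", "string2", "string3"), "string1", "string1", "string1"
  )
)

res <- x %>%
  unnest(Text_Strings) %>%  # one row per string; character(0) rows are dropped
  distinct() %>%            # keep each Category/Group/string combination once
  group_by(Category) %>%
  summarise(out = sum(Text_Strings[Group == "A"] %in% Text_Strings[Group == "B"]))

print(res)
```

One thing to watch: unnest() drops rows whose element is character(0) by default, so a Category in which every row is empty would be missing from the result rather than reported as 0.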

Your actual data is pretty messed up - you should try to fix whatever is producing it so that it outputs a 'long format' instead. Here's a clunky fix:

x$listcites = gsub('\\n', '', x$listcites) #remove newlines
x$listcites = gsub('"', "'", x$listcites, fixed = TRUE) #convert double quotes to single quotes
single = grepl('^[^c]', x$listcites) #rows holding a bare string rather than c(...)
#note: a bare value that itself starts with 'c' is not caught by this test
x$listcites[single] = paste0("c('", x$listcites[single], "')") #wrap single values in the same c('...') format
x$listcites = lapply(x$listcites, function(s) eval(parse(text = s))) #eval to vectors; lapply always gives a list-column
x %>% unnest(listcites) %>%
      distinct() %>%
      group_by(case_num) %>%
      summarise(out = sum(listcites[type == 'claimant'] %in% listcites[type == 'court'])) #sum the overlap
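If eval(parse()) chokes on the messy strings, another option is to pull the quoted pieces out with a regular expression instead of evaluating them. The following is only a sketch: the raw vector and the helper name parse_cites are made up for illustration, and it assumes each value is either a bare string, the literal text character(0), or a c(...) call whose elements are wrapped in double quotes.

```r
# Hypothetical examples of the messy formats assumed above.
raw <- c('c("cite one", "cite two")',
         "cite three",
         'c("cite one",\n"cite four")',
         "character(0)")

# Hypothetical helper: return the strings held in one raw value
# without evaluating it as R code.
parse_cites <- function(s) {
  s <- trimws(s)
  if (s == "" || s == "character(0)") return(character(0))
  # Collect every double-quoted piece, e.g. "cite one"
  quoted <- regmatches(s, gregexpr('"[^"]*"', s))[[1]]
  if (length(quoted) == 0) return(s)   # bare single value, keep as is
  gsub('"', "", quoted, fixed = TRUE)  # drop the surrounding quotes
}

parsed <- lapply(raw, parse_cites)
str(parsed)
```

The resulting list could then be assigned back as the list-column (for example x$listcites = parsed) before the unnest() step. Strings that themselves contain double quotes would need more careful handling than this.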

...