I have a large dataset in the following form with text strings extracted from a corpus:
Category  Group  Text_Strings
1         A      c(string1, string2, string3)
1         A      c(string1, string3)
1         B      character(0)
1         B      c(string1)
1         B      c(string3)
2         A      character(0)
2         A      character(0)
2         B      c(string1, string3)
3         A      c(string1, string2, string3)
3         A      character(0)
3         A      c(string1)
3         B      character(0)
3         B      c(string1, string2, string3)
...where A and B have string1 and string3 in common in Category 1; none in common in Category 2; and all three in common in Category 3.
I'd like to obtain a count of the number of strings that match between Groups A and B for each Category. String matches can be in any order; e.g. c(string1, string2) evaluated against c(string2, string1) should be counted as two matches. Also, matches should only be counted between the unique strings in each group within a category; e.g. c(string1, string2), c(string1) vs. c(string2, string1) should still only be two matches. For example:
Category  Group  Text_Strings
4         A      c(string1, string2, string3)
4         A      c(string1)
4         B      c(string1)
4         B      c(string1)
... would yield only one match even though string1 is repeated.
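To make the counting rule concrete, here is a minimal sketch of the Category 4 case. It assumes the strings for each group have already been pooled into plain character vectors; the names `a`, `b` and `n_matches` are just illustrative.

```r
# Category 4 from the example above, pooled by group (illustrative names)
a <- c("string1", "string2", "string3", "string1")  # all Group A strings
b <- c("string1", "string1")                        # all Group B strings

# intersect() treats its arguments as sets, so repeats and order are ignored
n_matches <- length(intersect(a, b))
print(n_matches)

# Order does not matter either: these two vectors share two strings
n_swapped <- length(intersect(c("string1", "string2"), c("string2", "string1")))
print(n_swapped)
```

Here `n_matches` comes out as 1 and `n_swapped` as 2, which is the behaviour described above.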
My final output should look like this:
Category  Matches
1         2
2         0
3         3
4         1
I did quite a bit of research but I wasn't able to figure out an answer on my own. It occurs to me that I might be able to subset the dataframe by Group, somehow aggregate/concatenate strings over Categories, then use lapply() and intersect()... something like
for(i in 1:nrow(data)[1]) {
  data$matches[i] <- sum(intersect(subset(data, Group=="A")$Text_Strings[i],
                                   subset(data, Group=="B")$Text_Strings[i]))
}
Of course this is missing steps and doesn't work, but am I on the right track? Thanks for any help!
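For reference, the subset / pool / intersect() idea could be sketched in base R roughly as below. This is only a sketch under some assumptions: Text_Strings is taken to be a list column of character vectors (not raw text like "c(...)"), the toy data just mirrors the examples above, and `count_matches`, `by_cat` and `result` are made-up names.

```r
# Toy data mirroring the examples; Text_Strings is assumed to be a list column
data <- data.frame(
  Category = rep(1:4, times = c(5, 3, 5, 4)),
  Group = c("A", "A", "B", "B", "B",
            "A", "A", "B",
            "A", "A", "A", "B", "B",
            "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)
data$Text_Strings <- list(
  c("string1", "string2", "string3"), c("string1", "string3"),
  character(0), "string1", "string3",
  character(0), character(0), c("string1", "string3"),
  c("string1", "string2", "string3"), character(0), "string1",
  character(0), c("string1", "string2", "string3"),
  c("string1", "string2", "string3"), "string1", "string1", "string1"
)

# Pool all strings per group within one category, then count the overlap
count_matches <- function(d) {
  a <- unlist(d$Text_Strings[d$Group == "A"])
  b <- unlist(d$Text_Strings[d$Group == "B"])
  length(intersect(a, b))  # set intersection: unique strings, any order
}

# One data frame per category, one count per data frame
by_cat <- split(data, data$Category)
result <- data.frame(
  Category = as.integer(names(by_cat)),
  Matches = unname(vapply(by_cat, count_matches, integer(1)))
)
print(result)
```

On the toy data this gives Matches of 2, 0, 3 and 1 for Categories 1 to 4. The difference from the loop above is that strings are pooled per Category and Group before intersecting, and the result is measured with length() rather than sum().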
UPDATE: jeremycg's solution was extremely helpful, but my data was so messy that it wouldn't accept parse(). Thanks to another user in a different thread, I got around this problem by splitting rows based on comma separators rather than trying to unnest directly:
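library(tidyverse)
library(purrrlyr) # dmap_at() lives in purrrlyr in current releases
x %>%
  separate_rows(Text_Strings, sep = ",") %>% # split on commas
  dmap_at("Text_Strings", ~ gsub("^c\\(\"|\"\\)$|\"", "", .x)) # strip c(" and ") and stray quotes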
This produced the same unnested data but was much cleaner.
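From that long format, the per-Category counts could then be obtained along these lines. This is a hedged sketch, not necessarily what jeremycg's answer did: `long` is a hypothetical stand-in for the unnested data, with one trimmed string per row and empty rows left as the literal text "character(0)", and it assumes there are exactly two groups.

```r
library(dplyr)

# Hypothetical long-format data: one row per (Category, Group, string)
long <- data.frame(
  Category = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
  Group = c("A", "A", "A", "A", "A", "B", "B", "B", "A", "B", "B"),
  Text_Strings = c("string1", "string2", "string3", "string1", "string3",
                   "character(0)", "string1", "string3",
                   "character(0)", "string1", "string3"),
  stringsAsFactors = FALSE
)

matches <- long %>%
  filter(Text_Strings != "character(0)") %>%              # drop empty placeholders
  distinct(Category, Group, Text_Strings) %>%             # unique strings per group
  count(Category, Text_Strings, name = "n_groups") %>%    # groups containing each string
  group_by(Category) %>%
  summarise(Matches = sum(n_groups == 2))                 # strings seen in both A and B
print(matches)
```

For this toy input, Category 1 gets 2 matches and Category 2 gets 0. One caveat: a Category whose rows are all empty would drop out of the result entirely rather than show a 0.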