Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - Using RecordLinkage to add a column with a number for each person

I'd like to do what I think is a very simple operation: add a column containing a number for each person to a dataset with a list of (potentially) duplicate names. I think I am close. The code below looks at a dataset of names, does pairwise comparisons, and appends a column indicating whether each pair is a likely match. Now I want to go one step further: instead of dropping duplicates, I want to assign each person a unique identifier, so that every record of the same person gets the same number.

Example:

Peter
Peter
Peter
Connor
Matt

would become

Peter -- 1
Peter -- 1
Peter -- 1
Connor -- 2
Matt -- 3

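library(RecordLinkage)
data(RLdata10000)
# Compare record pairs within the same block (records that agree on column 5)
rpairs <- compare.dedup(RLdata10000, blockfld = 5)
# Compute a matching weight for every pair, then classify with threshold 0.7
p <- epiWeights(rpairs)
classify <- epiClassify(p, 0.7)
summary(classify)
# Append the predicted match status to the table of compared pairs
match <- classify$prediction
results <- cbind(classify$pairs, match)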


1 Reply


Here is a small rewrite that adds the IDs only after the comparison, so the weights and the classifier do not have to be tuned with the IDs included:

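library(RecordLinkage)
library(dplyr)

df_names <- data.frame(Name=c("Peter","Peter","Peter","Connor","Matt"))

# Compare all record pairs, weight them, classify with threshold 0.3,
# and keep the pairs classified as links (one row per pair)
df_names %>% compare.dedup() %>%
             epiWeights() %>%
             epiClassify(0.3) %>%
             getPairs(show = "links", single.rows = TRUE) -> matches

# Start with the row number as ID; for each linked record keep only the link
# to its lowest-numbered partner and take over that partner's row number
left_join(mutate(df_names,ID = 1:nrow(df_names)), 
          select(matches,id1,id2) %>% arrange(id1) %>% filter(!duplicated(id2)), 
          by=c("ID"="id2")) %>%
    mutate(ID = ifelse(is.na(id1), ID, id1) ) %>%
    select(-id1)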
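
The join above reuses row numbers as IDs, so the numbers are unique per person but not necessarily consecutive. If consecutive numbers such as 1, 2, 3 are wanted, one possible approach is to merge linked rows into groups and then renumber the groups in order of first appearance. The sketch below uses base R only; `assign_person_ids` is a hypothetical helper written for illustration, not a RecordLinkage function, and it assumes the links are given as two vectors of row numbers.

```r
# Hypothetical helper (not part of RecordLinkage): turn pairs of linked row
# numbers into consecutive group numbers, following links transitively.
assign_person_ids <- function(n, id1, id2) {
  # Coerce defensively in case the pair IDs arrive as character or factor
  id1 <- as.integer(as.character(id1))
  id2 <- as.integer(as.character(id2))

  parent <- seq_len(n)               # every row starts as its own group
  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  # Merge the groups of each linked pair; the lower row number becomes the root
  for (k in seq_along(id1)) {
    r1 <- find_root(id1[k])
    r2 <- find_root(id2[k])
    if (r1 != r2) parent[max(r1, r2)] <- min(r1, r2)
  }

  roots <- vapply(seq_len(n), find_root, integer(1))
  # Renumber the groups 1, 2, 3, ... in order of first appearance
  match(roots, unique(roots))
}

# Rows 1, 2 and 3 are linked to each other; rows 4 and 5 have no links
assign_person_ids(5, id1 = c(1, 1, 2), id2 = c(2, 3, 3))
# [1] 1 1 1 2 3
```

Assuming `matches` has the `id1` and `id2` columns used in the join above, the result could be attached with something like `df_names$ID <- assign_person_ids(nrow(df_names), matches$id1, matches$id2)`.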
