regex - Substring extraction from vector in R

Question

Welcome To Ask or Share your Answers For Others

regex - Substring extraction from vector in R

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

regex - Substring extraction from vector in R

I am trying to extract substrings from a unstructured text. For example, assume a vector of country names:

countries <- c("United States", "Israel", "Canada")

How do I go about passing this vector of character values to extract exact matches from unstructured text.

text.df <- data.frame(ID = c(1:5), 
text = c("United States is a match", "Not a match", "Not a match",
         "Israel is a match", "Canada is a match"))

In this example, the desired output would be:

ID     text
1      United States
4      Israel
5      Canada

So far I have been working with gsub by where I remove all non-matches and then eliminate then remove rows with empty values. I have also been working with str_extract from the stringr package, but haven't had success getting the arugments for the regular expression correct. Any assistance would be greatly appreciated!

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:23:28+0000

Here's an approach with data.table:

library(data.table)
##
R>  data.table(text.df)[
    sapply(countries, function(x) grep(x,text),USE.NAMES=F),
    list(ID, text = countries)]
   ID          text
1:  1 United States
2:  4        Israel
3:  5        Canada

Categories

regex - Substring extraction from vector in R

regex - Substring extraction from vector in R

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags