Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
543 views
in Technique[技术] by (71.8m points)

r - Extract URLs with regex into a new data frame column

I want to use a regex to extract all URLs from text in a dataframe, into a new column. I have some older code that I have used to extract keywords, so I'm looking to adapt the code for a regex. I want to save a regex as a string variable and apply here:

data$ContentURL <- apply(sapply(regex, grepl, data$Content, fixed=FALSE), 1, function(x) paste(selection[x], collapse=','))

It seems that fixed=FALSE should tell grepl that its a regular expression, but R doesn't like how I am trying to save the regex as:

regex <- "http.*?1-\d+,\d+"

My data is organized in a data frame like this:

data <- read.table(text='"Content"     "date"   
 1     "a house a home https://www.foo.com"     "12/31/2013"
 2     "cabin ideas https://www.example.com in the woods"     "5/4/2013"
 3     "motel is a hotel"   "1/4/2013"', header=TRUE)

And would hopefully look like:

                                           Content       date              ContentURL
1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
3                                 motel is a hotel   1/4/2013                        
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Hadleyverse solution (stringr package) with a decent URL pattern:

library(stringr)

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

data$ContentURL <- str_extract(data$Content, url_pattern)

data

##                                            Content       date              ContentURL
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>

You can use str_extract_all if there are multiples in Content, but that will involve some extra processing on your end afterwards.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...