Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
461 views
in Technique[技术] by (71.8m points)

r - Remove URLs from string

I have a vector of strings—myStrings—in R that look something like:

[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.

where another url is a valid http url but stackoverflow will not let me insert more than one url thats why i'm writing another url instead. I want to remove all the urls from myStrings to look like:

[1] download file from
[2] this is the link to my website
[3] go to from more info.

I've tried many functions in the stringr package but nothing works.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You can use gsub with a regular expression to match URLs,

Set up a vector:

x <- c(
    "download file from http://example.com", 
    "this is the link to my website http://example.com", 
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

Remove all the URLs from each string:

gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to from more info."          "Another url"                   
# [5] "And"   

Update: It would be best if you could post a few different URLs so we know what we're working with. But I think this regular expression will work for the URLs you mentioned in the comments:

" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

The above expression explained:

  • ? optional space
  • (f|ht) match "f" or "ht"
  • tp match "tp"
  • (s?) optionally match "s" if it's there
  • (://) match "://"
  • (.*) match every character (everything) up to
  • [.|/] a period or a forward-slash
  • (.*) then everything after that

I'm not an expert with regular expressions, but I think I explained that correctly.

Note: url shorteners are no longer allowed in SO answers, so I was forced to remove a section while making my most recent edit. See edit history for that part.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

56.9k users

...