Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

text mining - How to split merged/glued words with no delimiter using R

I'm scraping the keywords from this article page with rvest in R, using the code below:

#install.packages("xml2") # required for rvest
library("rvest") # for web scraping
library("dplyr") # for data management

# Start by reading the web page to be scraped
page <- read_html("https://www.sciencedirect.com/science/article/pii/S1877042810004568")
keyW <- page %>% html_nodes("div.Keywords.u-font-serif") %>% html_text() %>% paste(collapse = ",")

And it gave me:

> keyW    
[1] "KeywordsPhysics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

After removing the word "Keywords" and anything before it from the string using this line of code:

keyW <- gsub(".*Keywords","", keyW)

The new keyW is:

[1] "Physics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

However, my desired output is this list:

[1] "Physics curriculum" "Turkish education system" "finnish education system" "PISA" "physics achievement"

How should I tackle this? I think this boils down to:

  1. how to properly scrape the keywords from the website
  2. how to properly split the string

Thanks

Question from: https://stackoverflow.com/questions/65948543/how-to-split-merged-glued-words-with-no-delimiter-using-r


1 Reply

You get the expected output directly if you select the span tags inside the keywords div, which return one keyword each:

library(rvest)
page %>% html_nodes("div.Keywords span") %>% html_text()

#[1] "Physics curriculum"       "Turkish education system" "finnish education system"
#[4] "PISA"                     "physics achievement"    
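
To see why the selector matters without hitting the network, here is a minimal offline sketch. The HTML string is a hypothetical stand-in for the keyword block (the real page markup may differ); it only assumes that each keyword sits in its own span inside the div, as the output above suggests.

```r
library(rvest)

# Hypothetical stand-in for the page's keyword block; the real markup may differ.
html <- paste0(
  '<div class="Keywords u-font-serif"><h2>Keywords</h2>',
  '<div><span>Physics curriculum</span></div>',
  '<div><span>Turkish education system</span></div>',
  '<div><span>finnish education system</span></div>',
  '<div><span>PISA</span></div>',
  '<div><span>physics achievement</span></div>',
  '</div>'
)
page <- read_html(html)

# Text of the whole container: the text of every descendant is concatenated,
# so the keywords come back glued together.
glued <- page %>% html_nodes("div.Keywords") %>% html_text()

# Text of each span: one vector element per keyword.
keywords <- page %>% html_nodes("div.Keywords span") %>% html_text()

print(glued)
print(keywords)
```

Calling html_text() on the container returns a single string, while calling it on the spans returns a character vector with one entry per keyword, so no splitting is needed afterwards.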

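As for the second point in the question (splitting the already glued string), a common fallback is to split wherever a lowercase letter is directly followed by an uppercase one. This is only a heuristic sketch, not part of the answer above, and on this particular string it recovers the boundaries only partly:

```r
keyW <- "Physics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

# Mark every lowercase-to-uppercase boundary with "|", then split on that marker.
# (Splitting via an inserted marker avoids relying on zero-width matches in strsplit.)
marked <- gsub("([a-z])([A-Z])", "\\1|\\2", keyW, perl = TRUE)
parts <- strsplit(marked, "|", fixed = TRUE)[[1]]

print(parts)
# [1] "Physics curriculum"
# [2] "Turkish education systemfinnish education system"
# [3] "PISAphysics achievement"
```

The heuristic misses "systemfinnish" (no case change) and "PISAphysics" (uppercase followed by lowercase), which is why extracting each span separately is the more reliable route here.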