I'm trying to find out whether there is a faster approach than the vectorized gsub function in R. I have the following data frame with some "sentences" (sent$words), and I also have words that should be removed from these sentences (stored in the wordsForRemoving variable).
sent <- data.frame(words =
                     c("just right size and i love this notebook", "benefits great laptop",
                       "wouldnt bad notebook", "very good quality", "bad orgtop but great",
                       "great improvement for that bad product but overall is not good",
                       "notebook is not good but i love batterytop"),
                   user = c(1, 2, 3, 4, 5, 6, 7),
                   stringsAsFactors = FALSE)
wordsForRemoving <- c("great", "improvement", "love", "great improvement", "very good", "good",
                      "right", "very", "benefits", "extra", "benefit", "top", "extraordinarily",
                      "extraordinary", "super", "benefits super", "good", "benefits great",
                      "wouldnt bad")
Then I create a "big data" simulation to measure the processing time...
# replicate the 7 sentences into 1,000,000 columns
df.expanded <- as.data.frame(replicate(1000000, sent$words))

library(zoo)
# repeat the rows of sent 1,000,000 times (7,000,000 rows in total)
sent <- coredata(sent)[rep(seq(nrow(sent)), 1000000), ]
rownames(sent) <- NULL
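For what it's worth, the same expansion can also be done with plain base R indexing instead of zoo::coredata (just a sketch of the equivalent row replication, assuming the original 7-row sent from above; sent_big is only an illustrative name):

# equivalent row replication without zoo, starting from the original 7-row sent
sent_big <- sent[rep(seq_len(nrow(sent)), 1000000), ]
rownames(sent_big) <- NULL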
Using the following gsub approach to remove the words in wordsForRemoving from sent$words takes 72.87 seconds. I know this is not a good simulation, but in reality I use a dictionary with more than 3,000 words for 300,000 sentences, and the overall processing takes over 1.5 hours.
pattern <- paste0("\\b(?:", paste(wordsForRemoving, collapse = "|"), ")\\b ?")
system.time(res <- gsub(pattern, "", sent$words, perl = TRUE))
#   user  system elapsed
#  72.87    0.05   73.79
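I also wondered whether a different regex engine would help. Below is an untested sketch using stringi::stri_replace_all_regex with the same combined pattern (stringi's ICU engine is often reported to be faster than base gsub on long character vectors, but I have not benchmarked it here):

library(stringi)
# same combined alternation pattern, applied with stringi's ICU regex engine
pattern <- paste0("\\b(?:", paste(wordsForRemoving, collapse = "|"), ")\\b ?")
res_stringi <- stri_replace_all_regex(sent$words, pattern, "")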
Could anyone please help me write a faster approach for this task? Any help or advice is greatly appreciated. Thanks a lot in advance.