regex - Faster approach than gsub in r

Question

Welcome To Ask or Share your Answers For Others

regex - Faster approach than gsub in r

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

regex - Faster approach than gsub in r

I'm trying to find out, if there is faster approach than gsub vectorized function in R. I have following data frame with some "sentences" (sent$words) and then I have words for removing from these sentences (stored in wordsForRemoving variable).

sent <- data.frame(words = 
                     c("just right size and i love this notebook", "benefits great laptop",
                       "wouldnt bad notebook", "very good quality", "bad orgtop but great",
                       "great improvement for that bad product but overall is not good", 
                       "notebook is not good but i love batterytop"), 
                   user = c(1,2,3,4,5,6,7),
                   stringsAsFactors=F)

wordsForRemoving <- c("great","improvement","love","great improvement","very good","good",
                      "right", "very","benefits", "extra","benefit","top","extraordinarily",
                      "extraordinary", "super","benefits super","good","benefits great",
                      "wouldnt bad")

Then I'm gonna create "big data" simulation for time consumption computing...

df.expanded <- as.data.frame(replicate(1000000,sent$words))
library(zoo)
sent <- coredata(sent)[rep(seq(nrow(sent)),1000000),]
rownames(sent) <- NULL

Using of following gsub approach for removing words (wordsForRemoving) from sent$words takes 72.87 sec. I know, this is not good simulation but in real I'm using word dictionary with more than 3.000 words for 300.000 sentences and overall processing takes over 1.5 hours.

pattern <- paste0("\b(?:", paste(wordsForRemoving, collapse = "|"), ")\b ?")
res <- gsub(pattern, "", sent$words)

#  user  system elapsed 
# 72.87    0.05   73.79

Please, could anyone help me to write faster approach for my task. Any help or advice is very appreciated. Thanks a lot in forward.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:19:19+0000

As mentioned by Jason, stringi is good option for you..

Following is the performance of stringi

system.time(res <- gsub(pattern, "", sent$words))
   user  system elapsed 
 66.229   0.000  66.199 

library(stringi)
system.time(stri_replace_all_regex(sent$words, pattern, ""))
   user  system elapsed 
 21.246   0.320  21.552

Update (Thanks Arun)

system.time(res <- gsub(pattern, "", sent$words, perl = TRUE))
   user  system elapsed 
 12.290   0.000  12.281

Categories

regex - Faster approach than gsub in r

regex - Faster approach than gsub in r

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags