string matching on large file in R

I have a dataset of text entries with IDs. I have a list of match vectors, each consisting of words.

I want to count the number of times words from each match vector occur in each entry.

There are about ten match vectors with 10-100 words in each.

The dataset has about 10^8 entries in it. The entries range in length from 1 to 600 words, with a median of about 50.

It's a lot of data, basically.

I have a solution for a small (say, 10^6 entries) dataset. But it scales horribly.

Here's an approximate reprex.


library(tidyverse)
library(magrittr)

# match vectors

fruit = c('apple', 'banana', 'cherry')
vegetable = c('artichoke', 'bean', 'carrot')

food_list = list(fruit, vegetable)

# we make up some data to match

dummy = tibble(
    id = 1:10 
  ) %>% 
  rowwise() %>% 
  mutate(
    entry = paste(
      paste(
        sample(fruit, 
               sample(0:5, 1),
               replace = T
        ),
        collapse = ' '),
      paste(
        sample(vegetable, 
               sample(0:5, 1),
               replace = T
        ),
        collapse = ' '),
      paste(
        sample(c('filler1', 'filler2', 'filler3'), 
               sample(0:5, 1),
               replace = T
        ),
        collapse = ' '),
      sep = ' '
      )
  )

The text entries are rows in a large table. I can map through the rows, check each text entry against each match vector, and count the total number of matches per vector, where the match count of the entry "apple banana banana chair" against the match vector c("apple", "banana", "cherry") is 3. I can store these integers in a list column.
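As a quick sanity check of that counting rule (this is what getMatches() below does for each row):

library(stringr)

# "apple banana banana chair" against c("apple", "banana", "cherry")
# -> apple: 1, banana: 2, cherry: 0, so the total should be 3
sum(str_count("apple banana banana chair", c("apple", "banana", "cherry")))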

# map through data, count overlaps

getMatches = function(dummy){
  dummy %>%
    mutate(
      counts =
        list(
          map_dbl(food_list,
                ~ str_count(entry, .) %>%
                  sum()
          )
        )
    )
}

res = getMatches(dummy)

This slows down fast.

# larger sets

dummy10 = dummy %>% 
  slice(rep(1:n(), each = 10))
dummy100 = dummy %>% 
  slice(rep(1:n(), each = 100))
dummy1000 = dummy %>% 
  slice(rep(1:n(), each = 1000))
dummy10000 = dummy %>% 
  slice(rep(1:n(), each = 10000))

dummies = list(dummy, dummy10, dummy100, dummy1000, dummy10000)

# times

getTimes = function(dummy){
  tictoc::tic('get time')
  res = getMatches(dummy)
  tictoc::toc()
  res
}

map(dummies, ~ getTimes(.))
# get time: 0.012 sec elapsed
# get time: 0.016 sec elapsed
# get time: 0.104 sec elapsed
# get time: 0.926 sec elapsed
# get time: 8.569 sec elapsed

What can I do? I can obviously parallelise this, or replace dplyr with data.table, or use awk, but I feel like there are fundamental problems with the approach.

Or maybe not, and matching a lot of text against a lot of text just takes a very long time?
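For concreteness, one possible rework, as an untested sketch: drop the rowwise() / per-row map_dbl() and instead collapse each match vector into a single alternation regex, counted over the whole entry column in one vectorised str_count() call per vector. (The \\b word boundaries are an added assumption; getMatches() above also counts partial matches like "apples", and alternation counts non-overlapping matches, so totals can differ for overlapping word lists.)

# continues from the reprex above (food_list and dummy already defined)
library(tidyverse)

# one combined pattern per match vector, e.g. "\b(apple|banana|cherry)\b"
patterns = map_chr(food_list, ~ str_c('\\b(', str_c(.x, collapse = '|'), ')\\b'))

# count matches for every entry in one vectorised call per match vector
count_cols = map(patterns, ~ str_count(dummy$entry, .x)) %>%
  set_names(c('fruit_n', 'vegetable_n')) %>%
  as_tibble()

res_vec = bind_cols(ungroup(dummy), count_cols)

On the dummy data this should give the same totals as getMatches(), just spread across two integer columns instead of a list column.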


1 Reply

Waiting for an expert to answer this.
