Given a list ["one", "two", "three"], how can I determine whether each word exists in a specified string?
The word list is pretty short (fewer than 20 words in my case), but the number of strings to be searched is huge (400,000 strings per run).
My current implementation uses re to look for matches, but I'm not sure it's the best way.
import re

word_list = ["one", "two", "three"]
# Match a listed word only when a non-word character sits on both sides.
regex_string = r"(?<=\W)(%s)(?=\W)" % "|".join(word_list)
finder = re.compile(regex_string)
string_to_be_searched = "one two three"
# Pad with spaces so words at the very start or end still have a neighbor.
results = finder.findall(" %s " % string_to_be_searched)
result_set = set(results)
for word in word_list:
    if word in result_set:
        print("%s in string" % word)
Problems with my solution:
- It searches to the end of the string even when all the words have already appeared in the first half.
- To work around a limitation of the lookaround assertions (I don't know how to express "the character before the current match should be a non-word character, or the start of the string"), I added an extra space before and after the string to be searched.
- Are there other performance issues introduced by the lookaround assertions?
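One way the first two points might be addressed is sketched below. It assumes every entry in the list starts and ends with a word character, so that the \b word-boundary assertion (which also matches at the start and end of the string) can replace the lookarounds and the padding spaces. The function names words_found and words_found_early_exit are made up for illustration, and this is not benchmarked.

```python
import re

word_list = ["one", "two", "three"]

# \b matches between a word and a non-word character, and also at the
# start or end of the string, so the input needs no padding.
pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, word_list)))

def words_found(text):
    """Return the set of listed words that occur as whole words in text."""
    return set(pattern.findall(text))

def words_found_early_exit(text):
    """Same result, but stop scanning once every listed word has been seen."""
    remaining = set(word_list)
    found = set()
    for match in pattern.finditer(text):
        found.add(match.group())
        remaining.discard(match.group())
        if not remaining:
            break  # nothing left to look for, skip the rest of the string
    return found
```

The early exit only pays off when all the words tend to appear early in long strings; otherwise the plain findall version is simpler.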
Possible simpler implementations:
- Just loop through the word list and test if word in string_to_be_searched. But this cannot deal with "threesome" when you are looking for "three".
- Use one regular expression search per word. I'm still not sure about the performance, or about the cost of searching the string multiple times.
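For comparison, the two options above can be sketched side by side, together with a third variant that is not in the list: splitting the string on whitespace and intersecting with the word set. The sample text and the variable names are illustrative assumptions, and the split variant will miss words that are glued to punctuation such as "one,".

```python
import re

word_list = ["one", "two", "three"]
text = "a threesome of one"

# Option 1: plain substring test. Fast, but "three" matches inside "threesome".
substring_hits = [w for w in word_list if w in text]

# Option 2: one compiled pattern per word. search() stops at the first
# match for that word, but the string may be scanned once per word.
patterns = {w: re.compile(r"\b%s\b" % re.escape(w)) for w in word_list}
regex_hits = [w for w, p in patterns.items() if p.search(text)]

# Extra variant (assumption, not from the list above): tokenize once on
# whitespace, then rely on set intersection.
token_hits = set(word_list).intersection(text.split())

print(substring_hits)  # ['one', 'three']
print(regex_hits)      # ['one']
print(token_hits)      # {'one'}
```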
UPDATE:
I've accepted Aaron Hall's answer (https://stackoverflow.com/a/21718896/683321) because, according to Peter Gibson's benchmark (https://stackoverflow.com/a/21742190/683321), this simple version has the best performance. If you are interested in this problem, read all the answers to get a fuller picture.
Actually, I forgot to mention another constraint in my original problem: a word can be a phrase, for example word_list = ["one day", "second day"]. Maybe I should ask another question.
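As a rough sketch of how the phrase constraint might be handled, a single alternation with \b boundaries still works as long as each phrase starts and ends with a word character. The names phrase_list and phrases_found are hypothetical, and this has not been measured against the other approaches.

```python
import re

phrase_list = ["one day", "second day"]

# Put longer phrases first so a shorter alternative that is a prefix of a
# longer one does not win the alternation.
alternatives = sorted(phrase_list, key=len, reverse=True)
phrase_pattern = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(p) for p in alternatives)
)

def phrases_found(text):
    """Return the set of listed phrases that occur as whole phrases in text."""
    return set(phrase_pattern.findall(text))
```

Note that findall returns non-overlapping matches, so two phrases that share words in the text (say "one day" and "day two" in "one day two") would not both be reported; a per-phrase search avoids that at the cost of scanning the string more than once.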