performance - Efficient data structure for word lookup with wildcards

Question

Welcome To Ask or Share your Answers For Others

performance - Efficient data structure for word lookup with wildcards

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

performance - Efficient data structure for word lookup with wildcards

I need to match a series of user inputed words against a large dictionary of words (to ensure the entered value exists).

So if the user entered:

"orange" it should match an entry "orange' in the dictionary.

Now the catch is that the user can also enter a wildcard or series of wildcard characters like say

"or__ge" which would also match "orange"

The key requirements are:

* this should be as fast as possible.

* use the smallest amount of memory to achieve it.

If the size of the word list was small I could use a string containing all the words and use regular expressions.

however given that the word list could contain potentially hundreds of thousands of enteries I'm assuming this wouldn't work.

So is some sort of 'tree' be the way to go for this...?

Any thoughts or suggestions on this would be totally appreciated!

Thanks in advance, Matt

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:45:29+0000

Put your word list in a DAWG (directed acyclic word graph) as described in Appel and Jacobsen's paper on the World's Fastest Scrabble Program (free copy at Columbia). For your search you will traverse this graph maintaining a set of pointers: on a letter, you make a deterministic transition to children with that letter; on a wildcard, you add all children to the set.

The efficiency will be roughly the same as Thompson's NFA interpretation for grep (they are the same algorithm). The DAWG structure is extremely space-efficient—far more so than just storing the words themselves. And it is easy to implement.

Worst-case cost will be the size of the alphabet (26?) raised to the power of the number of wildcards. But unless your query begins with N wildcards, a simple left-to-right search will work well in practice. I'd suggest forbidding a query to begin with too many wildcards, or else create multiple dawgs, e.g., dawg for mirror image, dawg for rotated left three characters, and so on.

Matching an arbitrary sequence of wildcards, e.g., ______ is always going to be expensive because there are combinatorially many solutions. The dawg will enumerate all solutions very quickly.

Categories

performance - Efficient data structure for word lookup with wildcards

performance - Efficient data structure for word lookup with wildcards

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags