Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
668 views
in Technique[技术] by (71.8m points)

algorithm - Best data structure for crossword puzzle search

I have a large database for solving crossword puzzles, consisting of a word and a description. My application allows searching for words of a specific length and characters on specific positions (this is done the hard way ... go through all words and check each). Plus a search by description (if necessary)

For instance find word _ _ A _ _ B (6 letter word, third character A and last B)

I would like to index the words in such way that the searching would be really fast. My first idea was to use a balanced tree structure, any other suggestion?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Okay, I am going to propose something weird, but coming from C++ I have been using Boost for a long time and I have come to see the MultiIndex library.

The idea of this library is to create one collection, but have many different ways to query it. It could model, in fact, a database.

So, let's put our words in a table, and put the necessary indexes in place:

word                     |length|c0|c1|c2| ... |c26|
-------------------------|------|--|--|--| ... |---|
Singapour                |9     |S |i |n | ... |0  |

Now the query will look like:

Select word From table Where length=9 And c2='n' And c8='u';

Easy enough isn't it ?

For maximum efficiency, the table should be partitioned on the length, and the indexes (one per cX column) should be local to the partition.

For an in-memory solution you would have one container per length, containing as many indexes as the length, each index being a hash table pointing to a sorted list (easier merge)

Here is a python description:

class Dictionary:
  def __init__(self, length):
    self.length = length
    self.words = set([])
    self.indexes = collections.defaultdict(set)

  def add(self, word):
    if len(word) != self.length:
      raise RuntimeException(word + ' is not ' + `self.length` + ' characters long')

    if word in self.words:
      raise RuntimeException(word + ' is already in the dictionary')

    self.words.add(word)

    for i in range(0,length):
      self.indexes[(i,word[i])].add(word)

  def search(self, list):
    """list: list of tuples (position,character)
    """
    def compare(lhs,rhs): return cmp(len(lhs),len(rhs))

    sets = [self.indexes[elem] for elem in list]
    sets.sort(compare)
    return reduce(intersection, sets)

I have voluntarily provided the length argument, to minimize the size of the hashes and thus make the search better. Also, the sets are sorted by length so that the computation of the intersection be better :)

Go ahead and test it against other solutions if you wish :)


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...