java - Lucene: Multi-word phrases as search terms

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to make a searchable phone/local business directory using Apache Lucene.

I have fields for street name, business name, phone number, etc. The problem is that when I search by street and the street name has multiple words (e.g. 'the crescent'), no results are returned. But if I search with just one word, e.g. 'crescent', I get all the results that I want.

I'm indexing the data with the following:

String LocationOfDirectory = "C:\\dir\\index";

// StandardAnalyzer lowercases tokens and removes stop words at index time
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
Directory index = new SimpleFSDirectory(new File(LocationOfDirectory));

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
IndexWriter w = new IndexWriter(index, config);

// One document with an analyzed, stored "Street" field
Document doc = new Document();
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.ANALYZED));

w.addDocument(doc);
w.close();

My searches work like this:

int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));

// The term text is passed to the query as-is, without analysis
WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));

searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

I have tried swapping the wildcard query for a phrase query, first with the entire string and then splitting the string up on white space and wrapping them in a BooleanQuery like this:

String term = "the crescent";
BooleanQuery b = new BooleanQuery();
PhraseQuery p = new PhraseQuery();
String[] tokens = term.split(" ");
for(int i = 0 ; i < tokens.length ; ++i)
{
    p.add(new Term("Street", tokens[i]));
}
b.add(p, BooleanClause.Occur.MUST);

However, this didn't work. I tried using a KeywordAnalyzer instead of a StandardAnalyzer, but then all other types of search stopped working as well. I have tried replacing spaces with other characters (+ and @), and converting queries to and from this form, but that still doesn't work. I think it doesn't work because + and @ are special characters which are not indexed, but I can't seem to find a list anywhere of which characters are like that.

I'm beginning to go slightly mad. Does anyone know what I'm doing wrong?

1 Reply

深蓝 · Answer 1 · 2021-10-17T01:13:27+0000

The reason you don't get your documents back is that you're indexing with StandardAnalyzer, which converts tokens to lowercase and removes stop words. So the only term that gets indexed for your example is 'crescent'. However, wildcard queries are not analyzed, so 'the' is included as a mandatory part of the query: your WildcardQuery looks for the literal term 'the crescent', which is not in the index. The same goes for phrase queries in your scenario, because a PhraseQuery built by hand still requires the term 'the', which was never indexed.
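To make the mismatch concrete, here is a minimal plain-Java sketch of the idea. It is not Lucene code: `StopWordSketch`, `analyze`, `rawTermMatches` and the short stop list are made-up stand-ins, assuming only that index-time analysis lowercases tokens and drops stop words while the query term is compared as-is.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of index-time analysis vs. an unanalyzed query term (NOT Lucene code).
final class StopWordSketch {
    // Hypothetical stop list for illustration; a real analyzer's list is longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of"));

    // Roughly what happens at index time: split, lowercase, drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // An unanalyzed query term matches only if that exact term was indexed.
    static boolean rawTermMatches(List<String> indexedTerms, String rawTerm) {
        return indexedTerms.contains(rawTerm);
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("the crescent");
        System.out.println(indexed);                                  // [crescent]
        System.out.println(rawTermMatches(indexed, "the crescent"));  // false
        System.out.println(rawTermMatches(indexed, "crescent"));      // true
        // Analyzing the query text the same way yields only terms that exist.
        System.out.println(indexed.containsAll(analyze("the crescent"))); // true
    }
}
```

The last line hints at the usual way out: run the query text through the same analysis as the indexed text, so both sides agree on which terms exist.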

KeywordAnalyzer is probably not very suitable for your use case, because it takes the whole field content as a single token. You can use SimpleAnalyzer for the street field: it splits the input on all non-letter characters and then converts the tokens to lowercase. You can also consider using WhitespaceAnalyzer with LowerCaseFilter. You need to try the different options and work out what works best for your data and users.
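The difference between these options is easiest to see side by side. The sketch below only imitates the three tokenization strategies in plain Java; it does not call Lucene, the class and method names are invented for illustration, and the real analyzers may differ in edge cases.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Plain-Java imitation of three tokenization strategies (NOT Lucene code).
final class TokenizerSketch {
    // KeywordAnalyzer-like: the whole field value is one unchanged token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // SimpleAnalyzer-like: split on anything that is not a letter, then lowercase.
    static List<String> simple(String text) {
        return splitAndLowercase(text, "[^\\p{L}]+");
    }

    // WhitespaceAnalyzer + LowerCaseFilter-like: split on whitespace, then lowercase.
    static List<String> whitespaceLower(String text) {
        return splitAndLowercase(text, "\\s+");
    }

    private static List<String> splitAndLowercase(String text, String delimiterRegex) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split(delimiterRegex)) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String street = "12B The Crescent";
        System.out.println(keyword(street));         // [12B The Crescent]
        System.out.println(simple(street));          // [b, the, crescent]
        System.out.println(whitespaceLower(street)); // [12b, the, crescent]
    }
}
```

Note that the letter-based split throws away digits, which matters for values such as house numbers or phone numbers, while both splitting strategies keep 'the' as a searchable token.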

Also, you can use a different analyzer per field (e.g. with PerFieldAnalyzerWrapper) if changing the analyzer for that field breaks other searches.
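The per-field idea can be sketched in plain Java as a default analysis plus per-field overrides. This is only a toy model of the concept, not the Lucene class: `PerFieldSketch`, its methods, the stop list and the "BusinessName" field name are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Toy model of "one analysis per field, with a default" (NOT Lucene code).
final class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalysis;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> defaultAnalysis) {
        this.defaultAnalysis = defaultAnalysis;
    }

    // Register an analysis that applies to a single field only.
    void addAnalysis(String field, Function<String, List<String>> analysis) {
        perField.put(field, analysis);
    }

    // Use the field's own analysis if one is registered, else the default.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalysis).apply(text);
    }

    // Lowercase and split on non-letters, keeping every word.
    static List<String> keepAllWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Same as keepAllWords, then drop a small hypothetical stop list.
    static List<String> dropStopWords(String text) {
        List<String> tokens = keepAllWords(text);
        tokens.removeAll(Arrays.asList("the", "a", "an", "of"));
        return tokens;
    }

    public static void main(String[] args) {
        PerFieldSketch analysis = new PerFieldSketch(PerFieldSketch::dropStopWords);
        analysis.addAnalysis("Street", PerFieldSketch::keepAllWords);
        System.out.println(analysis.analyze("Street", "The Crescent"));     // [the, crescent]
        System.out.println(analysis.analyze("BusinessName", "The Bakery")); // [bakery]
    }
}
```

Only the street field changes behaviour here; every other field keeps the default, so searches that already work are left alone.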
