I wasn't able to get the FVH to handle phrase queries correctly, and wound up having to develop my own summarizer. The gist of my approach is discussed here; what I wound up doing is creating an array of objects, one for each term that I pulled from the queries. Each object contains a word index and its position, and whether it was already used in some match. These instances are the TermAtPosition
instances in the sample below. Then, given position span and an array of word identities (indexes) corresponding to a phrase query, I iterated through the array, looking to match all term indexes within the given span. If I found a match, I marked each matching term as being consumed, and added the matching span to a list of matches. I could then use these matches to score sentences. Here is the matching code:
protected void scorePassage(TermPositionVector v, String[] words, int span,
float score, SentenceScore[] scores, Scorer scorer) {
TermAtPosition[] order = getTermsInOrder(v, words);
if (order.length < words.length)
return;
int positions[] = new int[words.length];
List<int[]> matches = new ArrayList<int[]>();
for(int t=0; t<order.length; t++) {
TermAtPosition tap = order[t];
if (tap.consumed)
continue;
int p = 0;
positions[p++] = tap.position;
for(int u=0; u<words.length; u++) {
if (u == tap.termIndex)
continue;
int nextTermPos = spanContains(order, u, tap.position, span);
if (nextTermPos == -1)
break;
positions[p++] = nextTermPos;
}
// got all terms
if (p == words.length)
matches.add(recordMatch(order, positions.clone()));
}
if (matches.size() > 0)
for (SentenceScore sentenceScore: scores) {
for(int[] matchingPositions: matches)
scorer.scorePassage(sentenceScore, matchingPositions, score);
}
}
protected int spanContains(TermAtPosition[] order, int targetWord,
int start, int span) {
for (int i=0; i<order.length; i++) {
TermAtPosition tap = order[i];
if (tap.consumed || tap.position <= start ||
(tap.position > start + span))
continue;
if (tap.termIndex == targetWord)
return tap.position;
}
return -1;
}
This approach seems to work, but it is greedy. Given a sequence "a a b c" it will it match the first a (leaving the second a alone), and then match b and c. I think a bit of recursion or integer programming could be applied to make it less greedy, but I couldn't be bothered, and wanted a faster rather than a more accurate algorithm anyway.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…