I need to write a custom LemmaTokenFilter, which replaces and indexes the words with their lemmatized(base) form. The problem is, that I get the base forms from an external API, meaning I need to call the API, send my text, parse the response and send it as a Map<String, String>
to my LemmaTokenFilter. The map contains pairs of <originalWord, baseFormOfWord>
. However, I cannot figure out how can I access the full value of the text field, which is being proccessed by the TokenFilters.
One idea is to go through the tokenStream one by one when the LemmaTokenFilter is being created by the LemmaTokenFilterFactory, however I would need to watch out to not edit anything in the tokenStream, somehow reset the current token(since I would need to call the .increment()
method on it to get all the tokens), but most importantly this seems unnecessary, since the field value is already there somewhere and I don't want to spend time trying to put it together again from the tokens. This implementation would probably be too slow.
Another idea would be to just process every token separately, however calling an external API with only one word and then parsing the response is definitely too inefficient.
I have found something on using the ResourceLoaderAware interface, however I don't really understand how could I use this to my advantage. I could probably save the map in a text file before every indexing, but writing to a file, opening it and reading from it before every document indexing seems too slow as well.
So the best way would be to just pass the value of the field as a String to the constructor of LemmaTokenFilter, however I don't know how to access it from the create()
method of the LemmaTokenFilterFactory.
I could not find any help googling it, so any ideas are welcome.
Here's what I have so far:
public final class LemmaTokenFilter extends TokenFilter {
private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
private Map<String, String> lemmaMap;
protected LemmaTokenFilter(TokenStream input, Map<String, String> lemmaMap) {
super(input);
this.lemmaMap = lemmaMap;
}
@Override
public boolean incrementToken() throws IOException {
if (input.incrementToken()) {
String term = termAtt.toString();
String lemma;
if ((lemma = lemmaMap.get(term)) != null) {
termAtt.setEmpty();
termAtt.copyBuffer(lemma.toCharArray(), 0, lemma.length());
}
return true;
} else {
return false;
}
}
}
public class LemmaTokenFilterFactory extends TokenFilterFactory implements ResourceLoaderAware {
public LemmaTokenFilterFactory(Map<String, String> args) {
super(args);
if (!args.isEmpty()) {
throw new IllegalArgumentException("Unknown parameters: " + args);
}
}
@Override
public TokenStream create(TokenStream input) {
return new LemmaTokenFilter(input, getLemmaMap(getFieldValue(input)));
}
private String getFieldValue(TokenStream input) {
//TODO: how?
return "?ach je desková hra pro dva hrá?e, v dne?ní soutě?ní podobě zároveň pova?ovaná i za odvětví sportu.";
}
private Map<String, String> getLemmaMap(String data) {
return UdPipeService.getLemma(data);
}
@Override
public void inform(ResourceLoader loader) throws IOException {
}
}