Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

java - Can I get the Field value in String into custom TokenFilter in Apache Solr?

I need to write a custom LemmaTokenFilter that replaces indexed words with their lemmatized (base) forms. The problem is that I get the base forms from an external API: I need to call the API, send my text, parse the response, and pass it as a Map&lt;String, String&gt; to my LemmaTokenFilter. The map contains pairs of &lt;originalWord, baseFormOfWord&gt;. However, I cannot figure out how to access the full value of the text field that is being processed by the TokenFilters.

One idea is to walk the tokenStream token by token when the LemmaTokenFilter is created by the LemmaTokenFilterFactory. However, I would need to be careful not to modify anything in the tokenStream, and to somehow reset it afterwards (since I would have to call incrementToken() to consume all the tokens). More importantly, this seems unnecessary: the field value already exists somewhere, and I don't want to spend time reassembling it from the tokens. This implementation would probably be too slow anyway.
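For what it's worth, that first idea can be done without corrupting the stream by using Lucene's state capture: buffer every token's State on the first incrementToken() call, rebuild an approximation of the text, make one API call, then replay the buffered tokens. This is only a sketch (UdPipeService.getLemma is the service from the question; joining terms with single spaces loses the original punctuation and spacing, which may matter to the API):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Sketch: buffer the whole stream, rebuild the text, make ONE API call,
// then replay the buffered tokens with lemmas substituted.
public final class BufferingLemmaTokenFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final List<State> buffered = new ArrayList<>();
    private Iterator<State> replay;
    private Map<String, String> lemmaMap;

    public BufferingLemmaTokenFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (replay == null) {
            // First pass: consume everything, remembering each token's state.
            StringBuilder text = new StringBuilder();
            while (input.incrementToken()) {
                buffered.add(captureState());
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(termAtt);
            }
            lemmaMap = UdPipeService.getLemma(text.toString()); // one call per field
            replay = buffered.iterator();
        }
        if (!replay.hasNext()) {
            return false;
        }
        // Second pass: replay each token, swapping in its base form if known.
        restoreState(replay.next());
        String lemma = lemmaMap.get(termAtt.toString());
        if (lemma != null) {
            termAtt.setEmpty().append(lemma);
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        buffered.clear();
        replay = null;
        lemmaMap = null;
    }
}
```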

Another idea would be to process every token separately, but calling an external API for a single word and then parsing the response for each one is definitely too inefficient.

I have found something about the ResourceLoaderAware interface, but I don't really understand how I could use it to my advantage. I could save the map to a text file before every indexing run, but writing to a file and then opening and reading it before indexing each document seems too slow as well.

So the best way seems to be to just pass the value of the field as a String to the constructor of LemmaTokenFilter; however, I don't know how to access it from the create() method of LemmaTokenFilterFactory.

I could not find any help by googling, so any ideas are welcome.

Here's what I have so far:

public final class LemmaTokenFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    private Map<String, String> lemmaMap;

    protected LemmaTokenFilter(TokenStream input, Map<String, String> lemmaMap) {
        super(input);
        this.lemmaMap = lemmaMap;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // Replace the current term with its base form, if the map knows one.
            String lemma = lemmaMap.get(termAtt.toString());
            if (lemma != null) {
                termAtt.setEmpty();
                termAtt.copyBuffer(lemma.toCharArray(), 0, lemma.length());
            }
            return true;
        }
        return false;
    }
}
public class LemmaTokenFilterFactory extends TokenFilterFactory implements ResourceLoaderAware {

    public LemmaTokenFilterFactory(Map<String, String> args) {
        super(args);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new LemmaTokenFilter(input, getLemmaMap(getFieldValue(input)));
    }

    private String getFieldValue(TokenStream input) {
        //TODO: how?
        return "Šach je desková hra pro dva hráče, v dnešní soutěžní podobě zároveň považovaná i za odvětví sportu.";
    }

    private Map<String, String> getLemmaMap(String data) {
        return UdPipeService.getLemma(data);
    }

    @Override
    public void inform(ResourceLoader loader) throws IOException {

    }
}


1 Reply


1. API-based approach: You can create an analysis chain with the custom lemmatizer at the top. To design this lemmatizer, look at the implementation of the Keyword Tokenizer: read the entire input, call your API, and replace the tokens in the input text with those from the API response. After that, further down the analysis chain, use the standard or whitespace tokenizer to tokenize your data.
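In schema terms, the suggested chain might look roughly like this (the field type name and the factory's package are illustrative, not from the question):

```xml
<!-- Sketch of the suggested analysis chain in a Solr schema -->
<fieldType name="text_lemma" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="com.example.LemmaTokenFilterFactory"/>
  </analyzer>
</fieldType>
```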

2. File-based approach: It follows the same steps, except that instead of calling the API it can use a hash map loaded from the files mentioned when defining the TokenStream.

Now, coming to ResourceLoaderAware: it is needed when you want to tell your TokenStream that a resource has changed; its inform() method takes care of that. For reference, you can look at StemmerOverrideFilter.
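StemmerOverrideFilter's factory does essentially this in inform(): it reads a mapping file through the ResourceLoader and builds an override map. The parsing itself needs nothing beyond the standard library; a sketch of such a loader, assuming a "word&lt;TAB&gt;lemma" line format (that format is an assumption here, not something prescribed by Solr):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader for the file-based approach: each line holds
// "originalWord<TAB>baseForm", the same shape as the Map<String, String>
// the question builds from the API response.
public class LemmaMapLoader {

    public static Map<String, String> load(BufferedReader reader) throws IOException {
        Map<String, String> lemmaMap = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                lemmaMap.put(parts[0], parts[1]);
            }
        }
        return lemmaMap;
    }
}
```

In a factory implementing ResourceLoaderAware, the BufferedReader would be opened in inform() from the loader's openResource() stream, so the map is rebuilt whenever the core reloads.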

  • Keyword Tokenizer: Emits the entire input as a single token.
