Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
864 views
in Technique[技术] by (71.8m points)

json - Elasticsearch "More Like This" API vs. more_like_this query

Elasticsearch has two similar features to get "similar" documents:

There is the "More Like This API". It gives me documents similar to a given one. I can't use it in more complex expressions though.

There is also the "more_like_this" query for use in the Search API I can use it in bool or boosting expressions, but I can't give it an id of a document. I have to provide the "like_text" parameter.

I have documents with tags and content. Some documents will have good tags and some won't have any. I want a "Similar documents" feature that will work every time but will rank documents with matching tags higher than documents with matching text. My idea was:

{
    "boosting" : {
        "positive" : {
            "more_like_this" : {
                "fields" : ["tag"],
                "id" : "23452",
                "min_term_freq" : 1
            }
        },
        "negative" : {
            "more_like_this" : {
                "fields" : ["tag"],
                "id" : "23452",
            }
        },
        "negative_boost" : 0.2
    }
}

Obviously this doesn't work because there is no "id" in "more_like_this". What are the alternatives?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

First of all a little introduction about the more like this functionality and how it works. The idea is that you have a specific document and you want to have some others that are similar to it.

In order to achieve this we need to extract some content out of the current document and use it to make a query to get similar ones. We can extract content from the lucene stored fields (or the elasticsearch _source field, which is effectively a stored field in lucene) and somehow reanalyze it or use the information stored in the term vectors (if enabled while indexing) to get a list of terms that we can use to query, without having to reanalyze the text. I'm not sure whether elasticsearch tries this latter approach if term vectors are available though.

The more like this query allows you to provide a text, regardless of where you got it from. That text will be used to query the fields that you select and get back similar documents. The text will not be entirely used, but reanalyzed, and only a maximum of max_query_terms (default 25) will be kept, out of the terms that have at least the provided min_term_freq (minimum term frequency, default 2) and document frequency between min_doc_freq and max_doc_freq. There are more parameters too that can influence the generated query.

The more like this api goes one step further, allowing to provide the id of a document and, again, a list of fields. The content of those fields will be extracted from that specific document and used to make a more like this query on the same fields. That means that the generated more like this query will have the property text containing the text previously extracted and will be performed on the same fields. As you can see the more like this api executes a more like this query under the hood.

Let's say the more like this query gives you more flexibility, since you can combine it with other queries and you can get the text from whatever source you like. On the other hand the more like this api exposes the common functionality doing some more work for you but with some restrictions.

In your case I would combine a couple of different more like this queries together, so that you can make use of the powerful elasticsearch query DSL, boost queries differently and so on. The downside is that you have to provide the text yourself, since you can't provide the id of the document to extract it from.

There are different ways to achieve what you want. I would use a bool query to combine the two more like this queries in a should clause and give them a different weight. I would also use the more like this field query instead, since you want to query a single field at a time.

{
    "bool" : {
        "must" : {
          {"match_all" : { }}
        },
        "should" : [
            {
              "more_like_this_field" : {
                "tags" : {
                  "like_text" : "here go the tags extracted from the current document!",
                  "boost" : 2.0
                }
              }
            },
            {
              "more_like_this_field" : {
                "content" : {
                  "like_text" : "here goes the content extracted from the current document!"
                }
              }
            }
        ],
        "minimum_number_should_match" : 1
    }
}

This way at least one of the should clauses must match, and a match on tags is more important than a match on content.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...