I'm trying to run a hybrid search over two indexed fields: a full-text field and a knn_vector (word embeddings) field. Over 10,000 Wikipedia documents are currently indexed on an ES stack, with both fields populated (see "content" and "embeddings" in the mapping). The queries are known n-grams (n = 1, 2, 3) taken from the indexed Wikipedia pages, so they should yield results.
Note that the knn_vector field is defined inside a nested object.
This is the current mapping of the indexed items:
mapping = {
    "settings": {
        "index": {
            "knn": True,
            "knn.space_type": "cosinesimil",
        },
    },
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "elasticId": {"type": "text"},
            "owners": {"type": "text"},
            "type": {"type": "keyword"},
            "accessLink": {"type": "keyword"},
            "content": {"type": "text"},
            "embeddings": {
                "type": "nested",
                "properties": {
                    "vector": {
                        "type": "knn_vector",
                        "dimension": VECTOR_DIM,
                    },
                },
            },
        },
    },
}
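As a quick sanity check on field names, the dotted paths that a mapping like this defines can be listed with a small helper. This is only a sketch: `field_paths` is a hypothetical helper written for this post, the properties are trimmed to the two fields being searched, and `VECTOR_DIM = 300` is an assumed placeholder rather than the real dimension.

```python
def field_paths(properties, prefix=""):
    """Return {dotted_path: type} for every field under a 'properties' dict."""
    paths = {}
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "type" in spec:
            paths[path] = spec["type"]
        # Recurse into sub-fields of object/nested fields, if any.
        paths.update(field_paths(spec.get("properties", {}), prefix=f"{path}."))
    return paths


VECTOR_DIM = 300  # assumed placeholder, not the real embedding size

# Trimmed copy of the "properties" section of the mapping.
properties = {
    "content": {"type": "text"},
    "embeddings": {
        "type": "nested",
        "properties": {
            "vector": {"type": "knn_vector", "dimension": VECTOR_DIM},
        },
    },
}

paths = field_paths(properties)
for path, field_type in paths.items():
    print(path, field_type)
```

Since the mapping is strict, these are the only paths that can hold data; a match clause on a field name outside this list has nothing to match against.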
My goal is to compare the query scores on both fields (full text vs. knn_vector) to understand whether one is more efficient than the other, and how Elasticsearch decides which documents to return based on the score from each.
I understand I could simply split this into two separate queries, but ideally we would want to use a hybrid search of this type in production.
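For reference, the split version would look roughly like the sketch below. The helper names `full_text_body` and `knn_body` are hypothetical, the field names come from the mapping, and wrapping the plugin's `knn` clause in a `nested` query is an assumption: whether that is accepted depends on the k-NN plugin version installed.

```python
def full_text_body(query, size=10):
    """Request body for a plain full-text search on the 'content' field."""
    return {
        "size": size,
        "query": {"match": {"content": {"query": query}}},
    }


def knn_body(query_vector, k=10):
    """Request body for an approximate k-NN search on 'embeddings.vector'.

    Assumption: the installed k-NN plugin allows a knn clause inside a
    nested query; this is not guaranteed for every version.
    """
    return {
        "size": k,
        "query": {
            "nested": {
                "path": "embeddings",
                "query": {
                    "knn": {
                        "embeddings.vector": {"vector": query_vector, "k": k},
                    },
                },
            },
        },
    }


text_request = full_text_body("alan turing", size=2)
vector_request = knn_body([0.1, 0.2, 0.3], k=2)  # toy 3-dimensional vector
```

Each body would then be sent on its own, for example `elastic.search(body=text_request, index='files_en')`, and the two score lists compared side by side.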
This is the current query, which searches on both the full text and the knn_vector:
def MakeHybridSearch(query):
    # Convert the query text to its embedding vector
    query_vector = convert_to_embeddings(query)
    result = elastic.search({
        "explain": True,
        "profile": True,
        "size": 2,
        "query": {
            "function_score": {
                "functions": [
                    {
                        # Full-text clause
                        "filter": {
                            "match": {
                                "text": {
                                    "query": query,
                                    "boost": "5",
                                },
                            },
                        },
                        "weight": 2,
                    },
                    {
                        # Vector clause
                        "filter": {
                            "script": {
                                "source": "knn_score",
                                "params": {
                                    "field": "doc_vector",
                                    "vector": query_vector,
                                    "space_type": "l2",
                                },
                            },
                        },
                        "weight": 4,
                    },
                ],
                "max_boost": 5,
                "score_mode": "replace",
                "boost_mode": "multiply",
                "min_score": 5,
            },
        },
    }, index='files_en', size=1000)
    return result
The problem is that none of the queries return anything.
Result:
{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}
Even when a query does return hits, they all have a score of 0.
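For comparison, here is a back-of-the-envelope model of the scores these function_score settings would be expected to produce. It is a sketch under assumptions, not the engine's actual code: the function names are made up, the base query is assumed to score 1.0 because no "query" is given, and "sum" is assumed for score_mode because the documentation lists "replace" under boost_mode only.

```python
import math


def combine_functions(weights, score_mode):
    """Combine the weights of the functions whose filter matched a document."""
    if not weights:
        return 1.0  # no function applies: neutral factor
    if score_mode == "sum":
        return float(sum(weights))
    if score_mode == "multiply":
        return float(math.prod(weights))
    if score_mode == "max":
        return float(max(weights))
    if score_mode == "min":
        return float(min(weights))
    if score_mode == "first":
        return float(weights[0])
    raise ValueError(f"unsupported score_mode: {score_mode}")


def final_score(weights, score_mode, max_boost, boost_mode, query_score=1.0):
    """Cap the combined function value, then merge it with the query score."""
    factor = min(combine_functions(weights, score_mode), max_boost)
    if boost_mode == "multiply":
        return query_score * factor
    if boost_mode == "replace":
        return factor
    if boost_mode == "sum":
        return query_score + factor
    raise ValueError(f"unsupported boost_mode: {boost_mode}")


MIN_SCORE = 5  # same threshold as "min_score" in the query
cases = [
    ("both filters match", [2, 4]),
    ("text filter only", [2]),
    ("vector filter only", [4]),
    ("no filter matches", []),
]
for label, matched in cases:
    score = final_score(matched, "sum", max_boost=5, boost_mode="multiply")
    verdict = "returned" if score >= MIN_SCORE else "dropped by min_score"
    print(f"{label}: {score} ({verdict})")
```

Under these assumptions only documents matching both filters reach the min_score threshold, and no returned hit should score 0, which does not match what the cluster actually returns.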
Is there an error in the query structure? Could this be on the mapping side? If not, is there a better way of doing this?
Thank you for your help!