I'm trying to understand how fieldNorm
is calculated (at index time) and then used (and apparently re-calculated) at query time.
In all the examples I'm using the StandardAnalyzer with no stop words.
Debugging the DefaultSimilarity's computeNorm
method while indexing, I've noticed that for two particular documents it returns:
- 0.5 for document A (which has 4 tokens in its field)
- 0.70710677 for document B (which has 2 tokens in its field)
It does this by using the formula:
state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
where boost is always 1
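As a sanity check, those two index-time values follow directly from the formula. A minimal standalone sketch (not using Lucene itself, just the same arithmetic):

```java
public class DefaultNormCheck {
    // Same length-norm formula as DefaultSimilarity.computeNorm:
    // boost * 1/sqrt(number of terms in the field)
    static float lengthNorm(float boost, int numTerms) {
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(lengthNorm(1.0f, 4)); // document A: 0.5
        System.out.println(lengthNorm(1.0f, 2)); // document B: 0.70710677
    }
}
```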
Afterwards, when I query for these documents I see that in the query explain I get
0.5 = fieldNorm(field=titre, doc=0)
for document A
0.625 = fieldNorm(field=titre, doc=1)
for document B
This is already strange (to me; I'm sure it's me who's missing something). Why don't I get the same fieldNorm values at query time as those calculated at index time? Is this the "query normalization" thing in action? If so, how does it work?
This is more or less OK, though, since the two query-time fieldNorms preserve the index-time ordering (the field with fewer tokens has the higher fieldNorm in both cases).
I've then made my own Similarity class where I've implemented the computeNorm method like so:
public float computeNorm(String pField, FieldInvertState state) {
    // deliberately add the boost instead of multiplying, to get norms above 1
    float norm = (float) (state.getBoost() + (1.0d / Math.sqrt(state.getLength())));
    return norm;
}
At index time I now get:
- 1.5 for document A (which has 4 tokens in its field)
- 1.7071068 for document B (which has 2 tokens in its field)
However, when I now query for these documents, the explain output reports the same fieldNorm for both:
1.5 = fieldNorm(field=titre, doc=0)
for document A
1.5 = fieldNorm(field=titre, doc=1)
for document B
To me this is now really strange: if I use an apparently good similarity to calculate the fieldNorm at index time, one that gives values proportional to the number of tokens, why is all this lost at query time, with explain saying both documents have the same fieldNorm?
So my questions are:
- why does the index-time fieldNorm, as returned by the Similarity's computeNorm method, not match the value reported by query explain?
- why, for two different fieldNorm values obtained at index time (via the Similarity's computeNorm), do I get identical fieldNorm values at query time?
== UPDATE
Ok, I've found something in Lucene's docs which clarifies part of my question, but not all of it:
However the resulted norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
How much precision loss is there? Is there a minimum gap we should put between different values so that they remain different even after the precision-loss re-calculations?
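For reference, here's a standalone reimplementation of the single-byte norm encoding that Lucene's SmallFloat.floatToByte315/byte315ToFloat perform (the class name below is mine; only the bit layout is taken from Lucene, so treat it as an illustration rather than the library's code). It reproduces the values observed above: 0.5 survives the round trip, 0.70710677 decodes to 0.625, and both 1.5 and 1.7071068 collapse to 1.5:

```java
public class NormByteCodec {
    // Encode a float into one byte: 3 significand bits, with the
    // zero-exponent point at 15 (Lucene's "315" scheme for norms).
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);      // drop all but the top significand bits
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;  // underflow: 0 or smallest positive
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;                                  // overflow: clamp to largest value
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);      // restore the significand position
        bits += (63 - 15) << 24;                // re-bias the exponent
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float[] norms = {0.5f, 0.70710677f, 1.5f, 1.7071068f};
        for (float n : norms) {
            System.out.println(n + " -> " + byte315ToFloat(floatToByte315(n)));
        }
        // 0.5        -> 0.5    (survives the round trip)
        // 0.70710677 -> 0.625
        // 1.5        -> 1.5
        // 1.7071068  -> 1.5    (same byte as 1.5: the difference is lost)
    }
}
```

If this matches what Lucene does, the answer to the gap question would be that only four distinct values exist per power of two (e.g. 0.5, 0.625, 0.75, 0.875 between 0.5 and 1), so two index-time norms must differ by at least one such step, and the steps get coarser as the values grow.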