I gather that no extension that does this, so I've found a limited workaround:
If A and B are both normalized (length 1), cos(A, B) = 1 - 0.5 * ||A - B||^2
. ||A - B||
is the Euclidean distance, and cos(A, B)
is the cosine similarity. So greater Euclidean distance <=> lesser cosine similarity (makes sense intuitively if you imagine a unit circle), and if you have non-normal vectors, changing their magnitudes without changing their directions doesn't affect their cosine similarities. Great, so I can normalize my vectors and compare their Euclidean distances...
There's a nice answer here about Cube, which supports n-dimensional points and GiST indexes on Euclidean distance, but it only supports 100 or fewer dimensions (can be hacked higher, but I had issues around 135 and higher, so now I'm afraid). Also requires Postgres 9.6 or later.
So:
- Make sure I don't care about having at most 100 dimensions. Upgrade to Postgres 9.6 or later.
- Fill my table with arrays to represent vectors.
- Normalize the vectors to create an extra column of
cube
points. Create a GiST index on this column.
- Order by Euclidean distance ascending to get cosine similarity descending:
EXPLAIN SELECT * FROM mytable ORDER BY normalized <-> cube(array[1,2,3,4,5,6,7,8,9,0]) LIMIT 10;
If I need more than 100 dimensions, I might be able to achieve this using multiple indexed columns. Will update the answer in that case.
Update: Pretty sure there's nothing I can do with splitting the >100-dimension vector into multiple columns. I end up having to scan the entire table.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…