Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.3k views
in Technique[技术] by (71.8m points)

postgresql - Postgres: index on cosine similarity of float arrays for one-to-many search

Cosine similarity between two equally-sized vectors (of reals) is defined as the dot product divided by the product of the norms.

To represent vectors, I have a large table of float arrays, e.g. CREATE TABLE foo(vec float[])'. Given a certain float array, I need to quickly (with an index, not a seqscan) find the closest arrays in that table by cosine similarity, e.g. SELECT * FROM foo ORDER BY cos_sim(vec, ARRAY[1.0, 4.5, 2.2]) DESC LIMIT 10; But what do I use?

pg_trgm's cosine similarity support is different. It compares text, and I'm not sure what it does exactly. An extension called smlar (here) also has cosine similarity support for float arrays but again is doing something different. What I described is commonly used in data analysis to compare features of documents, so I was thinking there'd be support in Postgres for it.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I gather that no extension that does this, so I've found a limited workaround:

If A and B are both normalized (length 1), cos(A, B) = 1 - 0.5 * ||A - B||^2. ||A - B|| is the Euclidean distance, and cos(A, B) is the cosine similarity. So greater Euclidean distance <=> lesser cosine similarity (makes sense intuitively if you imagine a unit circle), and if you have non-normal vectors, changing their magnitudes without changing their directions doesn't affect their cosine similarities. Great, so I can normalize my vectors and compare their Euclidean distances...

There's a nice answer here about Cube, which supports n-dimensional points and GiST indexes on Euclidean distance, but it only supports 100 or fewer dimensions (can be hacked higher, but I had issues around 135 and higher, so now I'm afraid). Also requires Postgres 9.6 or later.

So:

  1. Make sure I don't care about having at most 100 dimensions. Upgrade to Postgres 9.6 or later.
  2. Fill my table with arrays to represent vectors.
  3. Normalize the vectors to create an extra column of cube points. Create a GiST index on this column.
  4. Order by Euclidean distance ascending to get cosine similarity descending: EXPLAIN SELECT * FROM mytable ORDER BY normalized <-> cube(array[1,2,3,4,5,6,7,8,9,0]) LIMIT 10;

If I need more than 100 dimensions, I might be able to achieve this using multiple indexed columns. Will update the answer in that case.

Update: Pretty sure there's nothing I can do with splitting the >100-dimension vector into multiple columns. I end up having to scan the entire table.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

57.0k users

...