Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
805 views
in Technique[技术] by (71.8m points)

postgresql - Similar UTF-8 strings for autocomplete field

Background

Users can type in a name and the system should match the text, even if the either the user input or the database field contains accented (UTF-8) characters. This is using the pg_trgm module.

Problem

The code resembles the following:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    label % 'fil'
  ORDER BY
    similarity( t.label, 'fil' ) DESC

When the user types fil, the query matches filbert but not filé powder. (Because of the accented character?)

Failed Solution #1

I tried to implement an unaccent function and rewrite the query as:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    unaccent( label ) % unaccent( 'fil' )
  ORDER BY
    similarity( unaccent( t.label ), unaccent( 'fil' ) ) DESC

This returns only filbert.

Failed Solution #2

As suggested:

CREATE EXTENSION pg_trgm;
CREATE EXTENSION unaccent;

CREATE OR REPLACE FUNCTION unaccent_text(text)
  RETURNS text AS
$BODY$
  SELECT unaccent($1); 
$BODY$
  LANGUAGE sql IMMUTABLE
  COST 1;

All other indexes on the table have been dropped. Then:

CREATE INDEX label_unaccent_idx 
ON the_table( lower( unaccent_text( label ) ) );

This returns only one result:

  SELECT
    t.label
  FROM
    the_table t
  WHERE
    label % 'fil'
  ORDER BY
    similarity( t.label, 'fil' ) DESC

Question

What is the best way to rewrite the query to ensure that both results are returned?

Thank you!

Related

http://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.0#Unaccent_filtering_dictionary

http://postgresql.1045698.n5.nabble.com/index-refuses-to-build-td5108810.html

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

You are not using the operator class provided by the pg_trgm module. I would create an index like this:

CREATE INDEX label_Lower_unaccent_trgm_idx
ON test_trgm USING gist (lower(unaccent_text(label)) gist_trgm_ops);

Originally, I had a GIN index here, but I later learned that a GiST is probably even better suited for this kind of query because it can return values sorted by similarity. More details:

Your query has to match the index expression to be able to make use of it.

SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) % 'fil'
ORDER  BY similarity(label, 'fil') DESC -- it's ok to use original string here

However, "filbert" and "filé powder" are not actually very similar to "fil" according to the % operator. I suspect what you really want is this:

SELECT label
FROM   the_table
WHERE  lower(unaccent_text(label)) ~~ '%fil%'
ORDER  BY similarity(label, 'fil') DESC -- it's ok to use original string here

This will find all strings containing the search string, and sort the best matches according to the % operator first.

And the juicy part: the expression can use a GIN or GiST index since PostgreSQL 9.1! I quote the manual on the pg_trgm moule:

Beginning in PostgreSQL 9.1, these index types also support index searches for LIKE and ILIKE, for example


If you actually meant to use the % operator:

Have you tried lowering the threshold for the similarity operator % with set_limit():

SELECT set_limit(0.1);

or even lower? The default is 0.3. Just to see whether its the threshold that filters additional matches.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...