Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.0k views
in Technique[技术] by (71.8m points)

postgresql - speeding up wildcard text lookups

I have a simple table in Postgres with a bit over 8 million rows. The column of interest holds short text strings, typically one or more words total length less than 100 characters. It is set as 'character varying (100)'. The column is indexed. A simple look up like below takes > 3000 ms.

SELECT a, b, c FROM t WHERE a LIKE '?%'

Yes, for now, the need is to simply find the rows where "a" starts with the entered text. I want to bring the speed of look up down to under 100 ms (the appearance of instantaneous). Suggestions? Seems to me that full text search won't help here as my column of text is too short, but I would be happy to try that if worthwhile.

Oh, btw I also loaded the exact same data in mongodb and indexed column "a". Loading the data in mongodb was amazingly quick (mongodb++). Both mongodb and Postgres are pretty much instantaneous when doing exact lookups. But, Postgres actually shines when doing trailing wildcard searches as above, consistently taking about 1/3 as long as mongodb. I would be happy to pursue mongodb if I could speed that up as this is only a readonly operation.

Update: First, a couple of EXPLAIN ANALYZE outputs

EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE a LIKE 'abcd%'

"Seq Scan on t  (cost=0.00..282075.55 rows=802 width=40) 
    (actual time=1220.132..1220.132 rows=0 loops=1)"
"  Filter: ((a)::text ~~ 'abcd%'::text)"
"Total runtime: 1220.153 ms"

I actually want to compare Lower(a) with the search term which is always at least 4 characters long, so

EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE Lower(a) LIKE 'abcd%'

"Seq Scan on t  (cost=0.00..302680.04 rows=40612 width=40) 
    (actual time=4.681..3321.387 rows=788 loops=1)"
"  Filter: (lower((a)::text) ~~ 'abcd%'::text)"
"Total runtime: 3321.504 ms"

So I created an index

CREATE INDEX idx_t ON t USING btree (Lower(Substring(a, 1, 4) ));

"Seq Scan on t  (cost=0.00..302680.04 rows=40612 width=40) 
    (actual time=3243.841..3243.841 rows=0 loops=1)"
"  Filter: (lower((a)::text) = 'abcd%'::text)"
"Total runtime: 3243.860 ms"

Seems the only time an index is being used is when I am looking for an exact match

EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE a = 'abcd'

"Index Scan using idx_t on geonames  (cost=0.00..57.89 rows=13 width=40) 
    (actual time=40.831..40.923 rows=17 loops=1)"
"  Index Cond: ((ascii_name)::text = 'Abcd'::text)"
"Total runtime: 40.940 ms"

Found a solution by implementing an index with varchar_pattern_ops, and am now looking for an even quicker lookups.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The PostgreSQL query planner is smart, but not an AI. To make it use an index on an expression use the exact same form of expression in the query.

With an index like this:

CREATE INDEX t_a_lower_idx ON t (lower(substring(a, 1, 4)));

Or simpler in PostgreSQL 9.1:

CREATE INDEX t_a_lower_idx ON t (lower(left(a, 4)));

Use this query:

SELECT * FROM t WHERE lower(left(a, 4)) = 'abcd';

Which is 100% functionally equivalent to:

SELECT * FROM t WHERE lower(a) LIKE 'abcd%'

Or:

SELECT * FROM t WHERE a ILIKE 'abcd%'

But not:

SELECT * FROM t WHERE a LIKE 'abcd%'

This is a functionally different query and you need a different index:

CREATE INDEX t_a_idx ON t (substring(a, 1, 4));

Or simpler with PostgreSQL 9.1:

CREATE INDEX t_a_idx ON t (left(a, 4));

And use this query:

SELECT * FROM t WHERE left(a, 4) = 'abcd';

Left anchored search terms of variable length

Case insensitive. Index:

Edit: Almost forgot: If you run your db with any other locale than the default 'C', you need to specify the operator class explicitly - text_pattern_ops in my example:

CREATE INDEX t_a_lower_idx
ON t (lower(left(a, <insert_max_length>)) text_pattern_ops);

Query:

SELECT * FROM t WHERE lower(left(a, <insert_max_length>)) ~~ 'abcdef%';

Can utilize the index and is almost as fast as the variant with a fixed length.

You may be interested in this post on dba.SE with more details about pattern matching, especially the last part about the operators ~>=~ and ~<~.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...