Searching on expression indexes
I'm building an expression index that substrings a source field to avoid overflowing the 2172 character limit on B-trees:
CREATE INDEX record_changes_log_detail_old_value_ix_btree
ON record_changes_log_detail
USING btree ((substring(old_value,1,1024)::text) text_pattern_ops);
For the record:
-- Postgres 11.4 on RDS, 11.5 on macOS at home.
-- The record_changes_log_detail table has about 8M in my test setup.
-- The old_value field is of type citext.
-- Values in the field range in length from 1 character to over 5,000. Most are short.
This search uses the index specified above:
select * from record_changes_log_detail
where substring(old_value,1,1024) = 'Gold Kerrison Neuro';
This search does not use the index:
where old_value = 'Gold Kerrison Neuro';
I found this surprising, and a real bummer. If I understand correctly a comment from jjanes on another question, the planner only recognizes that the index applies when your query statement uses exactly the same expression. In other words, anyone writing a query needs to know the details of the index definition or else the index won't be used.
I had been assuming that when the expression index was constructed, the abbreviated/extracted/etc. value was stored and that the planner would check it. Is there any way to hint the planner other than to recapitulate the entire expression? The index has the right data, but the planner seems to skip over it.
I'm adding a bit of detail based on Erwin Brandstetter's answer:
I have lots of similar situations, which is why I'm digging in here on the details. In this case, of my ~8M rows, only 6 have values longer than 2172 characters, and 99.93% of the values are 100 characters or less.
What I'm hoping for is an approach that is easy for someone else to pick up. A shadow-field might well be the answer as having to know the exact details of index constructions strikes me as exactly the wrong sort of visibility. A shadow field doesn't suffer from that problem, once you know to use it. I could either populate it with LEFT(old_field,128) or some other length, or texthash(old_field), as you mentioned. I'll experiment with that. My data is so skewed to short values, that hashing seems to lead to a high collision rate.
For what it's worth, the team and I are coming from a system where text fields are silently trimmed to 1024 characters when indexed in a B-tree. It's completely transparent to the user and searches consult the index. Apples and spark plugs, I know. The point is that I'm not expecting Postgres to be an AI, but I am coming with inaccurate priors. So thanks to you and everyone else for helping me learn more about how Postgres actually works.
Follow-up
This question has been answered, but I want to add some follow-up for the archives. I've been learning a lot from old answers, some very old. So here's a bit of information for the future. I tried out four solutions:
- B-tree on a portion of the citext field.
- B-tree on a hash of the citext field.
- Hash index of the citext field.
- Tri-gram GIN index of the citext field.
Since there doesn't appear to be any way to get LIKE-type queries on citext where the text may be too long, the goal is to create an index for =. Any of the three above will work okay, but they differ quite a bit. Here's some setup code for the test:
DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_btree;
CREATE INDEX record_changes_log_detail_old_value_ix_btree
ON record_changes_log_detail
USING btree ((left(old_value,1024)::citext) citext_pattern_ops);
DROP INDEX IF EXISTS record_changes_log_detail_old_value_hash_ix_btree;
CREATE INDEX record_changes_log_detail_old_value_hash_ix_btree
ON record_changes_log_detail
USING btree (hashtext(old_value));
DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_hash;
CREATE INDEX record_changes_log_detail_old_value_ix_hash
ON record_changes_log_detail
USING hash (old_value);
DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_tgrm;
CREATE INDEX record_changes_log_detail_old_value_ix_tgrm
ON record_changes_log_detail
USING gin (old_value gin_trgm_ops);
VACUUM ANALYZE;
These indexes each work to find a record, but with different syntax:
-- Uses the LEFT()::citext index
explain analyze
select * from record_changes_log_detail
where left(old_value,1024)::citext = 'Gold Kerrison Neuro';
-- Uses the HASH index
explain analyze
select * from record_changes_log_detail
where old_value = 'Gold Kerrison Neuro';
-- Uses the HASHTEXT() index
explain analyze
select * from record_changes_log_detail
where hashtext(old_value) = hashtext('Gold Kerrison Neuro');
-- Uses the tri-gram() index
explain analyze
select * from record_changes_log_detail
where old_value::text LIKE '%Gold Kerrison Neuro%';
The hash index provides the nicest syntax because it's transparent...but the hash index is the worst in every other way. Here's a size search and results. I've added reported index building times manually here.
select
'B-tree on LEFT(old_value,1024)::citext' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_ix_btree')) as pretty
union all
select
'B-tree on HASHTEXT(old_value)' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_hash_ix_btree')) as pretty
union all
select
'Hash index on old_value' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_ix_hash')) as pretty
union all
select
'GIN tri-gram index on old_value' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_ix_tgrm')) as pretty;
index_description pretty seconds
B-tree on LEFT(old_value,1024)::citext 238 MB 38
B-tree on HASHTEXT(old_value) 166 MB 7
Hash index on old_value 362 MB 3,802
GIN tri-gram index on old_value 106 MB 56
I'd say this data is a terrible match for a hash index, so please don't take those results as typical. Still, the time and size are pretty bad. The clear winner for = searches is Erwin Brandstetter's clever suggestion to B-tree a hash. Nice! The extra syntactic sugar needed for the search isn't as bad here as for the LEFT-based index. Looking forward, this will benefit from the B-tree improvements promised in PG 12.
And more good news, the tri-gram index is awesome. Laurenz Albe suggested trying it out, and I'm happy I did. Instantaneous contains/like searches, perfect. That's just what I need. Here again, I doubt that the index size is typical...my data is weird. For those using citext, note that you have to cast the search condition to text for the index to be used:
select * from record_changes_log_detail
where old_value::text LIKE '%Gold Kerrison Neuro%';
For those who don't know, tri-grams are an instance of n-grams with a length of 3. N-grams are sometimes called q-grams or k-grams. Whatever, it's the same thing. Of all of the naive (non-probabilistic or statistical) fuzzy text matching algorithms, it's probably the best. Robust over different data sets and languages, flexible, awesome. So I'm super pleased at how well it works in Postgres.
See Question&Answers more detail:
os