From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Erik Rijkers <er(at)xs4all(dot)nl> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: trgm regex index peculiarity |
Date: | 2013-06-21 13:11:01 |
Message-ID: | CAPpHfdu1+Y6mvrg2PU5raEE5efziVKztbR30Y1nR=Fz_bL6eyg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> On Fri, June 21, 2013 05:25, Tom Lane wrote:
> > "Erik Rijkers" <er(at)xs4all(dot)nl> writes:
> >> In a 112 MB test table (containing random generated text) with a trgm
> index (gin_trgm_ops), I consistently get these
> >> timings:
> >> select txt from azjunk6 where txt ~ '^abcd';
> >> 130 ms
> >> select txt from azjunk6
> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
> >> 3 ms
> >
> > Hm, could you provide a self-contained test case?
> >
>
> yes, sorry. I tested on a 1M row table:
>
> #!/bin/sh
>
> # create table:
> for power in 6;
> do
> table=azjunk${power}
> index=${table}_trgm_re_idx
> perl -E'
> sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
> say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 ..
> 1e'"${power};" \
> | psql -aqXc "
> drop table if exists $table;
> create table $table(txt text);
> copy $table from stdin;";
> echo "set session maintenance_work_mem='1GB';
> create index $index on $table using gin (txt gin_trgm_ops);
> analyze $table;" | psql -qtAX;
> done
>
> # test:
> echo "
> \\timing on
> explain analyze select txt from azjunk6 where txt ~ '^abcd'; -- slow (140
> ms)
> explain analyze select txt from azjunk6 where txt ~ 'abcd' and
> substr(txt,1,4) = 'abcd'; -- fast (5 ms)
> " | psql -Xqa
Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.
However trigrams '__a' is much more frequent than '_ab' which in turn is
much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
'__a' and '_ab' and that gives so significant speedup.
The problem is that trigram regex engine doesn't know that '__a' and '_ab'
is more frequent than another trigrams because we don't know their
distribution. However, simple assumption that blank character (in pg_trgm
meaning) is much more frequent than others seems to be true for most cases.
Attached patch implementing this assumption. It introduces BLANK_COLOR_SIZE
macro which is used in selectColorTrigrams like count of characters in
COLOR_BLANK. Another change in this patch is split of MAX_TRGM_COUNT into
WISH_TRGM_SIZE and MAX_TRGM_SIZE. The idea is to keep trying to remove
color trigrams from graph even when it fits into MAX_TRGM_SIZE, because we
are intending to scan less posting lists/trees.
Comments is not fixed yet, coming soon.
------
With best regards,
Alexander Korotkov.
Attachment | Content-Type | Size |
---|---|---|
trgm_regex_optimize.1.patch | application/octet-stream | 5.3 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2013-06-21 13:21:07 | Re: Request for Patch Feedback: Lag & Lead Window Functions Can Ignore Nulls |
Previous Message | Michael Paquier | 2013-06-21 11:54:34 | Re: Support for REINDEX CONCURRENTLY |