From: | Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com |
Subject: | Re: WIP: index support for regexp search |
Date: | 2012-12-03 10:05:16 |
Message-ID: | 50BC795C.5000509@vmware.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 02.12.2012 20:19, Tom Lane wrote:
> Alexander Korotkov<aekorotkov(at)gmail(dot)com> writes:
>> Nice idea to delay expanding colors to characters! Obviously, we should
>> delay expanding inly alphanumerical characters. Because non-alphanumberical
>> characters influence graph structure. Trying to implement...
>
> Uh, why would that be? Colors are colors. The regexp machinery doesn't
> care whether they represent alphanumerics or not. (Or to be more
> precise, if there is a situation where it makes a difference, separate
> colors will have been created for each set of characters that need to be
> distinguished.)
The regexp machinery doesn't care, but the trigrams that pg_trgm
extracts only contain alphanumeric characters. So if by looking at the
CNFA graph produced by the regexp machinery you conclude that any
matching strings must contain three-letter sequences "%oo" and "#oo",
you can just club them together into " oo" trigram.
I think you can run a pre-processing step to the colors, and merge
colors that are equivalent as far as trigrams are considered. For
example, if you have a color that contains only character '%', and
another that contains character '#', you can treat them as the same
hcolor. You might then be able to simplify the CNFA. Actually, it would
be even better if you could apply the pre-processing to the regexp
before the regexp machinery turns it into a CNFA. Not sure how easy it
would be to do such pre-processing.
BTW, why create the path matrix? You could check the "check" array of
trigrams in the consistent function directly against the graph.
Consistent should return true, iff there is a path through the graph
following only arcs that contain trigrams present in the check array.
Finding a path through a complex graph could be expensive, O(|E|), but
if the path is complex, the path matrix would be large as well, and
checking against a large matrix isn't exactly free either. It would
allow you to avoid "overflows" caused by having too many paths through
the graph.
- Heikki
From | Date | Subject | |
---|---|---|---|
Next Message | Dimitri Fontaine | 2012-12-03 10:27:40 | Re: ALTER command reworks |
Previous Message | Simon Riggs | 2012-12-03 09:56:51 | Re: Enabling Checksums |