Re: [HACKERS] Broken select on regular expression !!!

From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Constantin Teodorescu <teo(at)flex(dot)ro>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Broken select on regular expression !!!
Date: 1999-05-21 01:57:14
Message-ID: 199905210157.KAA10818@srapc451.sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>> select * from regdemo where fld1 ~* '^41|^des';
>> fld1
>> ----
>> (0 rows)
>
>> ^^^^^^^^^^^^^^
>> !?!?!?!
>
>I see it too. Even more interesting is that these variants are OK:
>
>regression=> select * from regdemo where fld1 ~* '^des|^41';
>fld1
>-----------
>410
>destination
>(2 rows)
>
>regression=> select * from regdemo where fld1 ~* '(^41)|(^des)';
>fld1
>-----------
>410
>destination
>(2 rows)
>
>And if you want *really* disturbing:
>
>regression=> select * from regdemo where fld1 ~* '^sou|^des';
>fld1
>-----------
>source
>destination
>(2 rows)
>
>regression=> select * from regdemo where fld1 ~ '^sou|^des';
>fld1
>----
>(0 rows)
>
>Something is rotten in the state of Denmark...

These all oddness are caused by the parser (makeIndexable). When
makeIndexable sees ~* '^41|^des' , it tries to rewrite the target
regexp so that an index can be used. The rewritten query might be
something like:

fld1 ~* '^41|^des' and fld1 >= '41|^' and fld1 <= '41|^\377'

Apparently this is wrong. This is because makeIndexable does not
understand '|' and '^' appearing in the middle of the regexp. On the
other hand,

>regression=> select * from regdemo where fld1 ~* '^des|^41';
>regression=> select * from regdemo where fld1 ~* '^sou|^des';

will work since makeIndexable gave up the optimization if the op is
"~*" and a letter appearing right after '^' is *alphabet*.

Note that:

>regression=> select * from regdemo where fld1 ~ '^sou|^des';

will not work because the op is *not* "~*".

It seems that the only solution is checking '|' to see if it appears
in the target regexp and giving up the optimization in that case.

One might think that ~* '^41|^des' can be rewritten like:

fld1 ~* '^41' or fld1 ~* '^des'

For me this seems not to be a good idea. To accomplish this, we have
to deeply parse the regexp (consider that we might have arbitrary
complex regexps) and such kind thing is a job regexp() shoud
do.

Comments?
---
Tatsuo Ishii

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message D'Arcy J.M. Cain 1999-05-21 02:51:44 Cache lookup failed
Previous Message Hiroshi Inoue 1999-05-21 00:16:07 RE: [HACKERS] Current TODO list