From: | Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Regex pattern with shorter back reference does NOT work as expected |
Date: | 2013-07-10 11:31:23 |
Message-ID: | CAM2+6=U8CdfM-qL55XHt+7hVzDRBnZwrHiVZRX2shGZ4OMuMSQ@mail.gmail.com |
Lists: | pgsql-hackers |
Hi Tom,
The following examples do not work as expected:
-- Should return TRUE, but returns FALSE
SELECT 'Programmer' ~ '(\w).*?\1' as t;
-- Should return three rows (P, a and er), but returns just one row with
-- the value Programmer
SELECT REGEXP_SPLIT_TO_TABLE('Programmer','(\w).*?\1');
Initially I thought that back-references were not supported and that this
explained the results. But while trying a few cases related to back-references
I saw the error "invalid back-reference number", which means we do support
back-references. So I tried a few more scenarios and observed that with the
input string 'rogrammer' we get the correct results, i.e. when the very first
character is back-referenced. It fails when the first character is not part
of the back-referenced match.
This happens only with the non-greedy (shortest-match) pattern. The greedy
pattern '(\w).*\1', which finds the longer match, works well.
Clearly, the above example contains two matching substrings, 'rogr' and 'mm'.
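As a cross-check of the expected results, here is a minimal sketch using Python's re module. This is a different regex engine from PostgreSQL's and is assumed here only for comparison; the helper split_on_matches is a hypothetical name introduced for illustration.

```python
import re

text = "Programmer"
lazy = re.compile(r"(\w).*?\1")    # non-greedy: shortest span up to the repeated character
greedy = re.compile(r"(\w).*\1")   # greedy: longest span up to the repeated character


def split_on_matches(pattern, s):
    """Return the pieces of s that lie between successive matches.

    re.split() would also return the captured group, so the pieces are
    sliced out by hand from the match spans instead.
    """
    pieces, last = [], 0
    for m in pattern.finditer(s):
        pieces.append(s[last:m.start()])
        last = m.end()
    pieces.append(s[last:])
    return pieces


lazy_matches = [m.group(0) for m in lazy.finditer(text)]
greedy_matches = [m.group(0) for m in greedy.finditer(text)]

print(lazy_matches)                    # the two short matches named above
print(split_on_matches(lazy, text))    # the three pieces the split should yield
print(greedy_matches)                  # the single long match
```

In this engine the non-greedy pattern finds 'rogr' and 'mm', so splitting on it leaves P, a and er, which is what the queries above were expected to return.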
So I started debugging to find the root cause. The code is too complex for me
to understand exactly what is happening, but while debugging I found this
chunk in the cfindloop() function in regexec.c, from which we return
REG_NOMATCH:
{
/* no point in trying again */
*coldp = cold;
return REG_NOMATCH;
}
The search started at 'P' and ended in the above block. It seemed strange that
it did not continue with the next character, i.e. from 'r'. So I replaced the
above chunk with a break statement so that the search continues from the next
character. This trick worked well.
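The symptom described above can be modelled with a toy scan loop. This is only a sketch under stated assumptions: it is not PostgreSQL's cfindloop(), the function names are hypothetical, and the hand-rolled matcher stands in for the pattern '(\w).*?\1' only. It shows why giving up after the first start position hides matches that begin later in the string; it does not show that the break is the correct fix in the real code.

```python
def shortest_backref_at(s, start):
    # Hypothetical stand-in for matching (\w).*?\1 anchored at `start`:
    # take the character at `start` and look for its nearest repeat.
    # Returns the end index of the shortest match, or None if there is none.
    c = s[start]
    if not (c.isalnum() or c == "_"):
        return None
    nxt = s.find(c, start + 1)
    return None if nxt == -1 else nxt + 1


def find_giving_up(s):
    # Models the reported behaviour: the first failing start position
    # ends the whole search with "no match".
    for start in range(len(s)):
        end = shortest_backref_at(s, start)
        if end is None:
            return None  # analogous to the early `return REG_NOMATCH`
        return s[start:end]
    return None


def find_continuing(s):
    # Models the behaviour after the change: a failing start position
    # just moves the search on to the next character.
    for start in range(len(s)):
        end = shortest_backref_at(s, start)
        if end is None:
            continue
        return s[start:end]
    return None


print(find_giving_up("Programmer"))   # fails at 'P' and stops
print(find_giving_up("rogrammer"))    # first character is back-referenced, so it matches
print(find_continuing("Programmer"))  # moves on from 'P' and finds the match at 'r'
```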
Since I know very little about this area of the code, I am not sure that this
is indeed a correct fix, and so I thought of mailing hackers.
I have attached a patch that makes the above change, along with a few tests in
regex.sql.
Your valuable insights, please...
Thanks
--
Jeevan B Chalke
Attachment | Content-Type | Size |
---|---|---|
regexp_backref_shorter.patch | application/octet-stream | 2.7 KB |