From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Silliness in regexp's citerdissect/creviterdissect |
Date: | 2021-08-19 22:31:09 |
Message-ID: | 1808998.1629412269@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
While poking at a report of slow regexp matching [1], I happened
to notice that citerdissect and creviterdissect are being remarkably
stupid about how to backtrack after a match failure. Specifically,
having used the DFA logic to identify K possible submatches, they
then start the slow recursive "cdissect" check of each submatch.
But, if the I'th submatch fails dissection, their response is to
see about adjusting the length of the last submatch ... which will
do nothing at all to make the result of the I'th submatch change,
if it's before K. When this happens, we should immediately
start adjusting the length of the I'th submatch.
Attached is a simple patch to do this. It passes check-world, and
shows the same results as before on Jacobson's web-regexps corpus,
as well as on a sample of random regexps made by Dilger's script.
I don't really see any performance difference on Jacobson's corpus,
but Dilger's script did find an example where this makes a huge
difference:
HEAD:
regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhjsq', '(.*)((\1)){9}');
regexp_match
--------------
(1 row)
Time: 9655.141 ms (00:09.655)
regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhxxxxxxxxxxjsq', '(.*)((\1)){9}');
regexp_match
--------------
{x,x,x}
(1 row)
Time: 271106.324 ms (04:31.106)
With patch:
regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhjsq', '(.*)((\1)){9}');
regexp_match
--------------
(1 row)
Time: 9.385 ms
regexp=# select regexp_match('nefajztngmsvfajztngmsvlwhxxxxxxxxxxjsq', '(.*)((\1)){9}');
regexp_match
--------------
{x,x,x}
(1 row)
Time: 25.103 ms
Admittedly this is a bit contrived. (In v14/HEAD, you can make the
performance problem go away by getting rid of the redundant capturing
parens, as then we don't invoke citerdissect at all. That trick doesn't
help in the back branches though.)
I think this is a pretty clear performance bug, and unless there are
objections I plan to push this fix into all branches.
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
backtrack-more-effectively-in-citerdissect.patch | text/x-diff | 2.3 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | David Christensen | 2021-08-19 22:43:55 | Re: [PATCH] Proof of concept for GUC improvements |
Previous Message | Vik Fearing | 2021-08-19 22:30:31 | Re: [PATCH] Proof of concept for GUC improvements |