From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Pathological regexp match |
Date: | 2010-01-30 07:07:46 |
Message-ID: | 25719.1264835266@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com> writes:
> We came across a regexp that takes very much longer than expected.
I poked into this a little bit. What seems to be happening is that the
use of non-greedy quantifiers plus backreferences turns off most of the
optimization that the regexp engine usually does, leaving the RE as a
tree of possibilities that is explored in a fairly dumb fashion. In
particular, there is a loop in crevdissect() that tries to locate a
feasible division point between two concatenated sub-patterns, and
for each try, a recursive call to crevdissect() tries to locate a
feasible division point between *its* two sub-patterns, resulting in
O(N^2) runtime. With a not very pleasant constant factor, too, because
of repeated startup/shutdown costs for the DFA matching engine.
I found a possible patch that seems to improve matters: AFAICS the DFA
matching is independent of the checking that cdissect() and friends do,
so we can apply that check first instead of second. This brings the
runtime down from minutes to sub-second on my machine. However I don't
have a whole lot of faith either in the correctness of this change, or
that it might not make some other cases slower instead of faster.
Has anyone got a reasonably messy collection of regexps they'd like
to try this patch on?
BTW, I filed the problem upstream with the Tcl project
https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894
but I'm not going to hold my breath waiting for useful advice from
them. Since Henry Spencer dropped off the net, there doesn't seem
to be anybody there who understands that code much better than we do.
Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
in there ...
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
regexp.patch | text/x-patch | 2.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Eisentraut | 2010-01-30 09:33:43 | Re: PG 9.0 and standard_conforming_strings |
Previous Message | Mark Mielke | 2010-01-30 06:25:59 | Re: PG 9.0 and standard_conforming_strings |