From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Pathological regexp match |
Date: | 2010-01-31 17:02:56 |
Message-ID: | 9837222c1001310902g68b9c352i4bff1ed494dc47ac@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Sat, Jan 30, 2010 at 08:07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Michael Glaesemann <michael(dot)glaesemann(at)myyearbook(dot)com> writes:
>> We came across a regexp that takes very much longer than expected.
>
> I poked into this a little bit. What seems to be happening is that the
> use of non-greedy quantifiers plus backreferences turns off most of the
> optimization that the regexp engine usually does, leaving the RE as a
> tree of possibilities that is explored in a fairly dumb fashion. In
> particular, there is a loop in crevdissect() that tries to locate a
> feasible division point between two concatenated sub-patterns, and
> for each try, a recursive call to crevdissect() tries to locate a
> feasible division point between *its* two sub-patterns, resulting in
> O(N^2) runtime. With a not very pleasant constant factor, too, because
> of repeated startup/shutdown costs for the DFA matching engine.
>
> I found a possible patch that seems to improve matters: AFAICS the DFA
> matching is independent of the checking that cdissect() and friends do,
> so we can apply that check first instead of second. This brings the
> runtime down from minutes to sub-second on my machine. However I don't
> have a whole lot of faith either in the correctness of this change, or
> that it might not make some other cases slower instead of faster.
> Has anyone got a reasonably messy collection of regexps they'd like
> to try this patch on?
I only have the one, so I don't think I can help all that much with that.
> BTW, I filed the problem upstream with the Tcl project
> https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894
> but I'm not going to hold my breath waiting for useful advice from
> them. Since Henry Spencer dropped off the net, there doesn't seem
> to be anybody there who understands that code much better than we do.
>
> Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
> in there ...
Yeah. I have zero experience around that code, so if oyu have at least
some, I'd much appreciate it if you (or someone who does) could look
at it. Likely to cause a lot less breakage than me :D
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
From | Date | Subject | |
---|---|---|---|
Next Message | Magnus Hagander | 2010-01-31 17:18:43 | Re: pg_listener entries deleted under heavy NOTIFY load only on Windows |
Previous Message | Tom Lane | 2010-01-31 16:56:11 | Re: odd output in initdb |