Re: [E] Re: Regexp_replace bug / does not terminate on long strings

From: "Markhof, Ingolf" <ingolf(dot)markhof(at)de(dot)verizon(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [E] Re: Regexp_replace bug / does not terminate on long strings
Date: 2021-08-20 16:11:41
Message-ID: CALZg0g4T5u+9DUomsuN913MK_TS1GZM=8pMuJR+j=-0yiQdJrQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Tom,

thank you very much for your reply. Actually, I was assuming all these
regular expressions are based on the same core implementation.
Interestingly, this doesn't seem to be true...

I am also surprised that you say the (\1)+ subpattern is computationally
expensive. Regular expressions are greedy by default. I.e. in case of a*
matching against a string of 1000 a's, the system will not try a, aa, aaa,
... and so on, right? Instead, it will consume all the a's in one go.
Likewise, I was expecting that the system would eat all the repetitions of
a word in one go. Your proposal to use (\1)? instead at the first glance
seems to require more effort, because the same word and it's matching
successor will need to be matched again and again and again. Roughly, 2N
matches are to be done instead of just N.

However, you are perfectly right: When I use (\1)? instead of (\1)+, the
expression is evaluates quickly!

Thank you very much for looking into this and for proposing the alternative
approach which is working fine.

Regards
Ingolf

On Fri, Aug 20, 2021 at 12:42 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Markhof, Ingolf" <ingolf(dot)markhof(at)de(dot)verizon(dot)com> writes:
> > BRIEF:
> > regexp_replace(source,pattern,replacement,flags) needs very (!) long to
> > complete or does not complete at all (?!) for big input strings (a few k
> > characters). (Oracle SQL completes the same in a few ms)
>
> Regexps containing backrefs are inherently hard --- every engine has
> strengths and weaknesses. I doubt it'd be hard to find cases where
> our engine is orders of magnitude faster than Oracle's; but you've
> hit on a case where the opposite is true.
>
> The core of the problem is that it's hard to tell how much of the
> string could be matched by the (,\1)* subpattern. In principle, *all*
> of the remaining string could be, if it were N repetitions of the
> initial word. Or it could be N-1 repetitions followed by one other
> word, and so on. The difficulty is that since our engine guarantees
> to find the longest feasible match, it tries these options from
> longest to shortest. Usually the actual match (if any) will be pretty
> short, so that you have O(N) wasted work per word, making the runtime
> at least O(N^2).
>
> I think your best bet is to not try to eliminate multiple duplicates
> at a time. Get rid of one dup at a time, say by
> str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
> and repeat till the string doesn't get any shorter.
>
> I did come across a performance bug [1] while poking at this, but
> alas fixing it doesn't move the needle very much for this example.
>
> regards, tom lane
>
> [1]
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.postgresql.org_message-2Did_1808998.1629412269-2540sss.pgh.pa.us&d=DwIBAg&c=udBTRvFvXC5Dhqg7UHpJlPps3mZ3LRxpb6__0PomBTQ&r=ivZWA-ECVj3XrXBe0obDwKY7Ui7K5Nj9oD2KKWLm0Bw&m=q11bVTHCxVx8BQu2pjn6-3nOY8aN8hORXofVK38HqF8&s=hJxrzmTT6G7AUomoeFgh0IGDO3NcUP4gB9kvYHnt3m0&e=
>

======================================================================

Verizon Deutschland GmbH - Sebrathweg 20, 44149 Dortmund, Germany - Amtsgericht Dortmund, HRB 14952 - Geschäftsführer: Detlef Eppig - Vorsitzender des Aufsichtsrats: Francesco de Maio

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Markhof, Ingolf 2021-08-20 16:14:29 Re: [E] Re: Regexp_replace bug / does not terminate on long strings
Previous Message Tom Lane 2021-08-20 13:26:11 Re: Make bloom extension trusted, but can not drop with normal user