Re: Regexp_replace bug / does not terminate on long strings

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Markhof, Ingolf" <ingolf(dot)markhof(at)de(dot)verizon(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Regexp_replace bug / does not terminate on long strings
Date: 2021-08-19 22:42:35
Message-ID: 1809528.1629412955@sss.pgh.pa.us
Lists: pgsql-general

"Markhof, Ingolf" <ingolf(dot)markhof(at)de(dot)verizon(dot)com> writes:
> BRIEF:
> regexp_replace(source,pattern,replacement,flags) needs very (!) long to
> complete or does not complete at all (?!) for big input strings (a few k
> characters). (Oracle SQL completes the same in a few ms)

Regexps containing backrefs are inherently hard --- every engine has
strengths and weaknesses. I doubt it'd be hard to find cases where
our engine is orders of magnitude faster than Oracle's; but you've
hit on a case where the opposite is true.

The core of the problem is that it's hard to tell how much of the
string could be matched by the (,\1)* subpattern. In principle, *all*
of the remaining string could be, if it were N repetitions of the
initial word. Or it could be N-1 repetitions followed by one other
word, and so on. The difficulty is that since our engine guarantees
to find the longest feasible match, it tries these options from
longest to shortest. Usually the actual match (if any) will be pretty
short, so for a string of N words you have O(N) wasted work per word,
making the runtime at least O(N^2).

I think your best bet is to not try to eliminate multiple duplicates
at a time. Get rid of one dup at a time, say by
    str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
and repeat till the string doesn't get any shorter.
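
As a minimal sketch of that loop, assuming PL/pgSQL and the default
standard_conforming_strings = on: the function name dedup_adjacent and its
signature are hypothetical, chosen only for illustration. Only the
regexp_replace call itself comes from the suggestion above. Note that the
pattern removes only adjacent duplicates.

```sql
-- Hypothetical helper: name and signature are illustrative assumptions.
CREATE OR REPLACE FUNCTION dedup_adjacent(str text)
RETURNS text
LANGUAGE plpgsql
IMMUTABLE
AS $func$
DECLARE
    prev_len integer;
BEGIN
    LOOP
        prev_len := length(str);
        -- Each pass drops at most one adjacent duplicate per word.
        str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
        -- Stop once a pass no longer shortens the string (or on NULL input).
        EXIT WHEN str IS NULL OR length(str) >= prev_len;
    END LOOP;
    RETURN str;
END;
$func$;

-- Usage: three passes here; the last one only detects that nothing changed.
SELECT dedup_adjacent('a,a,a,b,b,c');  -- a,b,c
```

Each pass pairs up adjacent duplicates, so a run of k identical words
shrinks to roughly k/2, and the loop should need only about log2(k) + 1
passes for the longest run.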

I did come across a performance bug [1] while poking at this, but
alas fixing it doesn't move the needle very much for this example.

regards, tom lane

[1] https://www.postgresql.org/message-id/1808998.1629412269%40sss.pgh.pa.us
