Re: Regexp_replace bug / does not terminate on long strings

From: Michael Lewis <mlewis(at)entrata(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Markhof, Ingolf" <ingolf(dot)markhof(at)de(dot)verizon(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Regexp_replace bug / does not terminate on long strings
Date: 2021-08-19 23:52:36
Message-ID: CAHOFxGqBuVL6pjKn_G88D_OzE+ufG1GQ6xzr1ZGfQ+tXd_gLrw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

If you need it ordered, this is a bit awkward but works and returns for me
in about 5ms on my dev machine.

select string_agg( value, ',' ) As final_result from(

select

value,

min( row_num ) as min_row_num

from(

select

sub.value,

row_number() over () as row_num

from

( select unnest( string_to_array(
'1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,1/1,2/1,2/1,2/1,2/1,2/2,2/1,2/1,2/2,2/2,2/1,2/1,2/2,2/2,2/1,2/2,2/1,2/1,2/2,2/1,2/1,2/2,3/1,3/1,3/2,3/1,3/1,3/1,3/1,3/1,3/1,3/3,3/1,3/1,3/1,3/3,3/1,3/1,3/2,3/1,3/1,3/1,3/3,3/3,3/1,3/1,4/1,4/1,4/1,4/1,4/1,4/1,5/2,5/5,5/1,5/5,5/5,5/2,5/1,5/1,5/5,5/1,6/1,6/1,6/1,6/1,6/3,6/6,6/1,6/1,6/1,6/3,6/1,6/1,6/1,6/1,6/1,6/1,6/6,6/1,7/1,7/3,7/1,7/1,7/1,7/5,7/1,7/1,7/1,7/1,7/1,7/1,7/1,7/5,7/1,7/3,7/1,8/1,8/1,8/2,8/2,8/1,8/6,8/1,8/1,8/1,8/1,8/6,8/1,8/1,8/1,9/2,9/1,9/1,9/2,10/4,10/2,10/2,10/2,10/2,10/1,10/4,10/10,10/10,10/2,10/1,10/1,10/2,10/1,10/8,10/1,10/3,10/2,10/5,10/10,10/2,10/10,10/2,10/3,10/1,10/1,10/1,10/1,10/8,10/5,12/5,12/3,12/5,12/5,12/1,12/5,12/1,12/3,12/1,12/1,12/5,12/2,12/1,12/0.768,12/1,12/2,12/2,12/2,12/2,12/2,12/1,12/1,14/3,15/1,15/10,15/1,15/2,15/3,15/2,15/1,15/15,15/1,15/2,15/4,15/15,15/5,15/1,15/2,15/15,15/1,15/5,15/1,15/3,15/5,15/5,15/1,15/10,15/4,15/2,15/2,15/15,15/3,15/2,15/2,15/3,15/3,16/3,16/3,18/4,18/3,18/1,18/2,18/2,18/2,18/4,18/2,18/2,18/1,18/3,20/20,20/5,20/0.896,20/5,20/1,20/2,20/1,20/3,20/4,20/5,20/10,20/20,20/10,20/5,20/1,20/1,20/4,20/2,20/1,20/1,20/3,20/4,20/20,20/2,20/20,20/2,20/20,20/20,24/3,24/4,24/2,24/4,24/3,24/3,24/3,24/3,25/3,25/4,25/10,25/25,25/6,25/1,25/7,25/2,25/5,25/2,25/2,25/25,25/3,25/25,25/25,25/10,25/3,25/10,25/25,25/6,25/4,25/25,25/5,25/5,25/3,25/1,25/5,25/10,25/25,25/7,25/14,25/5,25/5,25/5,25/3,25/5,25/14,25/2,30/2,30/7,30/3,30/8,30/15,30/1,30/4,30/7,30/2,30/5,30/30,30/8,30/5,30/5,30/5,30/8,30/8,30/1,30/10,30/3,30/30,30/10,30/4,30/30,30/30,30/3,30/1,30/15,30/3,30/5,30/3,35/5,35/12,35/5,35/10,35/3,35/3,35/4,35/10,35/12,35/5,35/5,35/4,40/15,40/4,40/40,40/2,40/10,40/10,40/5,40/3,40/3,40/40,40/4,40/15,40/10,40/2,40/5,45/6,45/6,45/6,45/6,45/6,50/8,50/4,50/50,50/8,50/15,50/7,50/3,50/20,50/25,50/50,50/5,50/50,50/12,50/7,50/4,50/15,50/10,50/8,50/3,50/2,50/20,50/25,50/10,50/8,50/50,50/5,50/5,50/50,50/10,50/10,50/10,50/5,50/5,50/4,50/10,50/50,50/5,50/50,50/8,50/8,50/50,50/50,50/10,50/12,50/2,50/5,55/10,55/10,55/5,55/5,60/3,60/25,60/4,60/60,60/30,60/25,60/6,60/6,60/10,60/5,60/5,60/3,60/10,60/5,60/5,60/10,60/30,60/6,60/5,60/10,60/60,70/10,70/10,75/15,75/3,75/4,75/10,75/20,75/5,75/6,75/8,75/6,75/7,75/75,75/7,75/15,75/25,75/15,75/7,75/3,75/15,75/8,75/30,75/75,75/8,75/8,75/5,75/20,75/75,75/6,75/30,75/15,75/75,75/8,75/25,75/15,75/7,75/75,75/10,75/4,80/5,80/5,90/50,90/50,100/7,100/10,100/10,100/10,100/10,100/10,100/8,100/20,100/10,100/20,100/20,100/5,100/25,100/8,100/5,100/100,100/8,100/20,100/8,100/10,100/20,100/7,100/6,100/50,100/15,100/10,100/2,100/35,100/10,100/10,100/35,100/30,100/100,100/5,100/40,100/35,100/100,100/50,100/35,100/30,100/7,100/10,100/10,100/7,100/25,100/100,100/40,100/5,100/15,100/6,100/7,100/20,100/10,100/2,100/20,105/10,105/20,105/10,105/20,120/15,120/10,120/15,120/15,150/150,150/150,150/75,150/20,150/20,150/20,150/30,150/75,150/25,150/15,150/25,150/150,150/15,150/6,150/6,150/30,150/20,200/15,200/15,200/20,200/10,200/40,200/15,200/40,200/50,200/7,200/20,200/15,200/25,200/20,200/7,200/15,200/7,200/15,200/20,200/20,200/200,200/15,200/50,200/10,200/20,200/20,200/20,200/200,200/20,200/25,200/7,240/15,240/15,250/20,250/50,250/20,250/10,250/10,250/25,250/250,250/250,250/25,250/50,250/25,300/20,300/20,300/30,300/7,300/20,300/300,300/300,300/20,300/30,300/20,300/7,300/10,300/20,300/20,300/30,300/20,300/7,300/7,300/20,300/30,300/30,300/300,300/50,300/300,300/30,300/10,300/300,300/300,300/50,300/300,400/20,400/20,400/25,400/25,450/50,450/50,500/500,500/50,500/50,500/50,500/50,500/500,500/500,500/500,500/35,500/25,500/35,500/25,500/500,600/40,600/40,1000/20,1000/40,1000/20,1000/1000,1000/20,1000/35,1000/1000,1000/20,1000/50,1000/500,1000/50,1000/50,1000/1000,1000/500,1000/1000,1000/35,1000/40',
',' ) ) as value

) as sub

) sub_outer

group by value

order by min_row_num

) sub_outermost;

/* return value
1/1,2/1,2/2,3/1,3/2,3/3,4/1,5/2,5/5,5/1,6/1,6/3,6/6,7/1,7/3,7/5,8/1,8/2,8/6,9/2,9/1,10/4,10/2,10/1,10/10,10/8,10/3,10/5,12/5,12/3,12/1,12/2,12/0.768,14/3,15/1,15/10,15/2,15/3,15/15,15/4,15/5,16/3,18/4,18/3,18/1,18/2,20/20,20/5,20/0.896,20/1,20/2,20/3,20/4,20/10,24/3,24/4,24/2,25/3,25/4,25/10,25/25,25/6,25/1,25/7,25/2,25/5,25/14,30/2,30/7,30/3,30/8,30/15,30/1,30/4,30/5,30/30,30/10,35/5,35/12,35/10,35/3,35/4,40/15,40/4,40/40,40/2,40/10,40/5,40/3,45/6,50/8,50/4,50/50,50/15,50/7,50/3,50/20,50/25,50/5,50/12,50/10,50/2,55/10,55/5,60/3,60/25,60/4,60/60,60/30,60/6,60/10,60/5,70/10,75/15,75/3,75/4,75/10,75/20,75/5,75/6,75/8,75/7,75/75,75/25,75/30,80/5,90/50,100/7,100/10,100/8,100/20,100/5,100/25,100/100,100/6,100/50,100/15,100/2,100/35,100/30,100/40,105/10,105/20,120/15,120/10,150/150,150/75,150/20,150/30,150/25,150/15,150/6,200/15,200/20,200/10,200/40,200/50,200/7,200/25,200/200,240/15,250/20,250/50,250/10,250/25,250/250,300/20,300/30,300/7,300/300,300/10,300/50,400/20,400/25,450/50,500/500,500/50,500/35,500/25,600/40,1000/20,1000/40,1000/1000,1000/35,1000/50,1000/500

*/

If you don't need the order maintained, it becomes a much simpler problem
and you can strip off some of this complexity.

*Michael Lewis | Database Engineer*
*Entrata*

On Thu, Aug 19, 2021 at 4:42 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Markhof, Ingolf" <ingolf(dot)markhof(at)de(dot)verizon(dot)com> writes:
> > BRIEF:
> > regexp_replace(source,pattern,replacement,flags) needs very (!) long to
> > complete or does not complete at all (?!) for big input strings (a few k
> > characters). (Oracle SQL completes the same in a few ms)
>
> Regexps containing backrefs are inherently hard --- every engine has
> strengths and weaknesses. I doubt it'd be hard to find cases where
> our engine is orders of magnitude faster than Oracle's; but you've
> hit on a case where the opposite is true.
>
> The core of the problem is that it's hard to tell how much of the
> string could be matched by the (,\1)* subpattern. In principle, *all*
> of the remaining string could be, if it were N repetitions of the
> initial word. Or it could be N-1 repetitions followed by one other
> word, and so on. The difficulty is that since our engine guarantees
> to find the longest feasible match, it tries these options from
> longest to shortest. Usually the actual match (if any) will be pretty
> short, so that you have O(N) wasted work per word, making the runtime
> at least O(N^2).
>
> I think your best bet is to not try to eliminate multiple duplicates
> at a time. Get rid of one dup at a time, say by
> str := regexp_replace(str, '([^,]+)(,\1)?($|,)', '\1\3', 'g');
> and repeat till the string doesn't get any shorter.
>
> I did come across a performance bug [1] while poking at this, but
> alas fixing it doesn't move the needle very much for this example.
>
> regards, tom lane
>
> [1]
> https://www.postgresql.org/message-id/1808998.1629412269%40sss.pgh.pa.us
>
>
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Michael Lewis 2021-08-19 23:58:03 Re: Regexp_replace bug / does not terminate on long strings
Previous Message Tom Lane 2021-08-19 22:42:35 Re: Regexp_replace bug / does not terminate on long strings