Re: Counting the number of repeated phrases in a column

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: benj(dot)dev(at)laposte(dot)net
Cc: Rob Sargent <robjsargent(at)gmail(dot)com>, pgsql-general <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Counting the number of repeated phrases in a column
Date: 2022-01-27 23:10:42
Message-ID: CAHyXU0w2-jbeQ5HPmJH87gJHwkYCcG=Q0xTOUEuLO1e25wVh-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, Jan 27, 2022 at 11:56 AM <benj(dot)dev(at)laposte(dot)net> wrote:
> Le 27/01/2022 à 18:35, Merlin Moncure a écrit :
> > select distinct array_to_string(v[a:b], ' ') phrase, count(*) as occurrences
> > from
> > (
> > select array_agg(t) v
> > from
> > (
> > select trim(replace(unnest(v), E'\n', '')) t
> > from regexp_split_to_array(<sentence>, ' ') v
> > ) q
> > where length(t) > 1
> > ) q
> > cross join lateral generate_series(1, array_upper(v, 1)) a
> > cross join lateral generate_series(a + 1, array_upper(v, 1)) b
> > group by 1
> > having count(*) > 1;
> >
> > We are definitely in N^2 space here, so look for things to start
> > breaking down for sentences > 1000 words.
> >
> > merlin
> >
>
> (for better complexity) you may search about "Ukkonen suffix tree"
> Similar problem as yours :
> https://www.geeksforgeeks.org/suffix-tree-application-3-longest-repeated-substring/?ref=lbp

Yep. Many problems like this are well solved in imperative languages
and will fit poorly into SQL quase-functional space. That
implementation could probably be converted to pl/pgsql pretty easily,
or a 'sql + tables' variant as a fun challenge. It also slightly
exploits the fact that only the most repeated needle is returned,
rather than all of them.

Having the need to have single statement stateless SQL solutions to
interesting problems comes up all the time in common development
practice though for simplicity's sake even if there are better
approaches out there. It's also fun.

merlin

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Michael Harris 2022-01-28 05:27:57 Re: Undetected Deadlock
Previous Message Ben Chobot 2022-01-27 20:21:38 Re: could not open relation with OID