Re: Counting the number of repeated phrases in a column

From: "Ivan E(dot) Panchenko" <i(dot)panchenko(at)postgrespro(dot)ru>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Counting the number of repeated phrases in a column
Date: 2022-01-26 10:38:01
Message-ID: 80903e46-1996-89a1-c937-831e44fcfcb7@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


On 26.01.2022 11:11, Shaozhong SHI wrote:
>
>
> On Tue, 25 Jan 2022 at 21:33, Ivan Panchenko
> <i(dot)panchenko(at)postgrespro(dot)ru <mailto:i(dot)panchenko(at)postgrespro(dot)ru>> wrote:
>
>
> On 26.01.2022 00:21, benj(dot)dev(at)laposte(dot)net
> <mailto:benj(dot)dev(at)laposte(dot)net> wrote:
> > Le 25/01/2022 à 18:10, Shaozhong SHI a écrit :
> >> There is a short of a function in the standard Postgres to do the
> >> following:
> >>
> >> It is easy to count the number of occurrence of words, but it is
> >> rather difficult to count the number of occurrence of phrases.
> >>
> >> For instance:
> >>
> >> A cell of value:  'Hello World' means 1 occurrence a phrase.
> >>
> >> A cell of value: 'Hello World World Hello' means no occurrence
> of any
> >> repeated phrase.
> >>
> >> But, A cell of value: 'Hello World World Hello Hello World'
> means 2
> >> occurrences of 'Hello World'.
> >>
> >> 'The City of London, London' also has no occurrences of any
> repeated
> >> phrase.
> >>
> >> Anyone has got such a function to check out the number of
> occurrence
> >> of any repeated phrases?
> >>
> >> Regards,
> >>
> >> David
> >
> > Don't know if it's exactly what you want, but you can replace all
> > occurence of the phrase in the text by empty string and compute the
> > diff between the initial and the result and next divide by the
> length
> > of your phrase.
> >
> > Example :
> > WITH x AS (SELECT 'toto like tata and toto like titi and toto like
> > tutu' , 'toto like' phrase)
> > SELECT (char_length(texte) - char_length(replace(texte, phrase,
> '')))
> > / char_length(phrase) AS nb_occurence
> > FROM x
> >
> This works if the user knows the phrase. As far as I understood, the
> phrase is not known, and user wants to count number of repeats of any
> phrases.
> Of course this can be done with recursive CTE. Split into words,
> generate all phrases (AFAIK requires recursion), then group and count.
>
> But probably in PL/Perl this could be done  more effectively.
>
>
> Is there an example of using recursive CTE to split a text string into
> words?

Recursion is not needed for splitting into words. This can be done by
regexp_split_to_table function.

But generation of all possible phrases from the given list of words
probably requires recursion. On the first step the list of words becomes
a list of a single-worded phrases. On each iteration then, you add the
next word to each existing phrase, if it is possible (i.e. until the
last word is reached).

>
> Regards,
>
> David

Regards,
Ivan

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Matthias Apitz 2022-01-26 10:55:38 SELECT with LIKE clause makes full table scan
Previous Message benj.dev 2022-01-26 09:52:59 Re: Counting the number of repeated phrases in a column