From: | Ivan Panchenko <i(dot)panchenko(at)postgrespro(dot)ru> |
---|---|
To: | pgsql-www(at)lists(dot)postgresql(dot)org |
Subject: | Re: Mailing list search engine: surprising missing results? |
Date: | 2022-01-25 17:02:36 |
Message-ID: | 79b3eb6e-152e-3c56-7b71-51d091c0f6d9@postgrespro.ru |
Lists: | pgsql-www |
On 25.01.2022 19:22, Tom Lane wrote:
> Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> writes:
>> On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
>>> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>>>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>>>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>>>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>>>>>> isn't:
>> Not quite. The problem in question is the "'boyer-moore':1".
>> If that were "'boyer-moor':1" instead, the problem would disappear.
> Actually, when I try this here, it seems like the stemming *is*
> consistent:
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
> to_tsvector
> ----------------------------------------------------------
> 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> regression=# SELECT to_tsvector('english', 'Boyer-Moore');
> to_tsvector
> -----------------------------------
> 'boyer':2 'boyer-moor':1 'moor':3
> (1 row)
>
> If you try variants of that where the first or third term is stemmable,
> say
>
> regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
> to_tsvector
> -----------------------------------------------------------
> 'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
> (1 row)
>
> it sure appears that each component word is stemmed independently
> already. So I think the original explanation here is wrong and
> we need to probe more closely.
The actual explanation can be seen by comparing a tsvector with a tsquery.
To avoid stemming effects, we use the 'simple' configuration below.
# select plainto_tsquery('simple','boyers-moore');
plainto_tsquery
-------------------------------------
'boyers-moore' & 'boyers' & 'moore'
# select to_tsvector('simple','boyers-moore-horspool');
to_tsvector
-------------------------------------------------------------
'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
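Obviously, such a tsvector does not match the above tsquery: the query requires the compound lexeme 'boyers-moore', while the tsvector contains only 'boyers-moore-horspool' as a compound. I think a better tsquery for this query would be
'boyers-moore' | ('boyers' & 'moore')
Maybe it is worth changing the to_tsquery() behavior for such cases.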
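The mismatch can be checked directly with the @@ operator. This is only a sketch: the tsvector and both tsqueries are copied as literals from the output above and cast, so the text search parser is bypassed and the result does not depend on the configuration. The names doc, v, current_query and proposed_query are made up for the illustration.

```sql
-- Literal tsvector as produced above for 'boyers-moore-horspool'.
WITH doc(v) AS (
    SELECT $$'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3$$::tsvector
)
SELECT
    -- Query as generated by plainto_tsquery('simple','boyers-moore'):
    -- fails because the lexeme 'boyers-moore' is not in the tsvector.
    v @@ $$'boyers-moore' & 'boyers' & 'moore'$$::tsquery          AS current_query,
    -- Suggested form: the compound OR both of its parts.
    v @@ $$'boyers-moore' | ('boyers' & 'moore')$$::tsquery        AS proposed_query
FROM doc;
```

The first column should come out false and the second true, since 'boyers' and 'moore' are both present in the tsvector while 'boyers-moore' is not.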
>
> regards, tom lane
>
>
Regards,
Ivan