Re: BUG #18149: Incorrect lexeme for english token "proxy"

From: Patrick Peralta <pperalta(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18149: Incorrect lexeme for english token "proxy"
Date: 2023-10-07 16:07:18
Message-ID: CADV9oGwxvJESMPTvXNT-Uz-aoUxJmmXJJxMGxiyyWA85Yas32A@mail.gmail.com
Lists: pgsql-bugs

Hi Tom and Laurenz,

Thank you for your timely replies. I may have misdiagnosed my problem, so
I'll elaborate a bit more.

As you mentioned, I see that searching for the term "proxy" works in
general:

# SELECT to_tsvector('english', 'I set up an http proxy for my network.')
@@ to_tsquery('english','proxy');
?column?
----------
t
(1 row)

However, I ran into an anomaly with this query:

# SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256')
@@ to_tsquery('english','cloud-proxy:*');
?column?
----------
f
(1 row)

When I search with the prefix "cloud-proxy" it doesn't match the input.

When I try this:

# SELECT to_tsvector('english', 'CLOUD-SERVER-SEP19-T1-254--1695167380256')
@@ to_tsquery('english','cloud-server:*');
?column?
----------
t

The prefix "cloud-server" works.

Furthermore, if I switch to the 'simple' dictionary instead of 'english':

# SELECT to_tsvector('simple', 'CLOUD-PROXY-SEP19-T1-254--1695167380256')
@@ to_tsquery('simple','cloud-proxy:*');
?column?
----------
t
(1 row)

Here the 'cloud-proxy' prefix works.

The only difference I can see is that the 'simple' dictionary uses 'proxy'
as the lexeme whereas the 'english' dictionary uses 'proxi'. Could this
explain the difference in these queries, or is there something else I'm
missing?
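
One way to test this hypothesis is to compare the lexemes each side actually
produces, rather than only the boolean match result. A minimal sketch, assuming
nothing beyond the built-in to_tsvector, to_tsquery, and ts_debug functions
(output is intentionally omitted):

```sql
-- Lexemes stored for the document under the 'english' configuration.
SELECT to_tsvector('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');

-- Lexemes the prefix query expands to under the same configuration.
SELECT to_tsquery('english', 'cloud-proxy:*');

-- Token-by-token view: the token type the parser assigns to each piece
-- and the lexemes the dictionaries emit for it.
SELECT alias, token, lexemes
FROM ts_debug('english', 'CLOUD-PROXY-SEP19-T1-254--1695167380256');
```

If the query side contains a stemmed lexeme that is not a prefix of any lexeme
on the document side, that would account for the 'f' result. Repeating the same
three statements with 'simple' in place of 'english' should show where the two
diverge.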

Thanks,
Patrick

On Sat, Oct 7, 2023 at 10:18 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> writes:
> > On Thu, 2023-10-05 at 21:44 +0000, PG Bug reporting form wrote:
> >> The english dictionary is using the lexeme "proxi" for the token
> >> "proxy". As a result, the search term "proxy" is not yielding results
> >> for records that contain this word.
>
> > I cannot reproduce that.
>
> Me either. It suggests that you're trying to match against documents
> that haven't been put through the same normalization process as the
> query.
>
> >> I think this lexeme was chosen to support the plural of proxy which is
> >> proxies. However there are other plurals where the root word is spelled
> >> different and Postgres creates the correct lexeme such as: [goose or
> >> mouse]
>
> > The snowball dictionary has no real knowledge of the words. Stemming is
> > done by applying some heuristics which work "well enough" in most cases.
>
> Yeah. I don't see anything hugely wrong with this particular
> transformation. It is doing something useful, in that "proxy"
> and "proxies" are both converted to the same lexeme "proxi".
> In an ideal world, the lexeme would be "proxy", but it doesn't
> really make that much difference if it isn't.
>
> In any case, changing it now wouldn't be very practical, because
> existing documents will already have been made into tsvectors
> using this rule.
>
> regards, tom lane
>
