Re: Mailing list search engine: surprising missing results?

From: Ivan Panchenko <i(dot)panchenko(at)postgrespro(dot)ru>
To: James Addison <jay(at)jp-hosting(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-www(at)lists(dot)postgresql(dot)org
Subject: Re: Mailing list search engine: surprising missing results?
Date: 2022-01-25 21:23:35
Message-ID: a73f39bc-94f9-e8c6-9428-9ce94b33a4a7@postgrespro.ru
Lists: pgsql-www

On 25.01.2022 23:48, James Addison wrote:
> I'm uncertain why parsing hyphenated query text produces compound tokens?

Because in some cases the user wants to search for whole hyphenated
words, not their parts.

But the parser is pluggable, so it is possible to develop another one,
such as pg_tsparser [1], which does the same for underscores.

The *to_tsquery functions are also replaceable. There can be many of
them, depending on different user requirements.
Such a function just translates a query from the user's query language,
with its semantics, into the tsquery language.
So you may write your own, and contribute it to the community or not.
Another option is to write a wrapper function that modifies the result
of an existing *to_tsquery function to fit your task.
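
A minimal sketch of such a wrapper, under stated assumptions: the function
name hyphen_phrase_tsquery is hypothetical, and for simplicity it rewrites
the input text before calling the built-in phraseto_tsquery rather than
editing the resulting tsquery. It replaces every hyphen with a space, so the
parser never sees a hyphenated word and emits no compound token, which gives
the 'boyers' <-> 'moore' form suggested in the quoted text below.

```sql
-- Hypothetical wrapper; the name is illustrative, not an existing function.
-- Hyphens become spaces before parsing, so no compound token is generated.
CREATE FUNCTION hyphen_phrase_tsquery(config regconfig, query text)
RETURNS tsquery
LANGUAGE sql
IMMUTABLE
AS $$
    SELECT phraseto_tsquery(config, replace(query, '-', ' '));
$$;

-- Check the rewritten query against the tsvector from the quoted example.
SELECT to_tsvector('simple', 'boyers-moore-horspool')
       @@ hyphen_phrase_tsquery('simple', 'boyers-moore') AS matches;
```

Given the positions in the quoted to_tsvector output ('boyers':2, 'moore':3),
the phrase query should match. Note that a blanket replace() also touches
hyphens that are not part of a word, so a real wrapper would likely need to
be more selective.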

> There are a couple of references[1][2] in the documentation about the
> dash character being converted to a boolean not (!) operator by
> websearch_to_tsquery, but that seems unrelated.
>
> postgres=# select plainto_tsquery('simple', 'a-b');
> plainto_tsquery
> -------------------
> 'a-b' & 'a' & 'b'
> (1 row)
>
> postgres=# select plainto_tsquery('simple', 'a_b');
> plainto_tsquery
> -----------------
> 'a' & 'b'
> (1 row)
>
> postgres=# select plainto_tsquery('simple', 'a+b');
> plainto_tsquery
> -----------------
> 'a' & 'b'
> (1 row)
In these examples, some characters are removed by the parser. Try
ts_debug('simple', 'a+b').
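
The same check can be run on all three inputs from the examples above in one
statement. This is only a sketch using the built-in ts_debug function; the
alias "samples" and the column name "input" are illustrative.

```sql
-- Show how the parser tokenizes each sample string and which
-- lexemes the 'simple' configuration produces for each token.
SELECT samples.input, d.alias, d.token, d.lexemes
FROM (VALUES ('a-b'), ('a_b'), ('a+b')) AS samples(input),
     LATERAL ts_debug('simple', samples.input) AS d;
```

Comparing the alias column across the three inputs shows which separators
the parser treats as part of a hyphenated word and which it drops.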
>
> [1] - https://www.postgresql.org/docs/14/functions-textsearch.html
> [2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
> On Tue, 25 Jan 2022 at 17:54, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Ivan Panchenko <i(dot)panchenko(at)postgrespro(dot)ru> writes:
>>> The actual explanation can be seen from comparing a tsvector with a tsquery.
>>> To avoid stemming effects, we use the simple configuration below.
>>> # select plainto_tsquery('simple','boyers-moore');
>>> plainto_tsquery
>>> -------------------------------------
>>> 'boyers-moore' & 'boyers' & 'moore'
>>> # select to_tsvector('simple','boyers-moore-horspool');
>>> to_tsvector
>>> -------------------------------------------------------------
>>> 'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
>>> Obviously, such a tsvector does not match the above tsquery. I think a better tsquery for this query would be
>>> 'boyers-moore' | ('boyers' & 'moore')
>>> Maybe it is worth changing to_tsquery() behavior for such cases.
>> Changing the behavior of to_tsquery is certainly a lot less scary
>> than changing to_tsvector --- it wouldn't call the validity of
>> existing tsvector indexes into question.
>>
>> I see that to_tsquery is even sillier than plainto_tsquery:
>>
>> regression=# select to_tsquery('simple','boyers-moore');
>> to_tsquery
>> -----------------------------------------
>> 'boyers-moore' <-> 'boyers' <-> 'moore'
>> (1 row)
>>
>> which is absolutely not a sane translation.
>>
>> It seems to me that in both cases we'd be better off generating
>> "'boyers' <-> 'moore'", without the compound token at all.
>> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
>> but I think if people wanted that they'd just enter separate words.

Matching the compound token might be significant for ranking. (?)

Probably there is no universal *to_tsquery function and no universal
parser that fits all users.

[1] https://github.com/postgrespro/pg_tsparser

>>
>> regards, tom lane
>>
>>
regards, Ivan
