Re: BUG #18479: websearch_to_tsquery inconsistent behavior for german when using parentheses

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: esemmano(at)gmail(dot)com
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18479: websearch_to_tsquery inconsistent behavior for german when using parentheses
Date: 2024-06-13 22:04:20
Message-ID: 2130969.1718316260@sss.pgh.pa.us
Lists: pgsql-bugs

PG Bug reporting form <noreply(at)postgresql(dot)org> writes:
> Although the docs
> https://www.postgresql.org/docs/current/textsearch-controls.html say nothing
> about websearch_to_tsquery supporting parentheses in queries, I noticed some
> inconsistent behaviour when using multiple 'or' keywords with parentheses in
> postgres 15.4

The definition of websearch_to_tsquery says pretty plainly that
"Other punctuation is ignored". So I'd expect parens to do nothing.
That makes this problematic:

> select websearch_to_tsquery('german', 'foo or baz bar or (ding dong)');
> websearch_to_tsquery
> -----------------------------------------
> 'foo' | 'baz' & 'bar' | 'ding' & 'dong'

> select websearch_to_tsquery('german', 'foo or (baz bar) or (ding dong)');
> websearch_to_tsquery
> ------------------------------------------------
> 'foo' | 'baz' & 'bar' & 'or' & 'ding' & 'dong'

I found what seems to be the issue in gettoken_query_websearch: it
ignores ISOPERATOR chars (including parens) in WAITOPERAND state,
but not in WAITOPERATOR state. There, a paren is taken as the start
of the next operand, so we switch back to WAITOPERAND state, which
then consumes the "or" as a regular word. So a minimal fix could look
like the attached.
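
To make the state flip easier to follow, here is a toy model of the two
states in Python. It is only a sketch, not the C code and not the attached
patch: the function names, the skip_operator_chars_in_waitoperator flag, the
exact contents of OPERATOR_CHARS beyond the parens, and the word-splitting
rule are all assumptions made for illustration. Quoting, negation, stemming
and stop words are left out entirely.

```python
# Toy model of the WAITOPERAND / WAITOPERATOR flip described above.
# Assumption: this character set stands in for ISOPERATOR; only the
# parentheses are confirmed by the discussion.
OPERATOR_CHARS = "!&|()"


def is_or_keyword(query, i):
    """Return True if a standalone word "or" starts at position i."""
    j = i + 2
    return query[i:j].lower() == "or" and (j == len(query) or not query[j].isalnum())


def parse_websearch_sketch(query, skip_operator_chars_in_waitoperator):
    """Render a query the way the toy state machine sees it.

    The flag is hypothetical: False models the reported behavior, True
    models one possible reading of the minimal fix.
    """
    tokens = []
    state = "WAITOPERAND"
    i = 0
    while i < len(query):
        ch = query[i]
        if ch.isspace():
            i += 1
        elif state == "WAITOPERAND":
            if ch in OPERATOR_CHARS:
                i += 1  # ignored in this state, with or without the fix
                continue
            j = i
            while (j < len(query) and not query[j].isspace()
                   and query[j] not in OPERATOR_CHARS):
                j += 1
            tokens.append("'" + query[i:j] + "'")
            i = j
            state = "WAITOPERATOR"
        elif skip_operator_chars_in_waitoperator and ch in OPERATOR_CHARS:
            i += 1  # sketched fix: ignore the paren and keep waiting for an operator
        elif is_or_keyword(query, i):
            tokens.append("|")
            i += 2
            state = "WAITOPERAND"
        else:
            tokens.append("&")  # implicit AND; consumes no input
            state = "WAITOPERAND"
    if tokens and tokens[-1] in ("|", "&"):
        tokens.pop()  # drop a dangling operator left by a trailing paren
    return " ".join(tokens)


if __name__ == "__main__":
    query = "foo or (baz bar) or (ding dong)"
    print(parse_websearch_sketch(query, False))
    print(parse_websearch_sketch(query, True))
```

With the flag off, the model turns the second query from the report into the
same string shown there, with 'or' as an operand. With the flag on, it gives
the same result as the first query, which is what one would expect if parens
really do nothing.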

It's fairly confusing that this code manages to ignore non-ISOPERATOR
punctuation at all. It seems like that gets eaten by gettoken_tsvector()
and then later we decide there's not really a word there.

I'm also unsure why the same thing doesn't happen with the english
text search configuration. Not sure it's worth poking at more, though.

regards, tom lane

Attachment: draft-bug18479-fix.patch (text/x-diff, 527 bytes)