Phrase search vs. multi-lexeme tokens

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Phrase search vs. multi-lexeme tokens
Date:	2020-11-12 13:09:51
Message-ID:	CAPpHfdv0EzVhf6CWfB1_TTZqXV_2Sn-jSY3zSd7ePH=-+1V2DQ@mail.gmail.com
Lists:	pgsql-hackers

Hackers,

I'm investigating the bug report [1] about the behavior of
websearch_to_tsquery() with quotes and multi-lexeme tokens. See the
example below.

# select to_tsvector('pg_class foo') @@ websearch_to_tsquery('"pg_class foo"');
?column?
----------
f

So, the tsvector doesn't match the tsquery, even though exactly the
same text was passed to to_tsvector() and placed inside the quotes of
websearch_to_tsquery(). That looks wrong to me. Let's examine the
output of to_tsvector() and websearch_to_tsquery().

# select to_tsvector('pg_class foo');
to_tsvector
--------------------------
'class':2 'foo':3 'pg':1

# select websearch_to_tsquery('"pg_class foo"');
websearch_to_tsquery
------------------------------
( 'pg' & 'class' ) <-> 'foo'
(1 row)

So, the 'pg_class' token was split into two lexemes, 'pg' and 'class'.
But the output of websearch_to_tsquery() connects 'pg' and 'class' with
the & operator. The tsquery therefore expects both 'pg' and 'class' to
be immediate neighbors of 'foo', which means 'pg' and 'class' would
have to share the same position, and that isn't true for the tsvector.
Let's see how phraseto_tsquery() handles that.

# select to_tsvector('pg_class foo') @@ phraseto_tsquery('pg_class foo');
?column?
----------
t

# select phraseto_tsquery('pg_class foo');
phraseto_tsquery
----------------------------
'pg' <-> 'class' <-> 'foo'

phraseto_tsquery() connects all the lexemes with phrase operators and
everything works OK.
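
To make the difference concrete, here is a toy model of the two query
shapes evaluated against the positions from the tsvector above. It is a
sketch only, not PostgreSQL's implementation: the helper names
(lexeme, and_in_phrase, followed_by) are made up for illustration, and
it assumes that & under a phrase operator requires both operands to
occur at the same position, as described above.

```python
# Toy model of phrase matching over lexeme positions (not PostgreSQL code).
# Positions taken from the example tsvector: 'class':2 'foo':3 'pg':1
TSVECTOR = {"pg": {1}, "class": {2}, "foo": {3}}


def lexeme(word):
    """Return the set of positions at which a single lexeme occurs."""
    return set(TSVECTOR.get(word, set()))


def and_in_phrase(left, right):
    """Model '&' under a phrase operator: operands must share a position."""
    return left & right


def followed_by(left, right, distance=1):
    """Model '<->': right occurs exactly `distance` positions after left.

    Returns the positions of the right-hand operand that matched.
    """
    return {p + distance for p in left if p + distance in right}


# ( 'pg' & 'class' ) <-> 'foo'
websearch_style = followed_by(
    and_in_phrase(lexeme("pg"), lexeme("class")), lexeme("foo")
)

# 'pg' <-> 'class' <-> 'foo'
phrase_style = followed_by(
    followed_by(lexeme("pg"), lexeme("class")), lexeme("foo")
)

print(bool(websearch_style))  # False: 'pg' and 'class' never share a position
print(bool(phrase_style))     # True: positions 1, 2, 3 are consecutive
```

In this model the first query can only match a document in which 'pg'
and 'class' were recorded at the same position, which to_tsvector()
does not produce for 'pg_class'.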

To me it's obvious that phraseto_tsquery() and websearch_to_tsquery()
with quotes should work the same way. Notably, the current behavior of
websearch_to_tsquery() is recorded in the regression tests. So, it
might look as if this behavior is intended, but it's too ridiculous,
and I think the regression tests contain an oversight as well.

I've prepared a fix that doesn't break the fts parser abstractions
too much (see the attached patch), but I've run into a similar issue
in to_tsquery().

# select to_tsvector('pg_class foo') @@ to_tsquery('pg_class <-> foo');
?column?
----------
f

# select to_tsquery('pg_class <-> foo');
to_tsquery
------------------------------
( 'pg' & 'class' ) <-> 'foo'

I think that if a user writes 'pg_class <-> foo', it is expected to
match 'pg_class foo' regardless of which lexemes 'pg_class' is split
into.

This issue looks like a much more complex design bug in phrase
search. Fixing it would require some kind of readahead or multipass
processing, because when we process 'pg_class' we don't yet know
whether a phrase operator follows it.
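
As a rough illustration of what multipass processing could mean here,
the sketch below builds a tree of raw tokens first and expands
multi-lexeme tokens in a second pass, once it is known whether a token
sits under a phrase operator. Everything in it is hypothetical: the
tuple-based tree, split_token() (which just splits on '_' as a stand-in
for the real parser), and expand() are assumptions for illustration and
do not correspond to the attached patch or to PostgreSQL internals.

```python
# Hypothetical two-pass sketch; none of these names exist in PostgreSQL.
# Pass 1 (not shown) yields a tree of raw tokens such as
# ("<->", "pg_class", "foo"). Pass 2 expands multi-lexeme tokens.


def split_token(token):
    """Stand-in for the text search parser: split on '_' only."""
    return token.split("_")


def chain(lexemes, op):
    """Join lexemes into a left-deep chain, e.g. ('<->', 'pg', 'class')."""
    node = lexemes[0]
    for lex in lexemes[1:]:
        node = (op, node, lex)
    return node


def expand(node, in_phrase=False):
    """Expand raw tokens: '<->' under a phrase operator, '&' otherwise."""
    if isinstance(node, str):
        return chain(split_token(node), "<->" if in_phrase else "&")
    op, left, right = node
    in_phrase = in_phrase or op == "<->"
    return (op, expand(left, in_phrase), expand(right, in_phrase))


print(expand(("<->", "pg_class", "foo")))
# ('<->', ('<->', 'pg', 'class'), 'foo')
print(expand(("&", "pg_class", "foo")))
# ('&', ('&', 'pg', 'class'), 'foo')
```

The point is only that the choice between & and <-> for the parts of
'pg_class' is deferred until the surrounding operator is known.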

Is this really a design bug that has existed in phrase search from
the beginning, or am I missing something?

Links
1. https://www.postgresql.org/message-id/16592-70b110ff9731c07d%40postgresql.org

------
Regards,
Alexander Korotkov

Attachment	Content-Type	Size
websearch_fix_p2.patch	application/octet-stream	4.5 KB
