websearch_to_tsquery() returns queries that don't match to_tsvector()

From: Valentin Gatien-Baron <valentin(dot)gatienbaron(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: websearch_to_tsquery() returns queries that don't match to_tsvector()
Date: 2021-04-18 14:53:36
Message-ID: CA+0DEqiZs7gdOd4ikmg=0UWG+SwWOLxPsk_JW-sx9WNOyrb0KQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Hi,

I'm surprised that the following expression is false:

select to_tsvector('english', 'aaa: bbb') @@
websearch_to_tsquery('english', '"aaa: bbb"');
?column?
----------
f
(1 row)

My expectation is that to_tsvector('english', text) @@
websearch_to_tsquery('english', '" || text || "') would be true for
all texts, or pretty close to all texts. Otherwise it makes search
rather unpredictable. The actual example that started this
investigation was searching for '"/path/to/some/exe: no such file or
directory"' (which was failing to find the exact matches that I knew
existed).

Looking at the tsvector and tsquery, we can see that the problem is
that the ":" counts as one position for the ts_query but not the
ts_vector:

select to_tsvector('english', 'aaa: bbb'), websearch_to_tsquery('english',
'"aaa: bbb"');
to_tsvector | websearch_to_tsquery
-----------------+----------------------
'aaa':1 'bbb':2 | 'aaa' <2> 'bbb'
(1 row)

So I wondered: are there more such cases? Looking at all texts of the
form 'aaa' || maybe-space || one-byte || maybe-space || 'bbb', it
happens quite a bit:

select text, ts_vector, ts_query, matches from unnest(array['', ' ']) as
prefix, unnest(array['', ' ']) as suffix, (select chr(a) as char from
generate_series(1,192) as s(a)) as zz1, lateral (select 'aaa' || prefix ||
char || suffix || 'bbb' as text) as zz2, lateral (select
to_tsvector('english', text) as ts_vector) as zz3, lateral (select
websearch_to_tsquery('english', '"' || text || '"') as ts_query) as zz4,
lateral (select ts_vector @@ ts_query as matches) as zz5 where not matches;
text | ts_vector | ts_query | matches
----------------+-----------------+------------------+---------
aaa \x01 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x02 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x03 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x04 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x05 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x06 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x07 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x08 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x0E bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x0F bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x10 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x11 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x12 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x13 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x14 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x15 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x16 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x17 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x18 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x19 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x1A bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x1B bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x1C bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x1D bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x1E bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x1F bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa # bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa $ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa % bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ' bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa * bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa + bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa , bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa . bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa / bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa: bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa : bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ; bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa = bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa > bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ? bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa @ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa [ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ] bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ^ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa _ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ` bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa { bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa } bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ~bbb | 'aaa':1 'bbb':2 | 'aaa' <-> '~bbb' | f
aaa ~ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \x7F bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0080 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0081 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0082 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0083 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0084 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0085 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0086 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0087 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0088 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0089 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u008A bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u008B bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u008C bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u008D bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u008E bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u008F bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0090 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0091 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0092 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0093 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0094 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0095 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0096 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0097 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0098 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u0099 bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u009A bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u009B bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u009C bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u009D bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u009E bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa \u009F bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¡ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¢ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa £ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¤ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¥ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¦ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa § bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¨ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa © bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa « bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¬ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ­ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ® bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¯ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ° bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ± bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ² bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ³ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ´ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¶ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa · bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¸ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¹ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa » bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¼ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ½ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¾ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
aaa ¿ bbb | 'aaa':1 'bbb':2 | 'aaa' <2> 'bbb' | f
(114 rows)

There is no obvious workaround either:

- there's no function that converts a tsvector like 'aaa':1 'bbb':2
into a tsquery like 'aaa' <-> 'bbb', that one might be able to use to
build a query with exactly the same normalization as tsvector.

- replacing all problematic characters above by spaces seems to work
for most characters but not others, as for instance it fixes 'aaa
. bbb' but breaks 'aaa.bbb'.

select version();
version

---------------------------------------------------------------------------------------------------------
PostgreSQL 14devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit
(1 row)

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Pete O'Such 2021-04-19 00:23:00 Re: BUG #16959: Unnesting null from string_to_array silently removes whole rows from result
Previous Message Tom Lane 2021-04-17 15:16:11 Re: BUG #16969: INSERT of multiple rows into GENERATED ALWAYS AS IDENTITY column with DEFAULT value is broken.

Browse pgsql-hackers by date

  From Date Subject
Next Message Andy Fan 2021-04-18 15:06:15 2 questions about volatile attribute of pg_proc.
Previous Message Andy Fan 2021-04-18 14:39:18 Consider parent's stats for set_append_rel_size.