Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

From: Dimitri Nüscheler <dimitri(dot)nuescheler(at)gmail(dot)com>
To: Dimitri Nüscheler <dimitri(dot)nuescheler(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term
Date: 2021-02-14 22:05:41
Message-ID: CABQ6qDxru7+io1=4zg5jNvrZQKS32JMtjY49NXHbxXPhEDCP9A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Meanwhile I managed to anonymize the data. I put it in this archive
https://www.violetsky.ch/postgres-issue-anonymized.tar.gz
It's called anonymous.csv and the scripts to reproduce it are also included
(See the file "steps").

The query that reproduces the issue is now a bit different with the data
anonymized:

SELECT * FROM sherlock_catalog WHERE textsearch @@ '''glonoin'':* &
!''urinarium'':*'::tsquery AND metadata LIKE '%urinarium%' LIMIT 2;

You will be able to see that the textsearch tsvector contains "urinarium"
when it shouldn't.
I also checked that the commit right before
(25244b8972a34b838c4033fe9efc1d31cba9d0e3) does not have the issue and
returns 0 rows instead only.

Am So., 14. Feb. 2021 um 11:17 Uhr schrieb Dimitri Nüscheler <
dimitri(dot)nuescheler(at)gmail(dot)com>:

> It seems the issue is not related to negation, because I also get these
> results:
> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
> '''full'':* & ''suppli'':* & ''discontinu'':*'::tsquery AND metadata NOT
> LIKE '%iscontinu%';
> count
> -------
> 4499
> (1 Zeile)
>
> These results don't contain "discontinu" in the tsvector.
>
> Am So., 14. Feb. 2021 um 08:41 Uhr schrieb Dimitri Nüscheler <
> dimitri(dot)nuescheler(at)gmail(dot)com>:
>
>> Sorry, I forgot to post the DDL for the GIN INDEX:
>> CREATE INDEX IF NOT EXISTS sherlock_catalog_textsearch ON
>> sherlock_catalog USING GIN (textsearch);
>>
>> I also uploaded the scripts I used to bisect the issue here, but I didn't
>> include the data file yet. The file "steps" explains the steps.:
>> https://www.violetsky.ch/postgres-issue.tar.gz
>>
>> I could share the data file with a few individuals until I find a way to
>> anonymize or reduce the data set. My attempts so far, resulted in the bug
>> no longer showing.
>>
>> Regards
>> Dimitri
>>
>> Am So., 14. Feb. 2021 um 00:23 Uhr schrieb PG Bug reporting form <
>> noreply(at)postgresql(dot)org>:
>>
>>> The following bug has been logged on the website:
>>>
>>> Bug reference: 16865
>>> Logged by: Dimitri Nüscheler
>>> Email address: dimitri(dot)nuescheler(at)gmail(dot)com
>>> PostgreSQL version: 13.1
>>> Operating system: Debian
>>> Description:
>>>
>>> Hello
>>>
>>> I'm writing a small search engine application that also supports negation
>>> (exclude results that contain terms starting with string).
>>>
>>> After upgrading from PostgreSQL 12.5 to 13.1 the negation within a
>>> tsquery
>>> when matched to a tsvector using the @@ operator no longer works
>>> properly,
>>> so I started bisecting the commit history and tracked it down to this
>>> commit
>>> (see the query and expected and actual result further below):
>>>
>>> > commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
>>> > Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
>>> > Date: Fri Jul 24 15:26:51 2020 -0400
>>> >
>>> > Fix assorted bugs by changing TS_execute's callback API to ternary
>>> logic.
>>> > ...
>>>
>>> I'm still working on creating a reproducible test-case without having to
>>> share company data. I'm also trying to understand the code as a fun
>>> exercise.
>>>
>>> I can at least share some of the queries and result data:
>>>
>>> DDL:
>>> CREATE TABLE IF NOT EXISTS sherlock_catalog (
>>> uri varchar,
>>> description varchar NOT NULL,
>>> metadata varchar NOT NULL,
>>> textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
>>> sherlock_normalize(uri || ' ' || description || ' ' || metadata)))
>>> STORED,
>>> last_seen timestamptz,
>>> PRIMARY KEY (uri)
>>> );
>>> CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS
>>> varchar
>>> AS $$
>>> BEGIN
>>> RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
>>> '[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
>>> \2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
>>> END;
>>> $$
>>> LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;
>>>
>>> Query:
>>> SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
>>> ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
>>> '%Discontinued%' LIMIT 2;
>>>
>>> Result: 2 rows
>>>
>>> Expected result: 0 rows, because the textsearch vector contains:
>>> 'discontinu':86
>>>
>>> Plan:
>>> Limit (cost=130.61..12789.24 rows=1 width=136)
>>> -> Bitmap Heap Scan on sherlock_catalog (cost=130.61..12789.24
>>> rows=1
>>> width=136)
>>> Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
>>> !''discontinu'':*'::tsquery)
>>> Filter: ((metadata)::text ~~ '%Discontinued%'::text)
>>> -> Bitmap Index Scan on sherlock_catalog_textsearch
>>> (cost=0.00..130.61 rows=3548 width=0)
>>> Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
>>> !''discontinu'':*'::tsquery)
>>>
>>> The generated plan is structurally the same for PostgreSQL 12.5
>>> respectively
>>> versions before that commit, but if I alter the plan using (SET
>>> enable_*scan
>>> = off) so that the planner comes up with a sequential scan, I will get
>>> the
>>> expected results.
>>>
>>> Some other results (count):
>>>
>>> Count of the same query:
>>> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
>>> '''full'':*
>>> & ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
>>> '%Discontinu%';
>>> count
>>> -------
>>> 4962
>>> (1 Zeile)
>>>
>>>
>>> Negation, but not a prefix search:
>>> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
>>> '''full'':*
>>> & ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
>>> '%Discontinu%';
>>> count
>>> -------
>>> 0
>>> (1 Zeile)
>>>
>>>
>>> Without negation:
>>> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
>>> '''full'':*
>>> & ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
>>> '%Discontinu%';
>>> count
>>> -------
>>> 13127
>>> (1 Zeile)
>>>
>>> Without the "discontinu" term at all
>>>
>>> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
>>> '''full'':*
>>> & ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
>>> count
>>> -------
>>> 13127
>>> (1 Zeile)
>>>
>>>
>>> So it seems like the negated query manages to filter out some data, but
>>> not
>>> all - as if it failed to recheck and definitely determine the TS_YES
>>> respectively TS_NO answer from an in-precise TS_MAYBE answer from an
>>> unprecise index-based answer (without position information?)? if I even
>>> understand this remotely correctly, I'm new to this.
>>> I'll try to find out more and to prepare shareable data that reproduces
>>> the
>>> problem, but I also wonder if I manage to dive into the code and
>>> understand
>>> something about it :-)
>>>
>>> Kind Regards
>>> Dimitri Nüscheler
>>>
>>>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Devrim Gündüz 2021-02-15 13:32:40 Re: AW: BUG #16859: PostGIS 30 and 31 installation on SLES15 SP2 missing package SFCGAL or gmp
Previous Message Dimitri Nüscheler 2021-02-14 10:17:46 Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term