Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

From: Dimitri Nüscheler <dimitri(dot)nuescheler(at)gmail(dot)com>
To: Dimitri Nüscheler <dimitri(dot)nuescheler(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term
Date: 2021-02-14 10:17:46
Message-ID: CABQ6qDw36YcsuM94Ku4fkevN+pubZdnb1T5DRGtNcpU_8iRPBw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

It seems the issue is not related to negation, because I also get these
results:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
'''full'':* & ''suppli'':* & ''discontinu'':*'::tsquery AND metadata NOT
LIKE '%iscontinu%';
count
-------
4499
(1 Zeile)

These results don't contain "discontinu" in the tsvector.

Am So., 14. Feb. 2021 um 08:41 Uhr schrieb Dimitri Nüscheler <
dimitri(dot)nuescheler(at)gmail(dot)com>:

> Sorry, I forgot to post the DDL for the GIN INDEX:
> CREATE INDEX IF NOT EXISTS sherlock_catalog_textsearch ON sherlock_catalog
> USING GIN (textsearch);
>
> I also uploaded the scripts I used to bisect the issue here, but I didn't
> include the data file yet. The file "steps" explains the steps.:
> https://www.violetsky.ch/postgres-issue.tar.gz
>
> I could share the data file with a few individuals until I find a way to
> anonymize or reduce the data set. My attempts so far, resulted in the bug
> no longer showing.
>
> Regards
> Dimitri
>
> Am So., 14. Feb. 2021 um 00:23 Uhr schrieb PG Bug reporting form <
> noreply(at)postgresql(dot)org>:
>
>> The following bug has been logged on the website:
>>
>> Bug reference: 16865
>> Logged by: Dimitri Nüscheler
>> Email address: dimitri(dot)nuescheler(at)gmail(dot)com
>> PostgreSQL version: 13.1
>> Operating system: Debian
>> Description:
>>
>> Hello
>>
>> I'm writing a small search engine application that also supports negation
>> (exclude results that contain terms starting with string).
>>
>> After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery
>> when matched to a tsvector using the @@ operator no longer works properly,
>> so I started bisecting the commit history and tracked it down to this
>> commit
>> (see the query and expected and actual result further below):
>>
>> > commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
>> > Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
>> > Date: Fri Jul 24 15:26:51 2020 -0400
>> >
>> > Fix assorted bugs by changing TS_execute's callback API to ternary
>> logic.
>> > ...
>>
>> I'm still working on creating a reproducible test-case without having to
>> share company data. I'm also trying to understand the code as a fun
>> exercise.
>>
>> I can at least share some of the queries and result data:
>>
>> DDL:
>> CREATE TABLE IF NOT EXISTS sherlock_catalog (
>> uri varchar,
>> description varchar NOT NULL,
>> metadata varchar NOT NULL,
>> textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
>> sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED,
>> last_seen timestamptz,
>> PRIMARY KEY (uri)
>> );
>> CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar
>> AS $$
>> BEGIN
>> RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
>> '[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
>> \2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
>> END;
>> $$
>> LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;
>>
>> Query:
>> SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
>> ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
>> '%Discontinued%' LIMIT 2;
>>
>> Result: 2 rows
>>
>> Expected result: 0 rows, because the textsearch vector contains:
>> 'discontinu':86
>>
>> Plan:
>> Limit (cost=130.61..12789.24 rows=1 width=136)
>> -> Bitmap Heap Scan on sherlock_catalog (cost=130.61..12789.24 rows=1
>> width=136)
>> Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
>> !''discontinu'':*'::tsquery)
>> Filter: ((metadata)::text ~~ '%Discontinued%'::text)
>> -> Bitmap Index Scan on sherlock_catalog_textsearch
>> (cost=0.00..130.61 rows=3548 width=0)
>> Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
>> !''discontinu'':*'::tsquery)
>>
>> The generated plan is structurally the same for PostgreSQL 12.5
>> respectively
>> versions before that commit, but if I alter the plan using (SET
>> enable_*scan
>> = off) so that the planner comes up with a sequential scan, I will get
>> the
>> expected results.
>>
>> Some other results (count):
>>
>> Count of the same query:
>> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
>> '''full'':*
>> & ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
>> '%Discontinu%';
>> count
>> -------
>> 4962
>> (1 Zeile)
>>
>>
>> Negation, but not a prefix search:
>> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
>> '''full'':*
>> & ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
>> '%Discontinu%';
>> count
>> -------
>> 0
>> (1 Zeile)
>>
>>
>> Without negation:
>> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
>> '''full'':*
>> & ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
>> '%Discontinu%';
>> count
>> -------
>> 13127
>> (1 Zeile)
>>
>> Without the "discontinu" term at all
>>
>> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
>> '''full'':*
>> & ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
>> count
>> -------
>> 13127
>> (1 Zeile)
>>
>>
>> So it seems like the negated query manages to filter out some data, but
>> not
>> all - as if it failed to recheck and definitely determine the TS_YES
>> respectively TS_NO answer from an in-precise TS_MAYBE answer from an
>> unprecise index-based answer (without position information?)? if I even
>> understand this remotely correctly, I'm new to this.
>> I'll try to find out more and to prepare shareable data that reproduces
>> the
>> problem, but I also wonder if I manage to dive into the code and
>> understand
>> something about it :-)
>>
>> Kind Regards
>> Dimitri Nüscheler
>>
>>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Dimitri Nüscheler 2021-02-14 22:05:41 Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term
Previous Message Dimitri Nüscheler 2021-02-14 07:41:49 Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term