Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term

From: Dimitri Nüscheler <dimitri(dot)nuescheler(at)gmail(dot)com>
To: Dimitri Nüscheler <dimitri(dot)nuescheler(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term
Date: 2021-02-14 07:41:49
Message-ID: CABQ6qDwW4gYnEWLuxRJqxxVnStpwgUgKV_SbXc_EZytgSbT=vw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Sorry, I forgot to post the DDL for the GIN INDEX:
CREATE INDEX IF NOT EXISTS sherlock_catalog_textsearch ON sherlock_catalog
USING GIN (textsearch);

I also uploaded the scripts I used to bisect the issue here, but I didn't
include the data file yet. The file "steps" explains the steps.:
https://www.violetsky.ch/postgres-issue.tar.gz

I could share the data file with a few individuals until I find a way to
anonymize or reduce the data set. My attempts so far, resulted in the bug
no longer showing.

Regards
Dimitri

Am So., 14. Feb. 2021 um 00:23 Uhr schrieb PG Bug reporting form <
noreply(at)postgresql(dot)org>:

> The following bug has been logged on the website:
>
> Bug reference: 16865
> Logged by: Dimitri Nüscheler
> Email address: dimitri(dot)nuescheler(at)gmail(dot)com
> PostgreSQL version: 13.1
> Operating system: Debian
> Description:
>
> Hello
>
> I'm writing a small search engine application that also supports negation
> (exclude results that contain terms starting with string).
>
> After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery
> when matched to a tsvector using the @@ operator no longer works properly,
> so I started bisecting the commit history and tracked it down to this
> commit
> (see the query and expected and actual result further below):
>
> > commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
> > Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> > Date: Fri Jul 24 15:26:51 2020 -0400
> >
> > Fix assorted bugs by changing TS_execute's callback API to ternary
> logic.
> > ...
>
> I'm still working on creating a reproducible test-case without having to
> share company data. I'm also trying to understand the code as a fun
> exercise.
>
> I can at least share some of the queries and result data:
>
> DDL:
> CREATE TABLE IF NOT EXISTS sherlock_catalog (
> uri varchar,
> description varchar NOT NULL,
> metadata varchar NOT NULL,
> textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
> sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED,
> last_seen timestamptz,
> PRIMARY KEY (uri)
> );
> CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar
> AS $$
> BEGIN
> RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
> '[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
> \2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
> END;
> $$
> LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;
>
> Query:
> SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
> ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
> '%Discontinued%' LIMIT 2;
>
> Result: 2 rows
>
> Expected result: 0 rows, because the textsearch vector contains:
> 'discontinu':86
>
> Plan:
> Limit (cost=130.61..12789.24 rows=1 width=136)
> -> Bitmap Heap Scan on sherlock_catalog (cost=130.61..12789.24 rows=1
> width=136)
> Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
> !''discontinu'':*'::tsquery)
> Filter: ((metadata)::text ~~ '%Discontinued%'::text)
> -> Bitmap Index Scan on sherlock_catalog_textsearch
> (cost=0.00..130.61 rows=3548 width=0)
> Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
> !''discontinu'':*'::tsquery)
>
> The generated plan is structurally the same for PostgreSQL 12.5
> respectively
> versions before that commit, but if I alter the plan using (SET
> enable_*scan
> = off) so that the planner comes up with a sequential scan, I will get the
> expected results.
>
> Some other results (count):
>
> Count of the same query:
> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
> '''full'':*
> & ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
> '%Discontinu%';
> count
> -------
> 4962
> (1 Zeile)
>
>
> Negation, but not a prefix search:
> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
> '''full'':*
> & ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
> '%Discontinu%';
> count
> -------
> 0
> (1 Zeile)
>
>
> Without negation:
> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
> '''full'':*
> & ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
> '%Discontinu%';
> count
> -------
> 13127
> (1 Zeile)
>
> Without the "discontinu" term at all
>
> user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@
> '''full'':*
> & ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
> count
> -------
> 13127
> (1 Zeile)
>
>
> So it seems like the negated query manages to filter out some data, but not
> all - as if it failed to recheck and definitely determine the TS_YES
> respectively TS_NO answer from an in-precise TS_MAYBE answer from an
> unprecise index-based answer (without position information?)? if I even
> understand this remotely correctly, I'm new to this.
> I'll try to find out more and to prepare shareable data that reproduces the
> problem, but I also wonder if I manage to dive into the code and understand
> something about it :-)
>
> Kind Regards
> Dimitri Nüscheler
>
>

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Dimitri Nüscheler 2021-02-14 10:17:46 Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term
Previous Message PG Bug reporting form 2021-02-13 23:22:33 BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term