From: | PG Bug reporting form <noreply(at)postgresql(dot)org> |
---|---|
To: | pgsql-bugs(at)lists(dot)postgresql(dot)org |
Cc: | dimitri(dot)nuescheler(at)gmail(dot)com |
Subject: | BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term |
Date: | 2021-02-13 23:22:33 |
Message-ID: | 16865-4ffdc3e682e6d75b@postgresql.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
The following bug has been logged on the website:
Bug reference: 16865
Logged by: Dimitri Nüscheler
Email address: dimitri(dot)nuescheler(at)gmail(dot)com
PostgreSQL version: 13.1
Operating system: Debian
Description:
Hello
I'm writing a small search engine application that also supports negation
(exclude results that contain terms starting with string).
After upgrading from PostgreSQL 12.5 to 13.1 the negation within a tsquery
when matched to a tsvector using the @@ operator no longer works properly,
so I started bisecting the commit history and tracked it down to this commit
(see the query and expected and actual result further below):
> commit 2f2007fbb255be178aca586780967f43885203a7 (HEAD, refs/bisect/bad)
> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Date: Fri Jul 24 15:26:51 2020 -0400
>
> Fix assorted bugs by changing TS_execute's callback API to ternary
logic.
> ...
I'm still working on creating a reproducible test-case without having to
share company data. I'm also trying to understand the code as a fun
exercise.
I can at least share some of the queries and result data:
DDL:
CREATE TABLE IF NOT EXISTS sherlock_catalog (
uri varchar,
description varchar NOT NULL,
metadata varchar NOT NULL,
textsearch tsvector GENERATED ALWAYS AS (to_tsvector('english',
sherlock_normalize(uri || ' ' || description || ' ' || metadata))) STORED,
last_seen timestamptz,
PRIMARY KEY (uri)
);
CREATE OR REPLACE FUNCTION sherlock_normalize(str varchar) RETURNS varchar
AS $$
BEGIN
RETURN lower(regexp_replace(regexp_replace(regexp_replace(str,
'[^a-zA-Z0-9]+', ' ', 'g'),'([a-z])([A-Z])','\1
\2','g'),'([A-Z][A-Z0-9])([a-z])','\1 \2','g'));
END;
$$
LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;
Query:
SELECT * FROM sherlock_catalog WHERE textsearch @@ '''full'':* &
''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinued%' LIMIT 2;
Result: 2 rows
Expected result: 0 rows, because the textsearch vector contains:
'discontinu':86
Plan:
Limit (cost=130.61..12789.24 rows=1 width=136)
-> Bitmap Heap Scan on sherlock_catalog (cost=130.61..12789.24 rows=1
width=136)
Recheck Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)
Filter: ((metadata)::text ~~ '%Discontinued%'::text)
-> Bitmap Index Scan on sherlock_catalog_textsearch
(cost=0.00..130.61 rows=3548 width=0)
Index Cond: (textsearch @@ '''full'':* & ''suppli'':* &
!''discontinu'':*'::tsquery)
The generated plan is structurally the same for PostgreSQL 12.5 respectively
versions before that commit, but if I alter the plan using (SET enable_*scan
= off) so that the planner comes up with a sequential scan, I will get the
expected results.
Some other results (count):
Count of the same query:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
4962
(1 Zeile)
Negation, but not a prefix search:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & !''discontinu'''::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
0
(1 Zeile)
Without negation:
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':* & ''discontinu'':*'::tsquery AND metadata LIKE
'%Discontinu%';
count
-------
13127
(1 Zeile)
Without the "discontinu" term at all
user=# SELECT COUNT(*) FROM sherlock_catalog WHERE textsearch @@ '''full'':*
& ''suppli'':*'::tsquery AND metadata LIKE '%Discontinu%';
count
-------
13127
(1 Zeile)
So it seems like the negated query manages to filter out some data, but not
all - as if it failed to recheck and definitely determine the TS_YES
respectively TS_NO answer from an in-precise TS_MAYBE answer from an
unprecise index-based answer (without position information?)? if I even
understand this remotely correctly, I'm new to this.
I'll try to find out more and to prepare shareable data that reproduces the
problem, but I also wonder if I manage to dive into the code and understand
something about it :-)
Kind Regards
Dimitri Nüscheler
From | Date | Subject | |
---|---|---|---|
Next Message | Dimitri Nüscheler | 2021-02-14 07:41:49 | Re: BUG #16865: Regression: GIN Negated prefix search returns results that contain the search term |
Previous Message | Viktor Semykin | 2021-02-13 22:08:44 | Re: An issue with missing rows |