From: | contact(at)gregnavis(dot)com |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | BUG #14202: pg_trgm: % uses incorrect comparison of similarity with the limit |
Date: | 2016-06-20 11:53:21 |
Message-ID: | 20160620115321.5792.10766@wrigleys.postgresql.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
The following bug has been logged on the website:
Bug reference: 14202
Logged by: Greg Navis
Email address: contact(at)gregnavis(dot)com
PostgreSQL version: 9.6beta1
Operating system: Fedora 23 (Linux 4.5.6-200.fc23.x86_64)
Description:
# Summary
Due to a bug gtrgm_consistent, using a GiST or GIN trigram index can return
extraneous rows whose trigram-similarity was below set_limit().
# Version tested
PostgreSQL Git commit: 4d48adc
$ gcc --version
gcc (GCC) 5.3.1 20160406 (Red Hat 5.3.1-6)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.
$ uname -a
Linux localhost.localdomain 4.5.6-200.fc23.x86_64 #1 SMP Wed Jun 1 21:28:20
UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
# How to reproduce
Save the below to trgm_bug.sql:
-- Clean up after previous run.
DROP TABLE IF EXISTS restaurants;
DROP EXTENSION IF EXISTS pg_trgm CASCADE;
-- Create the table.
CREATE EXTENSION pg_trgm;
CREATE TABLE restaurants(city VARCHAR(255) NOT NULL);
CREATE INDEX restaurants_on_city_idx ON restaurants USING gist(city
gist_trgm_ops);
-- Insert 10000 rows to trigger index use.
INSERT INTO restaurants SELECT 'Warsaw' FROM generate_series(1, 10000);
INSERT INTO restaurants SELECT 'Szczecin' FROM generate_series(1, 10000);
-- Similarity of the two names (for reference).
SELECT similarity('Szczecin', 'Warsaw');
-- Set the limit to 0.3. Only Warsaw is returned (as it should).
SELECT set_limit(0.3);
SELECT DISTINCT city, similarity(city, 'Warsaw'), show_limit() FROM
restaurants WHERE city % 'Warsaw';
-- Raise the limit to 0.5. Now _both_ Warsaw and Szczecin are returned.
SELECT set_limit(0.5);
SELECT DISTINCT city, similarity(city, 'Warsaw'), show_limit() FROM
restaurants WHERE city % 'Warsaw';
and run "psql -f trgm_bug.sql".
# Results
The script returns the right result when the threshold is set to 0.3:
city | similarity | show_limit
--------+------------+------------
Warsaw | 1 | 0.3
(1 row)
However, it returns _both cities after rising the threshold to 0.5_:
city | similarity | show_limit
----------+------------+------------
Szczecin | 0 | 0.5
Warsaw | 1 | 0.5
# Root cause
The root cause is this line in contrib/pg_trgm/trgm_gist.c:
/* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
res = (*(int *) &tmpsml == *(int *) &nlimit || tmpsml > nlimit);
nlimit is of type double. tmpsml is of type float4. On my system
sizeof(float4) == 4 and sizeof(double) == 8. After adding:
printf("tmpsml = %f tmpsml:int = %i nlimit = %f nlimit:int = %i res = %i\n",
tmpsml, *(int *)&tmpsml, nlimit, *(int *)&nlimit, res);
after the offending line I can see the following output (after passing
through "sort | uniq"):
tmpsml = 0.000000 tmpsml:int = 0 nlimit = 0.300000 nlimit:int = 858993459
res = 0
tmpsml = 0.000000 tmpsml:int = 0 nlimit = 0.500000 nlimit:int = 0 res = 1
tmpsml = 1.000000 tmpsml:int = 1065353216 nlimit = 0.300000 nlimit:int =
858993459 res = 1
tmpsml = 1.000000 tmpsml:int = 1065353216 nlimit = 0.500000 nlimit:int = 0
res = 1
On line 2, we can see that nlimit = 0.5 was converted to 0 which triggered
the error.
# Possible fix
If we're okay with abandoning the bug work-around for FreeBSD 5.2.1 / gcc
3.3.3 then I suggest replacing
res = (*(int *) &tmpsml == *(int *) &nlimit || tmpsml > nlimit);
with
res = tmpsml >= nlimit;
This resolves the issue.
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2016-06-20 14:15:17 | Re: BUG #14202: pg_trgm: % uses incorrect comparison of similarity with the limit |
Previous Message | Tom Lane | 2016-06-19 20:42:24 | Re: Line intersection point is wrong |