BUG #14202: pg_trgm: % uses incorrect comparison of similarity with the limit

From: contact(at)gregnavis(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #14202: pg_trgm: % uses incorrect comparison of similarity with the limit
Date: 2016-06-20 11:53:21
Message-ID: 20160620115321.5792.10766@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 14202
Logged by: Greg Navis
Email address: contact(at)gregnavis(dot)com
PostgreSQL version: 9.6beta1
Operating system: Fedora 23 (Linux 4.5.6-200.fc23.x86_64)
Description:

# Summary

Due to a bug gtrgm_consistent, using a GiST or GIN trigram index can return
extraneous rows whose trigram-similarity was below set_limit().

# Version tested

PostgreSQL Git commit: 4d48adc

$ gcc --version
gcc (GCC) 5.3.1 20160406 (Red Hat 5.3.1-6)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE.

$ uname -a
Linux localhost.localdomain 4.5.6-200.fc23.x86_64 #1 SMP Wed Jun 1 21:28:20
UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

# How to reproduce

Save the below to trgm_bug.sql:

-- Clean up after previous run.
DROP TABLE IF EXISTS restaurants;
DROP EXTENSION IF EXISTS pg_trgm CASCADE;

-- Create the table.
CREATE EXTENSION pg_trgm;
CREATE TABLE restaurants(city VARCHAR(255) NOT NULL);
CREATE INDEX restaurants_on_city_idx ON restaurants USING gist(city
gist_trgm_ops);

-- Insert 10000 rows to trigger index use.
INSERT INTO restaurants SELECT 'Warsaw' FROM generate_series(1, 10000);
INSERT INTO restaurants SELECT 'Szczecin' FROM generate_series(1, 10000);

-- Similarity of the two names (for reference).
SELECT similarity('Szczecin', 'Warsaw');

-- Set the limit to 0.3. Only Warsaw is returned (as it should).
SELECT set_limit(0.3);
SELECT DISTINCT city, similarity(city, 'Warsaw'), show_limit() FROM
restaurants WHERE city % 'Warsaw';

-- Raise the limit to 0.5. Now _both_ Warsaw and Szczecin are returned.
SELECT set_limit(0.5);
SELECT DISTINCT city, similarity(city, 'Warsaw'), show_limit() FROM
restaurants WHERE city % 'Warsaw';

and run "psql -f trgm_bug.sql".

# Results

The script returns the right result when the threshold is set to 0.3:

city | similarity | show_limit
--------+------------+------------
Warsaw | 1 | 0.3
(1 row)

However, it returns _both cities after rising the threshold to 0.5_:

city | similarity | show_limit
----------+------------+------------
Szczecin | 0 | 0.5
Warsaw | 1 | 0.5

# Root cause

The root cause is this line in contrib/pg_trgm/trgm_gist.c:

/* strange bug at freebsd 5.2.1 and gcc 3.3.3 */
res = (*(int *) &tmpsml == *(int *) &nlimit || tmpsml > nlimit);

nlimit is of type double. tmpsml is of type float4. On my system
sizeof(float4) == 4 and sizeof(double) == 8. After adding:

printf("tmpsml = %f tmpsml:int = %i nlimit = %f nlimit:int = %i res = %i\n",
tmpsml, *(int *)&tmpsml, nlimit, *(int *)&nlimit, res);

after the offending line I can see the following output (after passing
through "sort | uniq"):

tmpsml = 0.000000 tmpsml:int = 0 nlimit = 0.300000 nlimit:int = 858993459
res = 0
tmpsml = 0.000000 tmpsml:int = 0 nlimit = 0.500000 nlimit:int = 0 res = 1
tmpsml = 1.000000 tmpsml:int = 1065353216 nlimit = 0.300000 nlimit:int =
858993459 res = 1
tmpsml = 1.000000 tmpsml:int = 1065353216 nlimit = 0.500000 nlimit:int = 0
res = 1

On line 2, we can see that nlimit = 0.5 was converted to 0 which triggered
the error.

# Possible fix

If we're okay with abandoning the bug work-around for FreeBSD 5.2.1 / gcc
3.3.3 then I suggest replacing

res = (*(int *) &tmpsml == *(int *) &nlimit || tmpsml > nlimit);

with

res = tmpsml >= nlimit;

This resolves the issue.

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2016-06-20 14:15:17 Re: BUG #14202: pg_trgm: % uses incorrect comparison of similarity with the limit
Previous Message Tom Lane 2016-06-19 20:42:24 Re: Line intersection point is wrong