From: | "Euler Taveira" <euler(at)eulerto(dot)com> |
---|---|
To: | bosamia(dot)karan(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #18580: The pg_similarity appears to be wrong |
Date: | 2024-08-15 15:13:06 |
Message-ID: | f90f08b5-ba6d-4b39-9653-ede0f70a1be9@app.fastmail.com |
Lists: | pgsql-bugs |
On Mon, Aug 12, 2024, at 6:58 AM, PG Bug reporting form wrote:
> SELECT *
> FROM (
> SELECT
> *,
> similarity(provision_clean_description, 'Policies The General Partner
> shall promptly notify the Investor of any proposed changes in the Funds
> leverage policies including adjustments to leverage ratios') AS sim_tim
> FROM provision_database
> ) pd
> WHERE sim_tim <= 1 and sim_tim > 0.7 and firm_id=18;
>
> This both sentences giving similarity score as 1 despite the fact that the
> sentence 1. has Policies as the starting word(do not include the starting
> hyphen in the sentences):
> - Policies The General Partner shall promptly notify the Investor of any
> proposed changes in the Funds leverage policies including adjustments to
> leverage ratios
> - The General Partner shall promptly notify the Investor of any proposed
> changes in the Funds leverage policies including adjustments to leverage
> ratios
>
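This is not a bug.
That's how trigram matching works. The documentation [1] explains that
similarity is measured by counting the trigrams two strings have in common, so
the words don't need to be in the same order. Trigrams are extracted word by
word, ignoring non-alphanumeric characters, and they are case-insensitive. The
extra "Policies" at the start of the first sentence produces exactly the same
trigrams as the "policies" that both sentences already contain, so the two
trigram sets are identical and the similarity is 1. You can check the extracted
trigrams using the show_trgm() function.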
--
-- return the non-common trigrams
--
WITH a AS (
  SELECT x FROM unnest(show_trgm('Policies The General Partner shall promptly
notify the Investor of any proposed changes in the Funds leverage policies
including adjustments to leverage ratios')) x),
b AS (
  SELECT x FROM unnest(show_trgm('The General Partner shall promptly notify the
Investor of any proposed changes in the Funds leverage policies including
adjustments to leverage ratios')) x)
SELECT *
  FROM a FULL JOIN b ON (a.x = b.x)
 WHERE a.x IS NULL OR b.x IS NULL;
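If you want to see the mechanics without a database at hand, here is a rough
Python model of the behaviour described in the documentation. It is only a
sketch under assumptions: the trigrams() and similarity() helpers below are
hypothetical names written for this illustration, they handle ASCII text only,
and they are not pg_trgm's actual implementation. The model lowercases the
input, splits it into words on non-alphanumeric characters, pads each word with
two leading spaces and one trailing space, and compares the resulting sets of
distinct trigrams.

```python
import re


def trigrams(text):
    """Approximate the set of trigrams pg_trgm extracts (ASCII-only sketch)."""
    result = set()
    # Lowercase first, then treat every run of non-alphanumerics as a word break.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two spaces prefixed, one suffixed
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (assumed model)."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


s2 = ("The General Partner shall promptly notify the Investor of any proposed "
      "changes in the Funds leverage policies including adjustments to "
      "leverage ratios")
s1 = "Policies " + s2

print(sorted(trigrams("cat")))       # ['  c', ' ca', 'at ', 'cat']
print(trigrams(s1) == trigrams(s2))  # True: "Policies" adds no new trigram
print(similarity(s1, s2))            # 1.0
```

Because the comparison works on sets, a word that occurs a second time (in any
letter case) contributes nothing new, which is why the leading "Policies" does
not lower the score.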
[1] https://www.postgresql.org/docs/current/pgtrgm.html#PGTRGM-CONCEPTS
--
Euler Taveira
EDB https://www.enterprisedb.com/