
Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

From:	Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To:	Jan Przemysław Wójcik <jan(dot)przemyslaw(dot)wojcik(at)gmail(dot)com>, Cristiano Coelho <cristianocca(at)hotmail(dot)com>
Cc:	pgsql-bugs(at)postgresql(dot)org, François CHAHUNEAU <Francois(dot)CHAHUNEAU(at)numen(dot)fr>, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
Date:	2017-12-07 13:38:59
Message-ID:	CAPpHfdtJ+JdeKUqBCOP_nHoDGs8iPsZSywUGJftLxOofehb96w@mail.gmail.com
Lists:	pgsql-bugs pgsql-hackers

On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov <
a(dot)korotkov(at)postgrespro(dot)ru> wrote:

> On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <
> jan(dot)przemyslaw(dot)wojcik(at)gmail(dot)com> wrote:
>
>> my statement about the function usefulness was probably too categorical,
>> though I had in mind the current name of the function.
>>
>> I'm afraid that creating a function that implements quite different
>> algorithms depending on a global parameter seems very hacky and would lead
>> to misunderstandings. I do understand the need of backward compatibility,
>> but I'd opt for the lesser evil. Perhaps a good idea would be to change
>> the
>> name to 'substring_similarity()' and introduce the new function
>> 'word_similarity()' later, for example in the next major version release.
>>
>
> Good point. I've no complaints about that. I'm going to propose
> corresponding patch to the next commitfest.
>

I've written a draft patch to fix this inconsistency. Please find it
attached. The patch doesn't contain proper documentation and
comments yet.

I've called the existing behavior subset_similarity(). I didn't use the name
substring_similarity(), because it doesn't really look for a substring
with appropriate padding, but rather searches for a continuous subset of
trigrams. For index search over subset similarity, the %>>, <<%, <->>>, <<<->
operators are provided. I've added an extra arrow sign to denote that these
operators look deeper into the string.

At the same time, word_similarity() now forces extent bounds to be word
bounds. word_similarity() now behaves similarly to the my_word_similarity()
function proposed on Stack Overflow.

The only difference is in the 'message s' row, because word_similarity()
allows matching one word to two or more, while my_word_similarity() doesn't
allow that. In this case word_similarity() returns the similarity between
'sage' and 'message s'.

# select similarity('sage', 'message s');
similarity
------------
0.363636
(1 row)
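
For reference, the 0.363636 value can be reproduced by hand. The sketch
below is only an illustration in Python, not pg_trgm code: it assumes the
documented trigram rules (lowercase, split on non-alphanumeric characters,
pad each word with two leading spaces and one trailing space) and computes
shared trigrams divided by the union. The names trigrams() and similarity()
are chosen here for illustration.

```python
import re

def trigrams(text):
    """Set of padded trigrams, following pg_trgm's documented rules (assumed)."""
    result = set()
    # Words are runs of alphanumeric characters; anything else is a separator.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "  # two spaces before, one after
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result

def similarity(a, b):
    """Shared trigrams divided by the union of both trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# 'sage' has 5 trigrams, 'message s' has 10, and 4 are shared: 4 / 11.
print(round(similarity("sage", "message s"), 6))
```

Here the shared trigrams are "  s", "sag", "age" and "ge ", so the ratio is
4 / (5 + 10 - 4) = 4 / 11.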

I think the behavior of word_similarity() is better here, because a typo can
break a word into two.
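
To make the word-bounded extent idea concrete, here is a brute-force sketch
in Python. It is an assumption-laden approximation, not the algorithm in the
patch: it simply tries every contiguous run of whole words in the second
string and keeps the run whose trigram set is most similar to the query. The
helper names (word_trigrams, jaccard, best_word_extent) are hypothetical.

```python
import re

def word_trigrams(word):
    """Trigrams of one word, padded with two leading spaces and one trailing."""
    padded = "  " + word + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a, b):
    """Shared elements divided by the union; 0.0 if either set is empty."""
    return len(a & b) / len(a | b) if a and b else 0.0

def split_words(text):
    """Lowercased alphanumeric runs (assumed word definition)."""
    return re.findall(r"[^\W_]+", text.lower())

def best_word_extent(query, text):
    """Return (score, extent) for the best contiguous run of whole words."""
    query_set = set().union(*map(word_trigrams, split_words(query)))
    words = split_words(text)
    best = (0.0, "")
    for start in range(len(words)):
        extent_set = set()
        for end in range(start, len(words)):
            # Grow the extent one whole word at a time.
            extent_set |= word_trigrams(words[end])
            score = jaccard(query_set, extent_set)
            if score > best[0]:
                best = (score, " ".join(words[start:end + 1]))
    return best

score, extent = best_word_extent("sage", "message s")
print(extent, round(score, 6))
```

Under these assumptions the candidate extents score 0.3 for 'message',
about 0.167 for 's', and 4/11 for 'message s', so the two-word extent wins.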

I also wonder whether word_similarity() and subset_similarity() should share
the same threshold value for indexed search. subset_similarity() typically
returns higher values than word_similarity(). Thus, it probably makes
sense to split their threshold values.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
pg-trgm-word-subset-similarity-1.patch	application/octet-stream	57.5 KB

