Quick Links

Re: pg_trgm word_similarity inconsistencies or bug

From:	Jan Przemysław Wójcik <jan(dot)przemyslaw(dot)wojcik(at)gmail(dot)com>
To:	Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc:	Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>, Cristiano Coelho <cristianocca(at)hotmail(dot)com>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject:	Re: pg_trgm word_similarity inconsistencies or bug
Date:	2017-10-31 13:02:27
Message-ID:	CA+YsLFp9scUNjjfYJC=84V9o-ZB66GWPq3e_Jveqs1bNejA4qA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs pgsql-hackers

2017-10-30 19:08 GMT+01:00 Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>:
>
> On Sat, Oct 28, 2017 at 11:22 AM, Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru> wrote:
>>
>> On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
>> > Hello all, this is related to postgres 9.6 (9.6.4) and a good description can be found here https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
>> >
>> > But in summary, word_similarity doesn’t seem to do exactly what the docs say, since it will match trigrams from multiple words rather tan doing a word by word comparison.
>> >
>> > Below is a table with output and expected output, thanks to kiln from stackoverflow to provide it.
>> >
>>
>> Interesting. An klin's answer from stackoverflow.com is right.
>>
>> The initial example can be reduced to the next:
>>
>> =# select word_similarity('sage', 'age sag');
>> word_similarity
>> -----------------
>> 1
>>
>> It computes maximum similarity using closest trigrams not considering order of
>> 'sage' trigrams. It determines that all
>> trigrams from 'sage' match trigrams from 'age sag'.
>>
>> Initial order of 'age sag' trigrams:
>> ' a', ' ag', 'age', 'ge ', ' s', ' sa', 'sag', 'ag '
>> ^ ^
>> |from |to
>> Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
>> continuously):
>> ' s', ' sa', 'age', 'ge ', 'sag'
>>
>> Maybe the problem should be solved by considering 'sage' trigrams
>> initial order.
>
>
> We searching for continuous extent of second string trigrams (in original orders) which has best similarity with first string trigrams.
> Possible solution could be forcing this extent boundaries to be at word boundaries. However, it would become less convenient to search for *part* of word. And we already have users adopt this feature.
> So, I see following solution:
> 1) Define GUC variable which specifies whether word_similarity() should force extent boundaries to be at word boundaries,
> 2) Document both cases of word_similarity() behavior.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>

Look at the example:

with data(word, string) as (
values
('sage', 'message'),
('sage', 'message s'),
('sage', 'message sa')
)

select
similarity(word, string),
word_similarity (word, string)
from data;

similarity | word_similarity
------------+-----------------
0.3 | 0.6
0.363636 | 0.8
0.454545 | 1
(3 rows)

When searching for a part of a word I would expect that the word
similarity is the same in all three rows. It's really strange that the
context of the second word (sa) makes the similarity equal to 1.

From a user's point of view it's also hard to understand why there is
such a big difference between similarity() and word_similarity(),
especially when comparing just two words (the first row).

I do not think the current function has any practical use.

------
Jan Przemysław Wójcik

2017-10-30 19:08 GMT+01:00 Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>:
> On Sat, Oct 28, 2017 at 11:22 AM, Arthur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
> wrote:
>>
>> On Fri, Oct 27, 2017 at 06:48:08PM +0000, Cristiano Coelho wrote:
>> > Hello all, this is related to postgres 9.6 (9.6.4) and a good
>> > description can be found here
>> > https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
>> >
>> > But in summary, word_similarity doesn’t seem to do exactly what the docs
>> > say, since it will match trigrams from multiple words rather tan doing a
>> > word by word comparison.
>> >
>> > Below is a table with output and expected output, thanks to kiln from
>> > stackoverflow to provide it.
>> >
>>
>> Interesting. An klin's answer from stackoverflow.com is right.
>>
>> The initial example can be reduced to the next:
>>
>> =# select word_similarity('sage', 'age sag');
>> word_similarity
>> -----------------
>> 1
>>
>> It computes maximum similarity using closest trigrams not considering
>> order of
>> 'sage' trigrams. It determines that all
>> trigrams from 'sage' match trigrams from 'age sag'.
>>
>> Initial order of 'age sag' trigrams:
>> ' a', ' ag', 'age', 'ge ', ' s', ' sa', 'sag', 'ag '
>> ^ ^
>> |from |to
>> Sorted 'sage' trigrams (all of them occured within 'age sag' trigrams
>> continuously):
>> ' s', ' sa', 'age', 'ge ', 'sag'
>>
>> Maybe the problem should be solved by considering 'sage' trigrams
>> initial order.
>
>
> We searching for continuous extent of second string trigrams (in original
> orders) which has best similarity with first string trigrams.
> Possible solution could be forcing this extent boundaries to be at word
> boundaries. However, it would become less convenient to search for *part*
> of word. And we already have users adopt this feature.
> So, I see following solution:
> 1) Define GUC variable which specifies whether word_similarity() should
> force extent boundaries to be at word boundaries,
> 2) Document both cases of word_similarity() behavior.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>

In response to

Re: pg_trgm word_similarity inconsistencies or bug at 2017-10-30 18:08:21 from Alexander Korotkov

Responses

Re: pg_trgm word_similarity inconsistencies or bug at 2017-10-31 15:25:03 from Alexander Korotkov
Fwd: pg_trgm word_similarity inconsistencies or bug at 2017-11-06 11:34:24 from Alexander Korotkov

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Alexander Korotkov	2017-10-31 15:25:03	Re: pg_trgm word_similarity inconsistencies or bug
Previous Message	Stefan Tzeggai	2017-10-31 12:09:05	Re: Segfault 11 on PG10 with max_parallel_workers_per_gather>3

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	David Rowley	2017-10-31 13:47:32	Re: Removing [Merge]Append nodes which contain a single subpath
Previous Message	Simon Riggs	2017-10-31 12:50:59	Re: MERGE SQL Statement for PG11