Re: strange behavior of pg_trgm's similarity function

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: strange behavior of pg_trgm's similarity function
Date: 2013-10-10 12:12:46
Message-ID: 525699BE.9040508@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10.10.2013 15:03, Fujii Masao wrote:
> Hi,
>
> The behavior of pg_trgm's similarity function seems strange. Is this
> intentional?
>
> I was thinking that the following three calls of the similarity function return
> the same number because the second argument is just the three characters
> contained in the first argument in every calls.
>
> =# SELECT similarity('12345', '123');
> =# SELECT similarity('12345', '234');
> =# SELECT similarity('12345', '345');
>
> But that's not true. Each returns the different number.
>
> =# SELECT similarity('12345', '123');
> similarity
> ------------
> 0.428571
> (1 row)
>
> =# SELECT similarity('12345', '234');
> similarity
> ------------
> 0.111111
> (1 row)
>
> =# SELECT similarity('12345', '345');
> similarity
> ------------
> 0.25
> (1 row)
>
> This happens because, for example, similarity('12345', '123') returns
> the similarity number of '**12345*' and '**123*' (* means the blank character),
> NOT '12345' and '123'. IOW, two and one blank characters are added into
> the heading and tailing of each argument, respectively. I wonder why
> pg_trgm's similarity function works in this way. We should change this
> so that no blank characters are added into the arguments?

Well, you could also argue that "111111" and "222222" are quite similar,
even though pg_trgm's similarity will not think so. It comes down to the
definition of similarity, and how well that definition matches your
intuition.

FWIW, it feels right to me that a match in the beginning of a word is
worth more than one in the middle of a string. -1 on changing that.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2013-10-10 12:31:20 Re: Auto-tuning work_mem and maintenance_work_mem
Previous Message Fujii Masao 2013-10-10 12:03:26 strange behavior of pg_trgm's similarity function