From: | Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru> |
---|---|
To: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru> |
Cc: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Fuzzy substring searching with the pg_trgm extension |
Date: | 2016-02-01 17:12:03 |
Message-ID: | 56AF91E3.3010908@postgrespro.ru |
Lists: | pgsql-hackers |
On 29.01.2016 18:58, Artur Zakirov wrote:
> On 29.01.2016 18:39, Alvaro Herrera wrote:
>> Teodor Sigaev wrote:
>>>> The behavior of this function is surprising to me.
>>>>
>>>> select substring_similarity('dog' , 'hotdogpound') ;
>>>>
>>>> substring_similarity
>>>> ----------------------
>>>> 0.25
>>>>
>>> Substring search was designed to find a similar word in a string:
>>> contrib_regression=# select substring_similarity('dog' , 'hot dogpound') ;
>>> substring_similarity
>>> ----------------------
>>> 0.75
>>>
>>> contrib_regression=# select substring_similarity('dog' , 'hot dog pound') ;
>>> substring_similarity
>>> ----------------------
>>> 1
>>
>> Hmm, this behavior looks too much like magic to me. I mean, a substring
>> is a substring -- why are we treating the space as a special character
>> here?
>>
>
> I think I can rename this function to subword_similarity() and correct
> the documentation.
>
> The current behavior is designed to find the most similar word in a text.
> For example, if we searched for just a substring (not a word), we would
> get the following result:
>
> select substring_similarity('dog', 'dogmatist');
> substring_similarity
> ---------------------
> 1
> (1 row)
>
> But I think this is wrong: they are completely different words.
>
> To search for a similar substring (not a word) in a text, maybe another
> function should be added?
>
I have changed the patch:
1 - trgm2.data was corrected; duplicates were removed.
2 - I have added the operators <<-> and <->> with GiST index support. The
regression test will pass only with the patch
http://www.postgresql.org/message-id/CAPpHfdt19FwQXarYjkzxb3oxmv-KAn3FLuZrooARE_U3H3CV9g@mail.gmail.com
3 - the function substring_similarity() was renamed to subword_similarity().
There is no substring_similarity_pos() function yet, though. It is not
trivial.
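
To make the numbers quoted above easier to follow, here is a minimal Python
sketch. It is not the code from the patch; it is a simplification that
assumes each word is padded with two spaces in front and one space behind
before trigrams are extracted, and that the score is the fraction of the
query's trigrams found in the text. The names trigrams() and coverage() and
the pad parameter are hypothetical. With padding, the space acts as a word
boundary; without padding (my guess at what "just substring" would mean),
'dog' matches inside 'dogmatist' completely.

```python
import re


def trigrams(text, pad=True):
    """Return the set of trigrams of every word in text.

    Assumption: with pad=True each word gets two leading spaces and one
    trailing space, which is how word boundaries end up inside trigrams.
    With pad=False only the raw letters of each word are used.
    """
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " " if pad else word
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def coverage(query, text, pad=True):
    """Fraction of the query's trigrams that also occur in the text.

    This is a simplified stand-in for the function discussed in the
    thread, not its actual algorithm.
    """
    query_trigrams = trigrams(query, pad)
    if not query_trigrams:
        return 0.0
    return len(query_trigrams & trigrams(text, pad)) / len(query_trigrams)


if __name__ == "__main__":
    for text in ("hotdogpound", "hot dogpound", "hot dog pound", "dogmatist"):
        print(text, coverage("dog", text), coverage("dog", text, pad=False))
```

Under these assumptions the padded variant gives 0.25, 0.75 and 1.0 for the
three 'hot dog pound' strings, and 0.75 for 'dogmatist', while the unpadded
variant gives 1.0 for 'dogmatist'.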
--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
pg_trgm_guc_v2.patch | text/x-patch | 8.8 KB |
pg_trgm_subword_v5.patch | text/x-patch | 111.3 KB |