From: | Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru> |
---|---|
To: | Jan Przemysław Wójcik <jan(dot)przemyslaw(dot)wojcik(at)gmail(dot)com>, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>, Cristiano Coelho <cristianocca(at)hotmail(dot)com>, pgsql-bugs(at)postgresql(dot)org, François CHAHUNEAU <Francois(dot)CHAHUNEAU(at)numen(dot)fr>, Thierry BOUDIERE <Thierry(dot)BOUDIERE(at)numen(dot)fr> |
Subject: | Fwd: pg_trgm word_similarity inconsistencies or bug |
Date: | 2017-11-06 11:34:24 |
Message-ID: | CAPpHfdsDtQJ+pYfapH7GMt9n4RLByKMzHEBWcPL99C_g+rkUKQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs pgsql-hackers |
Hi!
I'd like to forward a feedback from our customer who uses word_similarity()
function.
François finds current behavior of word_similarity() to be useful. Thus, I
think we should preserve it. But documentation correction is needed and
option for alternative behavior would be useful too.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
---------- Forwarded message ----------
From: François CHAHUNEAU <Francois(dot)CHAHUNEAU(at)numen(dot)fr>
Date: Wed, Nov 1, 2017 at 1:04 AM
Subject: RE: [BUGS] pg_trgm word_similarity inconsistencies or bug
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: Thierry BOUDIERE <Thierry(dot)BOUDIERE(at)numen(dot)fr>, "foli(at)numen(dot)mg" <
foli(at)numen(dot)mg>
Hello Alexander,
We agree that the current pg_trgm documentation does not correctly reflect
the de facto behavior of word_similarity(), and that something has to be
changed. But to us, it is more a documentation problem than anything else.
What is computed is still « substring_similarity » as was initially
specified between us, but it is influenced by a strong word boundary bias
caused by the way trigrams are padded at word boundaries. This bias was
noticed by early reviewers and you explained that this motivated the name
switch to « word_similarity ». As you will remember, at the time we
discovered this, we were suprised because we considerd this as a slight
misnomer. Indeed, what is currently described in the 9.6 pg_trgm
documentation is inaccurate (although seemingly consistent with this new
name) and has to be amended.
Now, word_similarity() has been out for more than a year and, of course, it
is preferable to avoid any breaking changes… In our case, we consider the
name « unfortunate » and the explanation buggy, not the function itself.
As you may remember from the initial discussion, some other users stressed
the importance to be able to matchsub strings. We tend to agree with what
Jeff Janes wrote in this discussion :
The reason I like the option of not treating word boundaries as
special in this case is that often in scientific vocabulary, and in
catalog part numbers, people are pretty inconsistent about whether
they included spaces. "HEK 293", "HEK293", and "HEK-293" could be all
the same thing. So I like to strip out spaces and punctuation on both
sides of operator. Of course I can't do that if there are invisible
un-removable spaces on the substring side.
But, It doesn't sound like I am going to win that debate. Given that,
I don't think we need a different name for the function. I'm fine with
explaining the word-boundary subtlety in the documentation, and
keeping the function name itself simple.
Now, considering your proposal :
As far as we are concerned, we use <% and %> everyday for efficient fuzzy
matching on large databases. Our typical usage scenario is matching noisy
OCRized text strings against reference databases.
*> 1) Define GUC variable which specifies whether word_similarity() should
force extent boundaries to be at word boundaries,*
Ok for us,* iff* default behavior remains the same as now, for backward
compatibility reasons. We could take advantage, *in some cases*, of the new
« word rounded » behavior controlled by the GUC variable, but this would
not cover all scenarios currently in use.
> 2*) Document both cases of word_similarity() behavior.*
This is clearly needed anyway.
Best regards,
*François CHAHUNEAU*
Directeur des technologies
NUMEN DIGITAL| 24, rue Marc Seguin
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>
75018
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>
Paris
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>
France
<https://maps.google.com/?q=24,+rue+Marc+Seguin+75018+Paris+France&entry=gmail&source=g>*
| www.numen.fr
<https://numen.letsignit.com/r/0/991c6b92-d8fe-4afa-95f5-7b74d0322fd9>*
Tel +33 1 40 37 95 03 <+33%201%2040%2037%2095%2003> | Mob +33 6 07 85 21 79
<+33%206%2007%2085%2021%2079> | Fax +33 1 40 37 94 94
<+33%201%2040%2037%2094%2094>
<https://numen.letsignit.com/r/15/57dd0ced-dea8-441a-a066-68bf7cedbecd>
<https://numen.letsignit.com/r/3/9be1fd6e-57d8-4963-bcc7-03151b263433> Pensez
vert, n’imprimez que nécessaire. Les informations contenues dans le présent
e-mail sont exclusivement adressées au(x) destinataire(s) de ce message et
peuvent contenir des informations confidentielles, protégées par un secret
professionnel. L’utilisation de ces informations par d’autres personnes que
le(s) destinataire(s) est strictement interdite. Si vous n’êtes pas
destinataire de ce message, la publication, la reproduction, la diffusion
et /ou la distribution de ces informations auprès de tiers n’est pas
autorisée. Si vous avez reçu cet e-mail par erreur, veuillez nous en
informer immédiatement, détruire l'email, ses copies et documents joints et
le supprimer.
*De :* Alexander Korotkov [mailto:a(dot)korotkov(at)postgrespro(dot)ru]
*Envoyé :* mardi 31 octobre 2017 16:18
*À :* Thierry BOUDIERE <Thierry(dot)BOUDIERE(at)numen(dot)fr>; François CHAHUNEAU <
Francois(dot)CHAHUNEAU(at)numen(dot)fr>
*Objet :* Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug
Dear, Thierry and François!
PostgreSQL users found inconsistency between documentation and
implementation of word_similarity().
Possible solution proposed by the reporter is to alter the implementation.
But it's important for me that your interests are not affected but
potential further change of implementation of word_similarity().
Could you please share your opinion on changes proposed by Jan in the
pgsql-bugs mailing list?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2017-11-06 14:43:38 | Re: BUG #14889: explain analyze is taking much more time than actual execution |
Previous Message | David Rowley | 2017-11-06 08:08:59 | Re: BUG #14889: explain analyze is taking much more time than actual execution |
From | Date | Subject | |
---|---|---|---|
Next Message | Raúl Marín Rodríguez | 2017-11-06 11:41:06 | Re: pow support for pgbench |
Previous Message | Alexander Korotkov | 2017-11-06 11:14:51 | Re: Display number of heap accesses for index scans |