From: | Bruce Momjian <bruce(at)momjian(dot)us> |
---|---|
To: | r d <rd0002(at)gmail(dot)com> |
Cc: | pgsql-general(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: [GENERAL] fuzzystrmatch module buggy? observations |
Date: | 2012-11-06 22:43:50 |
Message-ID: | 20121106224350.GA24956@momjian.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs pgsql-general |
On Tue, Oct 30, 2012 at 02:29:09PM +0100, r d wrote:
> The fuzzystrmatch module (http://www.postgresql.org/docs/9.2/static/
> fuzzystrmatch.html) is currently, as of 9.2.1, documented with the caution "At
> present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do
> not work well with multibyte encodings (such as UTF-8)".
>
> While the venerable algorithms contained in the module seem to generally work
> for Latin strings from European languages which all have accented/diacritic
> characters such as äöüñáéíóúàèìòù, for languages with non-Latin characters such
> as Kyrillic, Hebrew, Arabic, Chinese, these venerable algorithms return NULL
> (empty) or plain weirdness.
>
> Some examples:
>
> dmetaphone ('Новости') = 'NN'
> soundex ('Новости') = NULL
>
> dmetaphone ('לפחות') = NULL
> soundex ('לפחות') = NULL
>
> soundex ('相关搜索') = NULL
> dmetaphone ('相关搜索') = NULL
>
> metaphone() crashes with SQL state: 42883 for all these strings (it tells me I
> should cast the 'unknown' input).
>
> The string 'äöüñáéíóúàèìòù' causes metaphone(), dmetaphone(), dmetaphone_alt,
> soundex() to fail.
>
> Only levenshtein() appears to function correctly with all above inputs, even
> when I let it compare Hebrew against Chinese strings.
>
> Summarizing my experience:
> * for english (ASCII equivalent), the module works,
> * for the rest of the Latin charsets (equivalent to ISO 8859-x) the module
> works unreliably,
> * for non-latin chars (UTF8 with 2-4 bytes per char) the module does not work
>
> Note: My DB and the OS are set up for UTF-8.
>
> This would appear to be less a problem of Postgresql and the fuzzystrmach
> module itself but because there
> appear to exist no replacement algorithms adequate for a multilingual world -
> at least that is my impression
> after looking at the IPA and http://www.lt-world.org websites and branching out
> from there.
This is a very good summary. I was not aware of all these behaviors.
> Given all this I have no idea of this is a bug at all or the state-of-the-art
> around this topic is inadequate.
I have no idea either.
> Questions (to the developers):
> - Is there anything in work or planned for the fuzzystrmatch module?
> - Does anybody know about adequate replacements or upgrades of the soundex,
> metaphone etc. algorithms from academia?
I have not heard of anyone working in this area. What usually happens
is some expert in the field shows up and submits a patch to improve it.
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
From | Date | Subject | |
---|---|---|---|
Next Message | Denis Feklushkin | 2012-11-07 02:30:28 | Bug reporting form :-( |
Previous Message | Merlin Moncure | 2012-11-06 21:30:11 | Re: Introducing floating point cast into filter drastically changes row estimate |
From | Date | Subject | |
---|---|---|---|
Next Message | Mike Lewis | 2012-11-06 22:55:40 | Comparing txid_current() to xmin |
Previous Message | Vick Khera | 2012-11-06 19:33:43 | Re: Memory issue on FreeBSD |