From: | r d <rd0002(at)gmail(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org, pgsql-bugs(at)postgresql(dot)org |
Subject: | fuzzystrmatch module buggy? observations |
Date: | 2012-10-30 13:29:09 |
Message-ID: | CALtFtELVR8pL72vC83Qq2Tdax1-28xCGfmW8imNgQjhYxjP7=A@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs pgsql-general |
The fuzzystrmatch module (
http://www.postgresql.org/docs/9.2/static/fuzzystrmatch.html) is currently,
as of 9.2.1, documented with the caution *"At present, the soundex,
metaphone, dmetaphone, and dmetaphone_alt functions do not work well with
multibyte encodings (such as UTF-8)"*.
While the venerable algorithms contained in the module *seem* to generally
work for Latin strings from European languages which all have
accented/diacritic characters such as äöüñáéíóúàèìòù, for languages with
non-Latin characters such as Kyrillic, Hebrew, Arabic, Chinese, these
venerable algorithms return NULL (empty) or plain weirdness.
Some examples:
dmetaphone ('Новости') = 'NN'
soundex ('Новости') = NULL
dmetaphone ('לפחות') = NULL
soundex ('לפחות') = NULL
soundex ('相关搜索') = NULL
dmetaphone ('相关搜索') = NULL
metaphone() crashes with SQL state: 42883 for all these strings (it tells
me I should cast the 'unknown' input).
The string 'äöüñáéíóúàèìòù' causes metaphone(), dmetaphone(),
dmetaphone_alt, soundex() to fail.
Only levenshtein() appears to function correctly with all above inputs,
even when I let it compare Hebrew against Chinese strings.
Summarizing my experience:
* for english (ASCII equivalent), the module works,
* for the rest of the Latin charsets (equivalent to ISO 8859-x) the module
works unreliably,
* for non-latin chars (UTF8 with 2-4 bytes per char) the module does not
work
Note: My DB and the OS are set up for UTF-8.
This would appear to be less a problem of Postgresql and the fuzzystrmach
module itself but because there
appear to exist no replacement algorithms adequate for a multilingual world
- at least that is my impression
after looking at the IPA and http://www.lt-world.org websites and branching
out from there.
Given all this I have no idea of this is a bug at all or the
state-of-the-art around this topic is inadequate.
Questions (to the developers):
- Is there anything in work or planned for the fuzzystrmatch module?
- Does anybody know about adequate replacements or upgrades of the soundex,
metaphone etc. algorithms from academia?
From | Date | Subject | |
---|---|---|---|
Next Message | Louis-Claude Canon | 2012-10-30 13:35:36 | Re: BUG #7623: Inconsistency on transaction isolation documentation |
Previous Message | dmigowski | 2012-10-30 11:49:21 | BUG #7629: Suboptimal query plan when index search is possible and an additional search operator is given. |
From | Date | Subject | |
---|---|---|---|
Next Message | Adrian Klaver | 2012-10-30 15:16:04 | Re: Select Query Implementation:Details needed |
Previous Message | Rodrigo Pereira da Silva | 2012-10-30 12:05:01 | Re: Too much clients connected to the PostgreSQL Database |