From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Léonard Benedetti <benedetti(at)mlpo(dot)fr> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Gradek <mike(at)busbud(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org> |
Subject: | Re: BUG #13440: unaccent does not remove all diacritics |
Date: | 2016-01-25 23:44:50 |
Message-ID: | CAEepm=3Th+3XRiOoXewLvL1DybCbKxjc0FE4o6XqaZZBLUSOvg@mail.gmail.com |
Lists: | pgsql-bugs |
On Sun, Jan 24, 2016 at 4:18 PM, Léonard Benedetti <benedetti(at)mlpo(dot)fr> wrote:
> I use "unaccent" and I am very pleased with the applied patches for the
> default rules and the Python script to generate them.
>
> But as you pointed out, the "extra cases" (the subset of characters
> which is not generated by the script, but hardcoded) are pretty
> disturbing. The main problem to me is that it lacks a number of "extra
> cases". In fact, the script manages arbitrarily few ligatures but leaves
> many things aside. So I looked for a way to improve the generation, to
> avoid having this trouble.
>
> As you said, some characters don't have Unicode decomposition. So, to
> handle all these cases, we can use the standard Unicode transliterator
> Latin-ASCII (available in CLDR), it associates Unicode characters to
> ASCII-range equivalent. This approach seems much more elegant, this
> avoids hardcoded cases and transliterations are semantically correct (at
> least, as much as they can).
Wow. It would indeed be nice to use this dataset rather than
maintaining the special cases for œ et al. It would also be nice to pick
up all those other things like ©, ½, …, ≪, ≫ (though these stray a
little bit further from the functionality implied by unaccent's name).
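To make the idea concrete, here is a rough sketch of how "source → target ;" style rules could be turned into an unaccent-style mapping. The sample rules are written for illustration in the general style of CLDR transform rules, not copied from the real Latin-ASCII file, and `parse_rules` is a hypothetical helper rather than anything in the attached script.

```python
import re

# Illustrative sample only: written in the style of CLDR transform rules,
# NOT copied from the real Latin-ASCII transliterator file.
SAMPLE_RULES = """\
# a comment line, which the parser should skip
œ → oe ;
Œ → OE ;
© → '(C)' ;
½ → ' 1/2' ;
"""

# Assumed rule shape: one source token, an arrow, then either a
# single-quoted target or a bare target, terminated by " ;".
RULE = re.compile(r"^(\S+) → (?:'(.+)'|(\S+)) ;$")


def parse_rules(text):
    """Return a dict mapping each source string to its replacement."""
    mapping = {}
    for line in text.splitlines():
        match = RULE.match(line.strip())
        if match is None:
            continue  # comments, blank lines, or rules of another shape
        src, quoted, bare = match.groups()
        mapping[src] = quoted if quoted is not None else bare
    return mapping


if __name__ == "__main__":
    for src, trg in sorted(parse_rules(SAMPLE_RULES).items()):
        print(f"{src}\t{trg}")
```

Anything that does not match the assumed shape is silently skipped here, so a real generator would probably want to report such lines instead.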
I don't think this alone will completely get rid of the hardcoded
special cases though, because we have these two mappings which look
like Latin but are in fact Cyrillic and I assume we need to keep them:
Ё Е
ё е
Should we extend the composition data analysis to make these remaining
special cases go away? We'd need a definition of is_plain_letter that
returns True for U+0415 (Е) so that U+0401 (Ё) can be recognised as
U+0415 + U+0308 (combining diaeresis).
Depending on how you do that, you could sweep in some more Cyrillic
mappings and a ton of stuff from other scripts that have precomposed
diacritic codepoints (Greek, Hebrew, Arabic, ...?), and we'd need
someone with knowledge of relevant languages to sign off on the result
-- so it might make sense to stick to a definition that includes just
Latin and Cyrillic for now.
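As a minimal sketch of what I mean, assuming we identify scripts by the Unicode character name prefix (a shortcut for illustration, not how the current script does it), something like the following would recognise Ё as Е plus a combining mark while leaving Greek alone. `PLAIN_SCRIPTS` and `strip_marks` are hypothetical names.

```python
import unicodedata

# Assumption for illustration: a letter's script is guessed from the
# prefix of its Unicode character name. Only these scripts are accepted.
PLAIN_SCRIPTS = ("LATIN", "CYRILLIC")


def is_plain_letter(ch):
    """True for a cased letter with no decomposition from an accepted script."""
    if unicodedata.category(ch) not in ("Lu", "Ll"):
        return False
    if unicodedata.decomposition(ch):
        return False  # it decomposes further, so it is not "plain"
    return unicodedata.name(ch, "").startswith(PLAIN_SCRIPTS)


def strip_marks(ch):
    """Return the plain base letter if ch is base + combining marks, else None."""
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    if marks and is_plain_letter(base) and all(
        unicodedata.combining(m) for m in marks
    ):
        return base
    return None


if __name__ == "__main__":
    for ch in ("\u0401", "\u0451", "\u00e9", "\u03ac", "\u0153"):
        print(f"U+{ord(ch):04X} -> {strip_marks(ch)!r}")
```

With this definition U+0401 and U+0451 map to U+0415 and U+0435, Greek ά is left unmapped because its base letter is outside the accepted scripts, and œ is left unmapped because it has no decomposition at all.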
(Otherwise it might be tempting to use *only* the transliterator
approach, but CLDR doesn't seem to have appropriate transliterator
files for other scripts. They have for example Cyrillic -> Latin, but
we'd want Cyrillic -> some-subset-of-Cyrillic, analogous to Latin ->
ASCII.)
> So, I modified the script: the arguments of the command line are used to
> pass the file path of the transliterator (available as an XML file in
> Unicode Common Locale Data Repository), so you find attached the new
> script and the generated output for convenience, I will also propose a
> patch for Commitfest. Note that the script now takes (at most) two input
> files: UnicodeData.txt and (optionally) the XML file of the transliterator.
>
> By the way, I took the opportunity to make the script more user-friendly
> by several surface changes. There is now a very light support for
> command line arguments with help messages. The text file was, before,
> passed to the script on standard input; this approach is not appropriate
> when two files must be used. So as I mentioned, the arguments of the
> command line are now used to pass the paths.
>
> Finally, using this transliterator inevitably increases the number
> of characters handled. I do not think that is a problem (there are 1044
> characters handled), on the contrary, and after several tests on index
> generation, I saw no significant performance difference. Nonetheless,
> using the transliterator remains optional and a command line option is
> available to disable it (so one can easily generate a small rules file,
> if desired). It seemed however logical to me to keep it on by default:
> that is, a priori, the desired behavior.
+1
--
Thomas Munro
http://www.enterprisedb.com