Re: BUG #13440: unaccent does not remove all diacritics

From: Léonard Benedetti <benedetti(at)mlpo(dot)fr>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Gradek <mike(at)busbud(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13440: unaccent does not remove all diacritics
Date: 2016-01-24 03:18:07
Message-ID: 56A4426F.2040108@mlpo.fr
Lists: pgsql-bugs

On 19/06/2015 04:00, Thomas Munro wrote:
> On Fri, Jun 19, 2015 at 7:30 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I took a quick look at this list and it seems fairly sane as far as
>> the automatically-generated items go, except that I see it hits a few
>> LIGATURE cases (including the existing ij cases, but also fi fl and
>> ffl). I'm still quite dubious that that is appropriate; at least, if
>> we do it I think we should be expanding out to the equivalent
>> multi-letter form, not simply taking one of the letters and dropping
>> the rest. Anybody else have an opinion on how to handle ligatures?
> Here is a version that optionally expands ligatures if asked to with
> --expand-ligatures. It uses the Unicode 'general category' data to
> identify and strip diacritical marks and distinguish them from
> ligatures which are expanded to all their parts. It meant I had to
> load a bunch of stuff into memory up front, but this approach can
> handle an awkward bunch of ligatures whose component characters have
> marks: DŽ, Dž, dž -> DZ, Dz, dz. (These are considered to be single
> characters to maintain a one-to-one mapping with certain Cyrillic
> characters in some Balkan countries which use or used both scripts.)
>
> As for whether we *should* expand ligatures, I'm pretty sure that's
> what I'd always want, but my only direct experience of languages with
> ligatures as part of the orthography (rather than ligatures as a
> typesetting artefact like ffl et al) is French, where œ is used in the
> official spelling of a bunch of words like œil, sœur, cœur, œuvre when
> they appear in books, but substituting oe is acceptable on computers
> because neither the standard French keyboard nor the historically
> important Latin1 character set includes the character. I'm fairly
> sure the Dutch have a similar situation with IJ, it's completely
> interchangeable with the sequence IJ.
>
> So +1 from me for ligature expansion. It might be tempting to think
> that a function called 'unaccent' should only remove diacritical
> marks, but if we are going to be pedantic about it, not all
> diacritical marks are actually accents anyway...
>
>> The manually added special cases don't look any saner than they did
>> before :-(. Anybody have an objection to removing those (except maybe
>> dotless i) in HEAD?
> +1 from me for getting rid of the bogus œ->e, IJ -> I, ... transformations, but:
>
> 1. For some reason œ, æ (and uppercase equivalents) don't have
> combining character data in the Unicode file, so they still need to be
> treated as special cases if we're going to include ligatures. Their
> expansion should of course be oe and ae rather than what we have.
> 2. Likewise ß still needs special treatment (it may be historically
> composed of sz but Unicode doesn't know that, it's its own character
> now and expands to ss anyway).
> 3. I don't see any reason to drop the Afrikaans ʼn, though it should
> surely be expanded to 'n rather than n.
> 4. I have no clue about whether the single Cyrillic item in there
> belongs there.
>
> Just by the way, there are conventional rules for diacritic removal in
> some languages, like ä, ö, ü -> ae, oe, ue in German, å -> aa in
> Scandinavian languages and è -> e' in Italian. A German friend of
> mine has a ü in his last name and he finishes up with any of three
> possible spellings of his name on various official documents, credit
> cards etc as a result! But these sorts of things are specific to
> individual languages and don't belong in a general accent removal rule
> file (it would be inappropriate to convert French aigüe to aiguee or
> Spanish pingüino to pingueino). I guess speakers of those languages
> could consider submitting rules files for language-specific
> conventions like that.
>
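Before getting to my proposal, here is a minimal sketch of the general-category approach Thomas describes above, using Python's standard unicodedata module (the attached script works from UnicodeData.txt directly, so the details differ):

    import unicodedata

    def strip_marks_and_expand(char):
        # Decompose the character, drop combining marks (categories Mn/Mc/Me),
        # and keep the remaining base letters, so ligatures expand to all of
        # their parts.
        decomposed = unicodedata.normalize('NFKD', char)
        return ''.join(c for c in decomposed
                       if not unicodedata.category(c).startswith('M'))

    for ch in ['é', 'ﬄ', 'Ǆ']:
        print(ch, '->', strip_marks_and_expand(ch))   # e, ffl, DZ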
I use "unaccent" and I am very pleased with the applied patches for the
default rules and the Python script to generate them.

But as you pointed out, the "extra cases" (the subset of characters
which is not generated by the script, but hardcoded) are pretty
disturbing. The main problem to me is that it lacks a number of "extra
cases". In fact, the script manages arbitrarily few ligatures but leaves
many things aside. So I looked for a way to improve the generation, to
avoid having this trouble.

As you said, some characters don't have a Unicode decomposition. To handle
all these cases, we can use the standard Unicode Latin-ASCII transliterator
(available in the CLDR), which maps Unicode characters to ASCII-range
equivalents. This approach seems much more elegant: it avoids hardcoded
cases, and the transliterations are semantically correct (or at least as
correct as they can be).
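To make this concrete, here is a rough sketch (not the attached script itself) of how the simple "character → ASCII" rules can be extracted from the Latin-ASCII XML file. It assumes the rules live in a <tRule> element and use the ICU "source → target ;" syntax, and it simply skips anything more complex (contextual rules, variables, character classes):

    import re
    import xml.etree.ElementTree as ET

    def parse_latin_ascii(xml_path):
        rules_text = ET.parse(xml_path).getroot().find('.//tRule').text
        mapping = {}
        for line in rules_text.splitlines():
            line = line.split('#', 1)[0].strip()        # drop comments
            match = re.match(r"^(\S+)\s*→\s*(.*?)\s*;$", line)
            if not match:
                continue                                # skip complex rules
            source, target = match.group(1), match.group(2).strip("'")
            if len(source) == 1 and all(ord(c) < 128 for c in target):
                mapping[source] = target
        return mapping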

So I modified the script: the command-line arguments are now used to pass
the file path of the transliterator (available as an XML file in the
Unicode Common Locale Data Repository). You will find attached the new
script and, for convenience, the generated output; I will also propose a
patch for the Commitfest. Note that the script now takes (at most) two
input files: UnicodeData.txt and, optionally, the XML file of the
transliterator.
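For reference, the decomposition data in UnicodeData.txt is easy to read; here is a minimal sketch (the attached script applies more filtering than this):

    def read_decompositions(unicode_data_path):
        # Field 0 is the code point (hex), field 5 the decomposition mapping,
        # optionally prefixed by a compatibility tag such as <compat>.
        decompositions = {}
        with open(unicode_data_path) as f:
            for line in f:
                fields = line.split(';')
                decomposition = fields[5].split()
                if not decomposition:
                    continue
                if decomposition[0].startswith('<'):
                    decomposition = decomposition[1:]
                decompositions[int(fields[0], 16)] = [int(p, 16)
                                                      for p in decomposition]
        return decompositions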

By the way, I took the opportunity to make the script more user-friendly
through several surface changes. There is now very light support for
command-line arguments, with help messages. The text file used to be
passed to the script on standard input; that approach is not appropriate
when two files must be used, so, as mentioned above, the command-line
arguments are now used to pass the paths.
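For illustration, the interface looks roughly like this (the option names are placeholders here, not necessarily those of the attached script):

    import argparse

    parser = argparse.ArgumentParser(
        description='Generate an unaccent.rules file from Unicode data.')
    parser.add_argument('--unicode-data', required=True,
                        help='path to UnicodeData.txt')
    parser.add_argument('--latin-ascii', default=None,
                        help='path to the CLDR Latin-ASCII transliterator XML; '
                             'omit it to generate a smaller rules file without '
                             'the transliterator')
    args = parser.parse_args()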

Finally, using this transliterator inevitably increases the number of
characters handled. I do not think this is a problem (1044 characters are
handled), quite the contrary, and after several tests on index generation
I see no significant performance difference. Nonetheless, the
transliterator remains optional and a command-line option is available to
disable it (so one can easily generate a smaller rules file if desired).
It seemed logical to me, however, to keep it enabled by default: that is,
a priori, the desired behavior.

Léonard Benedetti

Attachment Content-Type Size
unaccent.rules text/plain 6.2 KB
contrib_unaccent_generate_unaccent_rules.py text/x-python 8.5 KB
