From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Nguyen Le Hoang Kha <nlhkha(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Extra Vietnamese unaccent rules |
Date: | 2017-05-26 18:19:37 |
Message-ID: | CAEepm=39zN5tkbWPVUMifK9uk+rVkyEaXDs-y+DO2R+CtUUEBA@mail.gmail.com |
Lists: | pgsql-hackers |
On Sat, May 27, 2017 at 5:13 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> Nguyen Le Hoang Kha <nlhkha(at)gmail(dot)com> writes:
>>> Most of the time in Vietnamese language, there are up to 2 accents in a
>>> character. These unaccent rules are added to handle such cases (which are
>>> very common).
>
>> I can't see any reason not to add these --- any objections out there?
>
> Oh, wait a minute. Patching unaccent.rules directly isn't the way
> to do this; that file is supposed to be generated by
> generate_unaccent_rules.py. Can you see how to modify that script
> to produce these rules?
Looking at one example from this patch:
UTF8: <E1><BA><A5>
Codepoint: 1EA5
Name: LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
In UnicodeData.txt it's this line:
1EA5;LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE;Ll;0;L;00E2 0301;;;;N;;;1EA4;;1EA4
The problem is that generate_unaccent_rules.py assumes that the
decomposition field is a plain letter followed by some number of
diacritical modifiers. That's true for the characters with a single
accent, but in this multi-accent case it's a *composed* character, 00E2
(LATIN SMALL LETTER A WITH CIRCUMFLEX), followed by a combining mark,
0301 (COMBINING ACUTE ACCENT). So we need to teach the script to
decompose recursively.
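
A minimal sketch of what "recursive" could mean here. This is not the
code in generate_unaccent_rules.py: it assumes Python's standard
unicodedata module as the data source rather than parsing
UnicodeData.txt, and the helper names is_mark and base_letter are
made up for illustration.

```python
import unicodedata


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def base_letter(ch):
    """Return the plain letter underlying ch, or None if ch does not
    reduce to a letter followed by combining marks at some depth."""
    decomp = unicodedata.decomposition(ch)
    if not decomp:
        # No decomposition: a plain letter is its own base.
        return ch if unicodedata.category(ch).startswith("L") else None
    if decomp.startswith("<"):
        # Compatibility mappings (<compat>, <font>, ...) are ignored
        # in this sketch.
        return None
    first, *rest = [chr(int(cp, 16)) for cp in decomp.split()]
    if not all(is_mark(c) for c in rest):
        return None
    # The first code point may itself be a composed character
    # (00E2 in the example above), so recurse on it.
    return base_letter(first)
```

With this, base_letter("\u1ea5") walks 1EA5 -> 00E2 + 0301, then
00E2 -> 0061 + 0302, and stops at the plain letter 0061.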
--
Thomas Munro
http://www.enterprisedb.com