From: | Dang Minh Huong <kakalot49(at)gmail(dot)com> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Kha Nguyen <nlhkha(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Extra Vietnamese unaccent rules |
Date: | 2017-05-28 07:55:07 |
Message-ID: | D367CC2F-5595-4370-827A-C439C0361979@gmail.com |
Lists: | pgsql-hackers |
Hi,
I am interested in this thread.
> On May 27, 2017, at 10:41, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
>
> On Fri, May 26, 2017 at 5:48 PM, Thomas Munro
> <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> Unicode has two ways to represent characters with accents: either with
>> composed codepoints like "é" or decomposed codepoints where you say
>> "e" and then "´". The field "00E2 0301" is the decomposed form of
>> that character above. Our job here is to identify the basic letter
>> that each composed character contains, by analysing the decomposed
>> field that you see in that line. I failed to realise that characters
>> with TWO accents are described as a composed character with ONE accent
>> plus another accent.
>
> Doesn't that depend on the NF operation you are working on? With a
> canonical decomposition it seems to me that a character with two
> accents can as well be decomposed with one character and two composing
> character accents (NFKC does a canonical decomposition in one of its
> steps).
>
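Both views can be checked with Python's standard unicodedata module. The following is only a minimal sketch, assuming U+1EA5 (ấ) as the example character; the variable names are made up for illustration. The decomposition field in UnicodeData.txt holds a single step ("00E2 0301"), while a full canonical decomposition (NFD) applies that mapping recursively and ends with one base letter plus two combining marks.

```python
import unicodedata

ch = "\u1ea5"  # LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE

# One-step mapping, as stored in the decomposition field of UnicodeData.txt.
one_step = unicodedata.decomposition(ch)

# Full canonical decomposition (NFD) applies the mapping recursively.
full = [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

print(one_step)  # 00E2 0301
print(full)      # ['0061', '0302', '0301']
```

So a script that reads only the raw field sees one accent at a time, which is why the base letter has to be found by following the first codepoint again.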
>> You don't have to worry about decoding that line, it's all done in
>> that Python script. The problem is just in the function
>> is_letter_with_marks(). Instead of just checking if combining_ids[0]
>> is a plain letter, it looks like it should also check if
>> combining_ids[0] itself is a letter with marks. Also get_plain_letter
>> would need to be able to recurse to extract the "a".
>
Thanks for the report and for the explanation of Unicode.
I have attached a patch that follows Thomas's instructions. Could you please review it?
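To make the idea concrete, here is a rough sketch of the recursive check Thomas describes. It is not the attached patch: the Codepoint class, the simplified is_plain_letter() test, and the tiny hand-built table are assumptions for illustration only, loosely modelled on the names mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Codepoint:
    # Simplified stand-in for one parsed UnicodeData.txt entry (assumption).
    id: int
    general_category: str
    combining_ids: list = field(default_factory=list)


def is_plain_letter(cp):
    # Simplification: a letter with no decomposition at all.
    return cp.general_category.startswith("L") and not cp.combining_ids


def is_mark(cp):
    return cp.general_category.startswith("M")


def is_letter_with_marks(cp, table):
    if len(cp.combining_ids) < 2:
        return False
    # Everything after the first element must be a combining mark.
    if not all(is_mark(table[i]) for i in cp.combining_ids[1:]):
        return False
    base = table[cp.combining_ids[0]]
    # The first element may be a plain letter or, recursively,
    # another letter with marks (the two-accent case).
    return is_plain_letter(base) or is_letter_with_marks(base, table)


def get_plain_letter(cp, table):
    # Assumes is_letter_with_marks(cp, table) is true.
    base = table[cp.combining_ids[0]]
    if is_plain_letter(base):
        return base
    return get_plain_letter(base, table)  # recurse to reach the "a"


table = {cp.id: cp for cp in [
    Codepoint(0x0061, "Ll"),                    # a
    Codepoint(0x0302, "Mn"),                    # combining circumflex
    Codepoint(0x0301, "Mn"),                    # combining acute
    Codepoint(0x00E2, "Ll", [0x0061, 0x0302]),  # a + circumflex
    Codepoint(0x1EA5, "Ll", [0x00E2, 0x0301]),  # (a + circumflex) + acute
]}

print(is_letter_with_marks(table[0x1EA5], table))      # True
print(chr(get_plain_letter(table[0x1EA5], table).id))  # a
```

Without the recursive branch, U+1EA5 would be rejected because its first element, U+00E2, is not a plain letter.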
> Actually, with the recent work that has been done with
> unicode_norm_table.h which has been to transpose UnicodeData.txt into
> user-friendly tables, shouldn't the python script of unaccent/ be
> replaced by something that works on this table? This does a canonical
> decomposition but just keeps the first characters with a class
> ordering of 0. So we have basic APIs able to look at UnicodeData.txt
> and let caller do decision making with the result returned.
> --
> Michael
Thanks, I will look into it.
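As I understand the suggestion, the effect would be roughly the following. This is only a sketch that uses Python's unicodedata module as a stand-in for the table in unicode_norm_table.h; the function name strip_marks is hypothetical.

```python
import unicodedata


def strip_marks(text):
    # Canonical decomposition, then keep only the characters whose
    # canonical combining class is 0 (base characters, not accents).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


print(strip_marks("Tiếng Việt"))  # Tieng Viet
```

Letters such as "đ" have no canonical decomposition, so this approach alone would leave them unchanged.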
---
Dang Minh Huong