From: | Hugh Ranalli <hugh(at)whtc(dot)ca> |
---|---|
To: | Daniel Verite <daniel(at)manitou-mail(dot)org> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG #15548: Unaccent does not remove combining diacritical characters |
Date: | 2018-12-14 22:42:05 |
Message-ID: | CAAhbUMNqJXTN+_vYdi5L4CLjoq9OCG29V597RKrCQ7xKsCAejA@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
I've attached a patch that removes combining diacriticals. As with Latin and
Greek letters, it uses ranges to limit which characters it affects.
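As an illustration only (this is not the attached patch), a range-based check of this kind could look roughly like the Python sketch below. The range U+0300 to U+036F (the Unicode "Combining Diacritical Marks" block) and all names are assumptions made for the example; the ranges the patch actually uses may differ.

```python
import unicodedata

# Assumed range: the Unicode "Combining Diacritical Marks" block.
# The real patch may cover different or additional ranges.
COMBINING_MARKS_RANGE = (0x0300, 0x036F)

def is_combining_diacritical(codepoint):
    """True if codepoint is in the assumed range and is a nonspacing mark (Mn)."""
    lo, hi = COMBINING_MARKS_RANGE
    return lo <= codepoint <= hi and unicodedata.category(chr(codepoint)) == "Mn"

def deletion_rules():
    """Return one single-character entry per mark, i.e. one rule line each."""
    lo, hi = COMBINING_MARKS_RANGE
    return [chr(cp) for cp in range(lo, hi + 1) if is_combining_diacritical(cp)]
```

Each returned entry corresponds to a rules line containing only one character, which the unaccent documentation describes as a deletion rule.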
I have not submitted a patch for unaccent.rules, because a rules file
generated from generate_unaccent_rules.py appears to drop a large number of
existing rules (even before my changes), such as the one replacing the
copyright symbol © with (C), as well as rules for other accented characters.
It's probably worth asking whether the shipped unaccent.rules should
correspond to what the shipped generation utility produces. I was surprised
to see that it didn't.
Please let me know if you see anything I need to change.
Best wishes,
Hugh
--
Hugh Ranalli
Principal Consultant
White Horse Technology Consulting
e: hugh(at)whtc(dot)ca
c: +01-416-994-7957
w: www.whtc.ca
On Thu, 13 Dec 2018 at 13:50, Hugh Ranalli <hugh(at)whtc(dot)ca> wrote:
>
>
> On Thu, 13 Dec 2018, 11:26 Daniel Verite <daniel(at)manitou-mail(dot)org> wrote:
>
>> Tom Lane wrote:
>>
>> > Hm, I thought the OP's proposal was just to make unaccent drop
>> > combining diacriticals independently of context, which'd avoid the
>> > combinatorial-growth problem.
>>
>
> That's what I was thinking. Given that the accent is separate from the
> characters, simply dropping it should result in the correct unaccented
> character.
>
>>
>> In that case, this could be achieved by simply appending the
>> diacriticals themselves to unaccent.rules, since replacement of a
>> string by an empty string is already supported as a rule.
>> It doesn't seem like the current file has any of these, but from
>> https://www.postgresql.org/docs/11/unaccent.html :
>>
>> "Alternatively, if only one character is given on a line, instances
>> of that character are deleted; this is useful in languages where
>> accents are represented by separate characters"
>>
>
> Yes, I had read that in the docs, and that's the approach I planned to
> take. I'll go ahead and develop a patch, then.
>
> Best wishes,
> Hugh
>
>>
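To illustrate the point made in the quoted exchange, that dropping a separate combining accent leaves the correct base character, here is a small Python sketch. It only simulates character deletion in Python and does not call the unaccent extension. The function name is hypothetical, and deleting every nonspacing mark (category Mn) is a simplifying assumption that is broader than a range-restricted rule set.

```python
import unicodedata

def drop_combining_marks(text):
    """Delete every nonspacing mark (category Mn) from text.

    Simplifying assumption: all Mn characters are dropped, which is
    broader than restricting deletion to specific ranges.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form of e-acute)
decomposed = "e\u0301"
print(drop_combining_marks(decomposed))  # prints "e"
```

A precomposed character such as U+00E9 is not a combining mark, so this sketch leaves it untouched; handling it is the job of the ordinary replacement rules.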
Attachment | Content-Type | Size |
---|---|---|
remove-combining-diacritical-accents-in-unaccent.rules.patch | text/x-patch | 2.5 KB |