From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Gradek <mike(at)busbud(dot)com>, PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org> |
Subject: | Re: BUG #13440: unaccent does not remove all diacritics |
Date: | 2015-06-16 03:30:53 |
Message-ID: | CAEepm=2XAMTA8r3V682_aNZOz1kB3MdMvymDQmO7TA0qg99GAA@mail.gmail.com |
Lists: | pgsql-bugs |
On Tue, Jun 16, 2015 at 8:07 AM, Thomas Munro
<thomas(dot)munro(at)enterprisedb(dot)com> wrote:
> On Tue, Jun 16, 2015 at 12:55 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
>>> My terminal shows these characters to be different. One is
>>> http://graphemica.com/%C8%9B
>>> latin small letter t with comma below (U+021B)
>>
>>> The other is
>>> http://graphemica.com/%C5%A3
>>> latin small letter t with cedilla (U+0163)
>>
>> Ah-hah -- I did not look closely enough. So the immediate answer for
>> Michael is to add another entry to his unaccent.rules file.
>>
>> Should we add the missing character to the standard unaccent.rules file?
>
> It looks like Romanian also has s with comma. Perhaps we should have
> all these characters:
>
> $ curl -s http://unicode.org/Public/7.0.0/ucd/UnicodeData.txt | egrep
> ';LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH ' | wc -l
> 702
Here is an unaccent.rules file that maps those 702 characters from
Unicode 7.0 with names like "LATIN (SMALL|CAPITAL) LETTER [A-Z] WITH
..." to their base letter, plus 14 extra cases to match the existing
unaccent.rules file. If you sort and diff this and the existing file,
you can see that this file only adds new lines. Also, here is the
script I used to build it from UnicodeData.txt.
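
For readers without the attachment to hand, here is a rough sketch of how such a mapping could be derived. It is not the attached make_rules.py. The names NAME_RE, make_rules, format_rules and SAMPLE are made up for illustration, the tab-separated output is an assumption about the rules file layout, and the 14 extra cases mentioned above are not handled. It only relies on the first two semicolon-separated fields of UnicodeData.txt (code point and character name).

```python
import re

# Same name pattern as the egrep command quoted above, applied to the
# name field only. NAME_RE is a hypothetical name for this sketch.
NAME_RE = re.compile(r"^LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ")


def make_rules(lines):
    """Yield (accented_char, base_letter) pairs from UnicodeData.txt lines."""
    for line in lines:
        fields = line.split(";")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        match = NAME_RE.match(fields[1])
        if match is None:
            continue  # not a "LATIN ... LETTER X WITH ..." character
        case, letter = match.groups()
        # The name always spells the base letter in upper case, so
        # lower-case it for SMALL letters.
        base = letter.lower() if case == "SMALL" else letter
        yield chr(int(fields[0], 16)), base


def format_rules(lines):
    """Return rules text, one 'accented<TAB>base' pair per line (assumed layout)."""
    return "".join(f"{src}\t{dst}\n" for src, dst in sorted(make_rules(lines)))


# A few lines in UnicodeData.txt layout; only the first two fields matter here.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0162;LATIN CAPITAL LETTER T WITH CEDILLA;Lu;0;L;0054 0327;;;;N;;;;0163;",
    "0163;LATIN SMALL LETTER T WITH CEDILLA;Ll;0;L;0074 0327;;;;N;;;0162;;0162",
    "021B;LATIN SMALL LETTER T WITH COMMA BELOW;Ll;0;L;0074 0326;;;;N;;;021A;;021A",
]

if __name__ == "__main__":
    print(format_rules(SAMPLE), end="")
```

To run it against the real data, one would pass an open UnicodeData.txt file object in place of SAMPLE.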
--
Thomas Munro
http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
unaccent.rules | application/octet-stream | 3.9 KB |
make_rules.py | text/x-python-script | 1.2 KB |