From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Tasos Maschalidis <TaS(dot)O(dot)S(at)hotmail(dot)com>, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: BUG #15347: Unaccent for greek characters does not work |
Date: | 2018-08-24 03:32:28 |
Message-ID: | CAEepm=3bFBfv9CBC1r+n7_TsrsWY_JxFqAsUKUozKDvcbstdhw@mail.gmail.com |
Lists: | pgsql-bugs |
On Fri, Aug 24, 2018 at 2:07 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> writes:
>> On Fri, Aug 24, 2018 at 12:12 PM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>>> Perhaps it would be better to avoid non-ASCII characters in this script?
>
>> You mean in the Python script? Why? At the top it has a PEP-263
>> encoding declaration:
>> # -*- coding: utf-8 -*-
>
> What happens if someone tries to view this in a non-UTF8 encoding?
>
> As a comparison point, we generally avoid using non-ASCII characters
> directly in the SGML docs; we write out the appropriate SGML entity
> instead. I think we should try to do the equivalent thing here ---
> I assume python has some way to write "U+nnnn" or some such.
Ok, 2 against 1. Done.
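For readers following along, here is a minimal sketch of the ASCII-only spelling discussed above. It is not taken from the attached patch; the constant names are hypothetical, and the `u` prefix is used only so the literals behave the same on Python 2 and Python 3.3+.

```python
# -*- coding: utf-8 -*-
# Illustrative only: two ASCII-safe ways to name a non-ASCII character,
# so the source file reads the same in any ASCII-compatible encoding.
import unicodedata

# Hypothetical constant names, chosen for this example.
FINAL_SIGMA_BY_CODEPOINT = u"\u03c2"                        # "U+nnnn" style escape
FINAL_SIGMA_BY_NAME = u"\N{GREEK SMALL LETTER FINAL SIGMA}"  # named escape

# Both escapes denote the same single character.
same_character = FINAL_SIGMA_BY_CODEPOINT == FINAL_SIGMA_BY_NAME

# The Unicode name can be recovered for use in comments or messages.
sigma_name = unicodedata.name(FINAL_SIGMA_BY_CODEPOINT)
```

With escapes like these the PEP-263 declaration becomes harmless rather than required, since no byte in the file falls outside ASCII.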

I'll wait for other opinions on what to do about lower case sigma
before committing. I'm not keen on adding that special case because:

1. It's a new kind of thing: previously we did only accent and
ligature removal, but this is removal of variants that exist in only
one case. It's admittedly a bit like the German ß, which lacks an
upper case version according to some German speakers and undergoes a
lossy conversion to double-S, but that was already handled without a
special case by ligature expansion, so it's not the same thing.

2. We are down to only 5 hardcoded special cases: two Cyrillic
characters which I suspect will go away if we allow Cyrillic to be
processed via the general mechanism as we are doing here with Greek,
and 3 oddballs that we inherited from the old hand-maintained
unaccent.rules files: DEGREE CELSIUS, DEGREE FAHRENHEIT, and SOUND
RECORDING COPYRIGHT. I think the degrees signs can be done
automatically with just a bit more Unicode smarts, and I might try
reporting SOUND RECORDING COPYRIGHT as missing from
<character-fallback> to the CLDR project whose data we're using.

3. The problem seems to go away by itself if you convert to upper case.
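
The three points can be checked against Python's own Unicode tables. The following is a rough sketch, not the logic of the attached patch or of the rules generator: `strip_accents` and `compat_expand` are hypothetical helpers, and the guess that "a bit more Unicode smarts" means compatibility decomposition is an assumption.

```python
import unicodedata

FINAL_SIGMA = u"\u03c2"    # GREEK SMALL LETTER FINAL SIGMA
SMALL_SIGMA = u"\u03c3"    # GREEK SMALL LETTER SIGMA
CAPITAL_SIGMA = u"\u03a3"  # GREEK CAPITAL LETTER SIGMA


def strip_accents(text):
    """Hypothetical helper: canonically decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return u"".join(c for c in decomposed if not unicodedata.combining(c))


def compat_expand(text):
    """Hypothetical helper: apply compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", text)


# Point 1: accent removal handles tonos, but final sigma is not an accented
# form of sigma, so it passes through unchanged.
alpha_plain = strip_accents(u"\u03ac")       # alpha with tonos
sigma_unchanged = strip_accents(FINAL_SIGMA)

# Point 2: the two degree signs have compatibility decompositions,
# while SOUND RECORDING COPYRIGHT has none.
celsius = compat_expand(u"\u2103")
fahrenheit = compat_expand(u"\u2109")
sound_copyright = compat_expand(u"\u2117")

# Point 3: both lower case sigmas map to the same upper case letter.
sigmas_agree = FINAL_SIGMA.upper() == SMALL_SIGMA.upper() == CAPITAL_SIGMA
```

Here the degree signs expand to a degree sign followed by C or F, and U+2117 comes back untouched, which fits the idea that only the latter needs outside data such as a CLDR fallback entry.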
--
Thomas Munro
http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
0001-Add-Greek-characters-to-unaccent.rules-v2.patch | application/octet-stream | 4.2 KB |