From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | hugh(at)whtc(dot)ca |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, daniel(at)manitou-mail(dot)org, PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | Re: BUG #15548: Unaccent does not remove combining diacritical characters |
Date: | 2018-12-18 04:05:00 |
Message-ID: | CAEepm=0qb_nx-f8cACS1=1NdmCj-3D9zXFU+RJHsFbZEztcqjg@mail.gmail.com |
Lists: | pgsql-bugs pgsql-hackers |
On Tue, Dec 18, 2018 at 12:03 PM Hugh Ranalli <hugh(at)whtc(dot)ca> wrote:
> On Mon, 17 Dec 2018 at 15:31, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Hugh Ranalli <hugh(at)whtc(dot)ca> writes:
>> > I've attached two patches, one to update generate_unaccent_rules.py, and
>> > another that updates unaccent.rules from the v34 transliteration file.
>>
>> I think you forgot the patches?
>
>
> Sigh, yes I did. That's what I get for trying to get this sent out before heading to an appointment. Patches attached and will add to CF. Let me know if you see anything amiss.
+ʹ '
+ʺ "
+ʻ '
+ʼ '
+ʽ '
+˂ <
+˃ >
+˄ ^
+ˆ ^
+ˈ '
+ˋ `
+ː :
+˖ +
+˗ -
+˜ ~
I don't think this is quite right. Those don't seem to be the
combining codepoints[1], and in any case they are being replaced with
ASCII characters, whereas I thought we wanted to replace them with
nothing at all. Here is my attempt to come up with a test case using
combining characters:
select unaccent('un café crème s''il vous plaît');
It's not stripping the accents. I've attached that in a file for
reference so you can run it with psql -f x.sql, and you can see that
it's using combining code points (code points 0301, 0300, 0302 which
come out as cc81, cc80, cc82 in UTF-8) like so:
$ xxd x.sql
00000000: 7365 6c65 6374 2075 6e61 6363 656e 7428  select unaccent(
00000010: 2775 6e20 6361 6665 cc81 2063 7265 cc80  'un cafe.. cre..
00000020: 6d65 2073 2727 696c 2076 6f75 7320 706c  me s''il vous pl
00000030: 6169 cc82 7427 293b 0a0a                 ai..t');..
(To come up with that I used the trick of typing ":%!xxd" and then
when finished ":%!xxd -r", to turn vim into a hex editor.)
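
For anyone who would rather check this without a hex editor, here is a rough Python sketch of the same comparison. It uses only the standard unicodedata module. The helper names describe() and strip_combining() are made up for illustration, and strip_combining() is only meant to show the result I'd expect for the test string; it is not how unaccent works internally, since unaccent is driven by its rules file.

```python
import unicodedata

# Characters that the quoted patch hunk maps to ASCII, copied from the hunk.
PATCH_CHARS = "ʹʺʻʼʽ˂˃˄ˆˈˋː˖˗˜"

# The combining marks used in x.sql: acute, grave, circumflex.
COMBINING_MARKS = "\u0301\u0300\u0302"

# The contents of x.sql, rebuilt from the hex dump (two trailing newlines).
QUERY = "select unaccent('un cafe\u0301 cre\u0300me s''il vous plai\u0302t');\n\n"


def describe(ch):
    """Return (code point, general category, combining class, UTF-8 hex)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.combining(ch),
        ch.encode("utf-8").hex(),
    )


def strip_combining(text):
    """Hypothetical helper: drop every character with a nonzero combining class."""
    return "".join(c for c in text if unicodedata.combining(c) == 0)


if __name__ == "__main__":
    # Patch characters report combining class 0; the marks from x.sql do not.
    for ch in PATCH_CHARS + COMBINING_MARKS:
        print(*describe(ch))

    # Size of the file in bytes once encoded as UTF-8.
    print(len(QUERY.encode("utf-8")))

    # What the test string looks like with the combining marks removed.
    print(strip_combining("un cafe\u0301 cre\u0300me s'il vous plai\u0302t"))
```

The characters from the patch hunk all report a combining class of 0, while the three marks in x.sql report a nonzero class and encode to cc81, cc80 and cc82, matching the dump above.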
[1] https://en.wikipedia.org/wiki/Combining_Diacritical_Marks
--
Thomas Munro
http://www.enterprisedb.com
Attachment | Content-Type | Size |
---|---|---|
x.sql | application/octet-stream | 58 bytes |