From: | Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com> |
---|---|
To: | hlinnaka(at)iki(dot)fi |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: BUG #12845: The GB18030 encoding doesn't support Unicode characters over 0xFFFF |
Date: | 2015-03-10 22:21:24 |
Message-ID: | CAG6W84JZ-ZFhAM1GQzpVUOW8YM2gx6_-f4uCKU1j2sdmt+wO6g@mail.gmail.com |
Lists: | pgsql-bugs |
On 10 Mar 2015 22:33, "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi> wrote:
>
> On 03/09/2015 10:51 PM, a(dot)g(dot)nienhuis(at)gmail(dot)com wrote:
>>
>> The following bug has been logged on the website:
>>
>> Bug reference: 12845
>> Logged by: Arjen Nienhuis
>> Email address: a(dot)g(dot)nienhuis(at)gmail(dot)com
>> PostgreSQL version: 9.3.5
>> Operating system: Ubuntu Linux
>> Description:
>>
>> Steps to reproduce:
>>
>> In psql:
>>
>> arjen=> select convert_to(chr(128512), 'GB18030');
>>
>> Actual output:
>>
>> ERROR: character with byte sequence 0xf0 0x9f 0x98 0x80 in encoding "UTF8"
>> has no equivalent in encoding "GB18030"
>>
>> Expected output:
>>
>> convert_to
>> ------------
>> \x9439fc36
>> (1 row)
>
>
> Hmm, looks like our gb18030 <-> Unicode conversion table only contains
> the Unicode BMP plane. Unicode points above 0xffff are not included.
>
> If we added all the missing mappings as one to one mappings, like we've
> done for the BMP, that would bloat the table horribly. There are over 1
> million code points that are currently not mapped. Fortunately, the missing
> mappings are in linear ranges that would be fairly simple to handle
> programmatically. See e.g.
> https://ssl.icu-project.org/repos/icu/data/trunk/charset/source/gb18030/gb18030.html.
> Someone needs to write the code (I'm not volunteering myself).
>
> - Heikki
I can write a "uint32 UTF8toGB18030(uint32)" function, but I don't know
where to put it in the code.
(Maybe at line 479 of conv.c:
https://github.com/postgres/postgres/blob/4baaf863eca5412e07a8441b3b7e7482b7a8b21a/src/backend/utils/mb/conv.c#L479
)
Alternatively, I could extend the map file. It would double in size if it
only needs to include valid code points.
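
For illustration, the linear range above 0xFFFF could be handled along these lines. This is only a sketch, not PostgreSQL code: the names gb18030_from_supplementary and gb18030_selfcheck are hypothetical, and the function takes a Unicode code point rather than the packed UTF-8 value that the conversion code in conv.c works with, so a UTF-8 decoding step is assumed to happen first. It relies on U+10000 mapping to the GB18030 bytes 0x90 0x30 0x81 0x30, with later code points following in order (bytes 2 and 4 run over 0x30..0x39, byte 3 over 0x81..0xFE).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of PostgreSQL): map a Unicode code point in
 * the supplementary planes (U+10000..U+10FFFF) to its four-byte GB18030
 * sequence, packed big-endian into a uint32_t.
 *
 * Returns 0 for input outside that range; such code points would still
 * need the existing table lookup.
 */
uint32_t
gb18030_from_supplementary(uint32_t cp)
{
    uint32_t idx, b1, b2, b3, b4;

    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;

    /* Linear offset from U+10000, which maps to 0x90 0x30 0x81 0x30. */
    idx = cp - 0x10000;

    /* Peel off the bytes from least to most significant. */
    b4 = 0x30 + idx % 10;   /* fourth byte: 0x30..0x39 (10 values) */
    idx /= 10;
    b3 = 0x81 + idx % 126;  /* third byte: 0x81..0xFE (126 values) */
    idx /= 126;
    b2 = 0x30 + idx % 10;   /* second byte: 0x30..0x39 (10 values) */
    idx /= 10;
    b1 = 0x90 + idx;        /* first byte: starts at 0x90 */

    return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
}

/* Small sanity check against the value expected in the bug report. */
void
gb18030_selfcheck(void)
{
    /* chr(128512) is U+1F600; the report expects \x9439fc36. */
    assert(gb18030_from_supplementary(0x1F600) == 0x9439FC36u);
    /* First supplementary code point. */
    assert(gb18030_from_supplementary(0x10000) == 0x90308130u);
}
```

With something like this, only the BMP would need table entries, and the map file would not have to grow for the supplementary planes.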