From: Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch for bug #12845 (GB18030 encoding)
Date: 2015-05-15 15:49:22
Message-ID: CAG6W84J+BJ0hEe1yrPL4bxVz-MaqCFdHkWRWVBiq8BaCoY8j3Q@mail.gmail.com
Lists: pgsql-hackers
On Fri, May 15, 2015 at 4:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com> writes:
>> GB18030 is a special case, because it's a full mapping of all unicode
>> characters, and most of it is algorithmically defined.
>
> True.
>
>> This makes UtfToLocal a bad choice to implement it.
>
> I disagree with that conclusion. There are still 30000+ characters
> that need to be translated via lookup table, so we still need either
> UtfToLocal or a clone of it; and as I said previously, I'm not on board
> with cloning it.
>
>> I think the best solution is to get rid of UtfToLocal for GB18030. Use
>> a specialized algorithm:
>> - For characters > U+FFFF use the algorithm from my patch
>> - For characters <= U+FFFF use special mapping tables to map from/to
>> UTF32. Those tables would be smaller, and the code would be faster (I
>> assume).
>
> I looked at what Wikipedia claims is the authoritative conversion table:
>
> http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
>
> According to that, about half of the characters below U+FFFF can be
> processed via linear conversions, so I think we ought to save table
> space by doing that. However, the remaining stuff that has to be
> processed by lookup still contains a pretty substantial number of
> characters that map to 4-byte GB18030 characters, so I don't think
> we can get any table size savings by adopting a bespoke table format.
> We might as well use UtfToLocal. (Worth noting in this connection
> is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
> table entries for other encodings, even though most of the others
> are not concerned with characters outside the BMP.)
>
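For context, the "algorithmic" part under discussion is the linear relationship between GB18030 four-byte sequences and the code points above U+FFFF. A minimal C sketch of that mapping (not taken from the patch itself; it only assumes the standard GB18030 byte ranges, with the sequence 0x90 0x30 0x81 0x30 corresponding to U+10000 and no validation):

#include <stdint.h>

/*
 * Illustrative only: convert a GB18030 four-byte sequence b1 b2 b3 b4
 * (b1, b3 in 0x81..0xFE; b2, b4 in 0x30..0x39) to a code point >= U+10000,
 * and back.
 */
static uint32_t
gb18030_4byte_to_codepoint(uint8_t b1, uint8_t b2, uint8_t b3, uint8_t b4)
{
	uint32_t	linear;

	/* flatten the four bytes into a single linear index */
	linear = (((uint32_t) (b1 - 0x81) * 10 + (b2 - 0x30)) * 126 +
			  (b3 - 0x81)) * 10 + (b4 - 0x30);

	/* 0x90 0x30 0x81 0x30 has linear index 189000 and maps to U+10000 */
	return 0x10000 + (linear - 189000);
}

static void
codepoint_to_gb18030_4byte(uint32_t cp, uint8_t *out)
{
	uint32_t	linear = cp - 0x10000 + 189000;

	/* peel the digits back off, least significant byte first */
	out[3] = 0x30 + linear % 10;
	linear /= 10;
	out[2] = 0x81 + linear % 126;
	linear /= 126;
	out[1] = 0x30 + linear % 10;
	linear /= 10;
	out[0] = 0x81 + linear;
}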
It's not about 4-byte vs 2-byte entries; it's about using 8 bytes per entry
vs 4. UtfToLocal uses a sparse array:
map = {{0, x}, {1, y}, {2, z}, ...}
vs.
map = {x, y, z, ...}
That's fine when not every code point is used, but GB18030 is different:
almost all code points are used. A plain array indexed by code point saves
space and avoids the binary search (sketched below).
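To make that concrete, here is an illustrative C sketch (the names are made up, not the actual backend structures): the sparse UtfToLocal-style layout stores an 8-byte pair per mapped code point and needs a binary search, while a dense layout is indexed directly:

#include <stdint.h>
#include <stdlib.h>

/* Sparse layout: 8 bytes per entry, looked up by binary search. */
typedef struct
{
	uint32_t	utf;		/* Unicode code point */
	uint32_t	code;		/* corresponding GB18030 code */
} utf_local_pair;

static int
pair_cmp(const void *key, const void *elem)
{
	uint32_t	cp = *(const uint32_t *) key;
	const utf_local_pair *p = elem;

	return (cp > p->utf) - (cp < p->utf);
}

static uint32_t
lookup_sparse(const utf_local_pair *map, size_t n, uint32_t cp)
{
	const utf_local_pair *hit =
		bsearch(&cp, map, n, sizeof(utf_local_pair), pair_cmp);

	return hit ? hit->code : 0;	/* 0 = not mapped */
}

/*
 * Dense layout: when (nearly) every code point in a range is mapped,
 * 4 bytes per entry and a direct index replace the search entirely.
 */
static uint32_t
lookup_dense(const uint32_t *map, uint32_t range_start, uint32_t cp)
{
	return map[cp - range_start];
}

With virtually every code point mapped, the dense form is both half the size and O(1) per lookup.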
Gr. Arjen