From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Patch for bug #12845 (GB18030 encoding) |
Date: | 2015-05-15 14:10:18 |
Message-ID: | 19727.1431699018@sss.pgh.pa.us |
Lists: | pgsql-hackers |
Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com> writes:
> GB18030 is a special case, because it's a full mapping of all Unicode
> characters, and most of it is algorithmically defined.
True.
> This makes UtfToLocal a bad choice to implement it.
I disagree with that conclusion. There are still 30000+ characters
that need to be translated via lookup table, so we still need either
UtfToLocal or a clone of it; and as I said previously, I'm not on board
with cloning it.
> I think the best solution is to get rid of UtfToLocal for GB18030. Use
> a specialized algorithm:
> - For characters > U+FFFF use the algorithm from my patch
> - For characters <= U+FFFF use special mapping tables to map from/to
> UTF32. Those tables would be smaller, and the code would be faster (I
> assume).
I looked at what Wikipedia claims is the authoritative conversion table:
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/gb-18030-2000.xml
According to that, about half of the characters below U+FFFF can be
processed via linear conversions, so I think we ought to save table
space by doing that. However, the remaining stuff that has to be
processed by lookup still contains a pretty substantial number of
characters that map to 4-byte GB18030 characters, so I don't think
we can get any table size savings by adopting a bespoke table format.
We might as well use UtfToLocal. (Worth noting in this connection
is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
table entries for other encodings, even though most of the others
are not concerned with characters outside the BMP.)
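
To make "linear conversion" concrete, here is a minimal sketch for the one
range whose arithmetic is fixed by the encoding itself: code points
U+10000 through U+10FFFF, which occupy a contiguous run of 4-byte GB18030
sequences starting at 0x90 0x30 0x81 0x30. The function names and the
GB18030_SUPP_BASE macro are made up for illustration; this is not the code
from the patch under discussion, and the linear ranges below U+FFFF would
need their own (range start, first code) pairs taken from the conversion
table.

```c
#include <stdint.h>

/*
 * A 4-byte GB18030 sequence b1 b2 b3 b4 has b1,b3 in 0x81..0xFE (126 values)
 * and b2,b4 in 0x30..0x39 (10 values), so it can be read as a mixed-radix
 * number:
 *   idx = (((b1-0x81)*10 + (b2-0x30))*126 + (b3-0x81))*10 + (b4-0x30)
 * Hypothetical constant: the index of 0x90 0x30 0x81 0x30, which encodes
 * U+10000, is (0x90-0x81)*12600 = 189000.
 */
#define GB18030_SUPP_BASE 189000u

/* Encode a supplementary-plane code point; returns 1 on success, 0 if the
 * code point is outside U+10000..U+10FFFF. */
int
ucs_supp_to_gb18030(uint32_t cp, unsigned char out[4])
{
    uint32_t idx;

    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;

    idx = cp - 0x10000 + GB18030_SUPP_BASE;
    out[3] = (unsigned char) (0x30 + idx % 10);
    idx /= 10;
    out[2] = (unsigned char) (0x81 + idx % 126);
    idx /= 126;
    out[1] = (unsigned char) (0x30 + idx % 10);
    idx /= 10;
    out[0] = (unsigned char) (0x81 + idx);
    return 1;
}

/* Decode a 4-byte sequence from the supplementary range; returns 1 on
 * success, 0 if the bytes are malformed or fall outside that range. */
int
gb18030_to_ucs_supp(const unsigned char in[4], uint32_t *cp)
{
    uint32_t idx;

    if (in[0] < 0x81 || in[0] > 0xFE || in[1] < 0x30 || in[1] > 0x39 ||
        in[2] < 0x81 || in[2] > 0xFE || in[3] < 0x30 || in[3] > 0x39)
        return 0;

    idx = ((((uint32_t) in[0] - 0x81) * 10 + (in[1] - 0x30)) * 126 +
           (in[2] - 0x81)) * 10 + (in[3] - 0x30);

    if (idx < GB18030_SUPP_BASE || idx > GB18030_SUPP_BASE + 0xFFFFF)
        return 0;

    *cp = idx - GB18030_SUPP_BASE + 0x10000;
    return 1;
}
```

In a combined scheme, a converter would try range checks like this first
and fall back to the lookup table only for code points that no linear range
covers.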
regards, tom lane