Re: Patch for bug #12845 (GB18030 encoding)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch for bug #12845 (GB18030 encoding)
Date: 2015-05-15 19:18:26
Message-ID: 22735.1431717506@sss.pgh.pa.us
Lists: pgsql-hackers

Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com> writes:
> On Fri, May 15, 2015 at 4:10 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> According to that, about half of the characters below U+FFFF can be
>> processed via linear conversions, so I think we ought to save table
>> space by doing that. However, the remaining stuff that has to be
>> processed by lookup still contains a pretty substantial number of
>> characters that map to 4-byte GB18030 characters, so I don't think
>> we can get any table size savings by adopting a bespoke table format.
>> We might as well use UtfToLocal. (Worth noting in this connection
>> is that we haven't seen fit to sweat about UtfToLocal's use of 4-byte
>> table entries for other encodings, even though most of the others
>> are not concerned with characters outside the BMP.)

> It's not about 4 vs 2 bytes, it's about using 8 bytes vs 4. UtfToLocal
> uses a sparse array:

> map = {{0, x}, {1, y}, {2, z}, ...}

> vs.

> map = {x, y, z, ...}

> That's fine when not every code point is used, but it's different for
> GB18030 where almost all code points are used. Using a plain array
> saves space and saves a binary search.
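
For illustration, the two layouts in the quoted text can be sketched in C as below. This is a minimal sketch, not the PostgreSQL code: the struct `pair_entry`, the functions `lookup_pairs` and `lookup_plain`, the `first` offset parameter, and the use of 0 as a "not found" value are all hypothetical names and conventions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pair layout: every entry carries an explicit key and a value. */
typedef struct
{
    uint32_t utf;   /* key: source character code */
    uint32_t code;  /* value: target character code */
} pair_entry;

/*
 * Sparse form, map = {{0, x}, {1, y}, ...}: binary search over entries
 * sorted by key. Returns 0 when the key is absent (assumed convention).
 */
static uint32_t
lookup_pairs(const pair_entry *map, size_t n, uint32_t utf)
{
    size_t lo = 0;
    size_t hi = n;

    assert(map != NULL || n == 0);
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (map[mid].utf == utf)
            return map[mid].code;
        if (map[mid].utf < utf)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/*
 * Plain form, map = {x, y, z, ...}: the key is implied by the array index,
 * so a lookup is one bounds check and one array access. "first" is the key
 * that corresponds to map[0].
 */
static uint32_t
lookup_plain(const uint32_t *map, size_t n, uint32_t first, uint32_t utf)
{
    assert(map != NULL || n == 0);
    if (utf < first || utf - first >= n)
        return 0;
    return map[utf - first];
}
```

The pair form pays for a stored key in every entry and a logarithmic search, but it can skip unused keys. The plain form needs a slot for every key in its range, used or not.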

Well, it doesn't save any space: if we get rid of the additional linear
ranges in the lookup table, what remains is 30733 entries requiring about
256K, same as (or a bit less than) what you suggest.
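
The size comparison can be checked with a little arithmetic. The sketch below assumes 4-byte keys and 4-byte values in the pair table, and it assumes the plain array under discussion would have one 4-byte slot for each of the 65536 code points up to U+FFFF. That second figure is an assumption made for this example, not something stated in the thread, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes for a pair table: each entry stores a 4-byte key and a 4-byte value. */
static size_t
pair_table_bytes(size_t entries)
{
    /* Guard against overflow in the multiplication. */
    assert(entries <= SIZE_MAX / (2 * sizeof(uint32_t)));
    return entries * 2 * sizeof(uint32_t);
}

/* Bytes for a plain array: one 4-byte value per slot, key implied by index. */
static size_t
plain_table_bytes(size_t slots)
{
    assert(slots <= SIZE_MAX / sizeof(uint32_t));
    return slots * sizeof(uint32_t);
}
```

Under those assumptions, `pair_table_bytes(30733)` is 245864 bytes (about 240 KiB) and `plain_table_bytes(65536)` is 262144 bytes (256 KiB), which is consistent with "same as (or a bit less than)".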

The point about possibly being able to do this with a simple lookup table
instead of binary search is valid, but I still say it's a mistake to
suppose that we should consider that only for GB18030. With the reduced
table size, the GB18030 conversion tables are not all that far out of line
with the other Far Eastern conversions:

$ size utf8*.so | sort -n
   text    data     bss     dec     hex filename
   1880     512      16    2408     968 utf8_and_ascii.so
   2394     528      16    2938     b7a utf8_and_iso8859_1.so
   6674     512      16    7202    1c22 utf8_and_cyrillic.so
  24318     904      16   25238    6296 utf8_and_win.so
  28750     968      16   29734    7426 utf8_and_iso8859.so
 121110     512      16  121638   1db26 utf8_and_euc_cn.so
 123458     512      16  123986   1e452 utf8_and_sjis.so
 133606     512      16  134134   20bf6 utf8_and_euc_kr.so
 185014     512      16  185542   2d4c6 utf8_and_sjis2004.so
 185522     512      16  186050   2d6c2 utf8_and_euc2004.so
 212950     512      16  213478   341e6 utf8_and_euc_jp.so
 221394     512      16  221922   362e2 utf8_and_big5.so
 274772     512      16  275300   43364 utf8_and_johab.so
 277776     512      16  278304   43f20 utf8_and_uhc.so
 332262     512      16  332790   513f6 utf8_and_euc_tw.so
 350640     512      16  351168   55bc0 utf8_and_gbk.so
 496680     512      16  497208   79638 utf8_and_gb18030.so

If we were to get excited about reducing the conversion time for GB18030,
it would clearly make sense to use similar infrastructure for GBK, and
perhaps the EUC encodings too.

However, I'm not that excited about changing it. We have not heard field
complaints about these converters being too slow. What's more, there
doesn't seem to be any practical way to apply the same idea to the other
conversion direction, which means if you do feel there's a speed problem
this would only halfway fix it.

So my feeling is that the most practical and maintainable answer is to
keep GB18030 using code that is mostly shared with the other encodings.
I've committed a fix that does it that way for 9.5. If you want to
pursue the idea of a faster conversion using direct lookup tables,
I think that would be 9.6 material at this point.

regards, tom lane
