Re: Patch for bug #12845 (GB18030 encoding)

From: Arjen Nienhuis <a(dot)g(dot)nienhuis(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch for bug #12845 (GB18030 encoding)
Date: 2015-05-19 13:57:02
Message-ID: CAG6W84JJ5jmgPFSqgSufO1XbRjScH4kBrzmj50xHSd_ZaCMh4A@mail.gmail.com
Lists: pgsql-hackers

>> That's fine when not every code point is used, but it's different for
>> GB18030 where almost all code points are used. Using a plain array
>> saves space and saves a binary search.
>
> Well, it doesn't save any space: if we get rid of the additional linear
> ranges in the lookup table, what remains is 30733 entries requiring about
> 256K, same as (or a bit less than) what you suggest.

We could do both. What about something like this:

static unsigned int utf32_to_gb18030_from_0x0001[1105] = {
/* 0x0 */ 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8,
...
static unsigned int utf32_to_gb18030_from_0x2010[1587] = {
/* 0x0 */ 0xa95c, 0x8136a532, 0x8136a533, 0xa843, 0xa1aa, 0xa844,
0xa1ac, 0x8136a534,
...
static unsigned int utf32_to_gb18030_from_0x2E81[28965] = {
/* 0x0 */ 0xfe50, 0x8138fd39, 0x8138fe30, 0xfe54, 0x8138fe31,
0x8138fe32, 0x8138fe33, 0xfe57,
...
static unsigned int utf32_to_gb18030_from_0xE000[2149] = {
/* 0x0 */ 0xaaa1, 0xaaa2, 0xaaa3, 0xaaa4, 0xaaa5, 0xaaa6, 0xaaa7, 0xaaa8,
...
static unsigned int utf32_to_gb18030_from_0xF92C[254] = {
/* 0x0 */ 0xfd9c, 0x84308535, 0x84308536, 0x84308537, 0x84308538,
0x84308539, 0x84308630, 0x84308631,
...
static unsigned int utf32_to_gb18030_from_0xFE30[464] = {
/* 0x0 */ 0xa955, 0xa6f2, 0x84318538, 0xa6f4, 0xa6f5, 0xa6e0, 0xa6e1, 0xa6f0,
...

static uint32
conv_utf8_to_18030(uint32 code)
{
    uint32      ucs = utf8word_to_unicode(code);

/* Linear range: compute the GB code arithmetically (bounds inclusive) */
#define conv_lin(minunicode, maxunicode, mincode) \
    if (ucs >= minunicode && ucs <= maxunicode) \
        return gb_unlinear(ucs - minunicode + gb_linear(mincode))

/*
 * Dense range: direct array lookup.  The upper bound is exclusive here,
 * matching the array sizes (e.g. 0x0452 - 0x0001 = 1105 entries).
 */
#define conv_array(minunicode, maxunicode) \
    if (ucs >= minunicode && ucs < maxunicode) \
        return utf32_to_gb18030_from_##minunicode[ucs - minunicode]

    conv_array(0x0001, 0x0452);
    conv_lin(0x0452, 0x200F, 0x8130D330);
    conv_array(0x2010, 0x2643);
    conv_lin(0x2643, 0x2E80, 0x8137A839);
    conv_array(0x2E81, 0x9FA6);
    conv_lin(0x9FA6, 0xD7FF, 0x82358F33);
    conv_array(0xE000, 0xE865);
    conv_lin(0xE865, 0xF92B, 0x8336D030);
    conv_array(0xF92C, 0xFA2A);
    conv_lin(0xFA2A, 0xFE2F, 0x84309C38);
    conv_array(0xFE30, 0x10000);
    conv_lin(0x10000, 0x10FFFF, 0x90308130);
    /* No mapping exists */
    return 0;
}
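
Here gb_linear() and gb_unlinear() are the linearization helpers from the
committed fix: they map a four-byte GB18030 code word to and from a linear
index, which is what lets conv_lin() do plain arithmetic. Roughly (a sketch
of the idea, not a quote of the committed code):

/*
 * A 4-byte GB18030 code has byte ranges 0x81-0xFE, 0x30-0x39, 0x81-0xFE,
 * 0x30-0x39, i.e. a mixed-radix number with digits of size 126, 10, 126,
 * 10 -- hence the radices 12600, 1260, 10 below.
 */
static uint32
gb_linear(uint32 gb)
{
    uint32      b0 = (gb >> 24) & 0xff;
    uint32      b1 = (gb >> 16) & 0xff;
    uint32      b2 = (gb >> 8) & 0xff;
    uint32      b3 = gb & 0xff;

    return (b0 - 0x81) * 12600 + (b1 - 0x30) * 1260 +
        (b2 - 0x81) * 10 + (b3 - 0x30);
}

static uint32
gb_unlinear(uint32 lin)
{
    uint32      r0 = 0x81 + lin / 12600;
    uint32      r1 = 0x30 + (lin / 1260) % 10;
    uint32      r2 = 0x81 + (lin / 10) % 126;
    uint32      r3 = 0x30 + lin % 10;

    return (r0 << 24) | (r1 << 16) | (r2 << 8) | r3;
}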

>
> The point about possibly being able to do this with a simple lookup table
> instead of binary search is valid, but I still say it's a mistake to
> suppose that we should consider that only for GB18030. With the reduced
> table size, the GB18030 conversion tables are not all that far out of line
> with the other Far Eastern conversions:
>
> $ size utf8*.so | sort -n
> text data bss dec hex filename
> 1880 512 16 2408 968 utf8_and_ascii.so
> 2394 528 16 2938 b7a utf8_and_iso8859_1.so
> 6674 512 16 7202 1c22 utf8_and_cyrillic.so
> 24318 904 16 25238 6296 utf8_and_win.so
> 28750 968 16 29734 7426 utf8_and_iso8859.so
> 121110 512 16 121638 1db26 utf8_and_euc_cn.so
> 123458 512 16 123986 1e452 utf8_and_sjis.so
> 133606 512 16 134134 20bf6 utf8_and_euc_kr.so
> 185014 512 16 185542 2d4c6 utf8_and_sjis2004.so
> 185522 512 16 186050 2d6c2 utf8_and_euc2004.so
> 212950 512 16 213478 341e6 utf8_and_euc_jp.so
> 221394 512 16 221922 362e2 utf8_and_big5.so
> 274772 512 16 275300 43364 utf8_and_johab.so
> 277776 512 16 278304 43f20 utf8_and_uhc.so
> 332262 512 16 332790 513f6 utf8_and_euc_tw.so
> 350640 512 16 351168 55bc0 utf8_and_gbk.so
> 496680 512 16 497208 79638 utf8_and_gb18030.so
>
> If we were to get excited about reducing the conversion time for GB18030,
> it would clearly make sense to use similar infrastructure for GBK, and
> perhaps the EUC encodings too.

I'll check them as well. If they have linear ranges, the same approach
should work.
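
A quick way to check is to scan the existing map arrays for maximal runs
where the Unicode point and the (linearized) local code both advance by
one. A throwaway sketch, assuming the usual sorted {utf, code} pair layout
of the generated maps (struct and function names here are illustrative;
linearize can be the identity for the 2-byte encodings):

#include <stdio.h>

typedef unsigned int uint32;

typedef struct
{
    uint32      utf;            /* Unicode code point */
    uint32      code;           /* local (e.g. GBK) code */
} pg_utf_to_local;

/*
 * Print maximal runs where utf and linearize(code) both step by 1;
 * only runs long enough to be worth a conv_lin() are reported.
 */
static void
report_linear_runs(const pg_utf_to_local *map, int size,
                   uint32 (*linearize) (uint32))
{
    int         start = 0;

    for (int i = 1; i <= size; i++)
    {
        if (i == size ||
            map[i].utf != map[i - 1].utf + 1 ||
            linearize(map[i].code) != linearize(map[i - 1].code) + 1)
        {
            if (i - start >= 256)
                printf("U+%04X..U+%04X -> 0x%X (%d codes)\n",
                       map[start].utf, map[i - 1].utf,
                       map[start].code, i - start);
            start = i;
        }
    }
}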

>
> However, I'm not that excited about changing it. We have not heard field
> complaints about these converters being too slow. What's more, there
> doesn't seem to be any practical way to apply the same idea to the other
> conversion direction, which means if you do feel there's a speed problem
> this would only halfway fix it.

It does work if you linearize it first. That's why we need to convert to
UTF-32 first as well: that's a form of linearization.
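
Concretely, the reverse converter could linearize the incoming GB word
first and then use the same split between arithmetic ranges and per-range
lookup arrays. A sketch for the four-byte case only (the function name is
illustrative; the supplementary planes map linearly to
0x90308130..0xE3329A35 per the GB18030 standard):

static uint32
conv_18030_to_utf32(uint32 code)
{
    uint32      lin = gb_linear(code);

/* Linear range: arithmetic, mirroring conv_lin() above */
#define conv_lin_rev(mincode, maxcode, minunicode) \
    if (lin >= gb_linear(mincode) && lin <= gb_linear(maxcode)) \
        return lin - minunicode ? 0 : 0 /* see below */

    /* U+10000..U+10FFFF <-> 0x90308130..0xE3329A35 */
    if (lin >= gb_linear(0x90308130) && lin <= gb_linear(0xE3329A35))
        return lin - gb_linear(0x90308130) + 0x10000;

    /*
     * Dense ranges would index arrays by (lin - gb_linear(min)),
     * exactly like the utf32_to_gb18030_from_* tables above.
     */

    /* No mapping exists */
    return 0;
}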
