Re: May "PostgreSQL server side GB18030 character set support" reconsidered?

From:	Han Parker <parker(dot)han(at)outlook(dot)com>
To:	Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
Cc:	"pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject:	Re: May "PostgreSQL server side GB18030 character set support" reconsidered?
Date:	2020-10-05 10:08:28
Message-ID:	ME2PR01MB25323BFB2D3BA4AF8EC1040C8A0C0@ME2PR01MB2532.ausprd01.prod.outlook.com
Lists:	pgsql-general

Thanks for your comments.
My replies are inserted inline below.

________________________________
From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
Sent: October 5, 2020 8:41
To: parker(dot)han(at)outlook(dot)com <parker(dot)han(at)outlook(dot)com>
Cc: pgsql-general(at)postgresql(dot)org <pgsql-general(at)postgresql(dot)org>
Subject: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?

> Hi,
>
> Could "GB18030 server side support" be reconsidered, now that about 15 years have passed since the release of GB18030-2005?
> It may be one of the greenest features for PostgreSQL.

Moving GB18030 to a server-side encoding poses a technical challenge:
currently PostgreSQL's SQL parser, and perhaps other parts of the
backend, assume that no byte of a multibyte character in string data
can be mistaken for an ASCII byte. Since GB18030's second and fourth
bytes can fall in the ASCII range (for example 0x40 to 0x7e), the
backend will be confused. How exactly would you resolve this
technical challenge?
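
The byte-range concern above can be illustrated with a minimal Python sketch. It relies only on Python's built-in `gb18030` codec; the helper name `ascii_range_trail_bytes` is hypothetical and has nothing to do with PostgreSQL's actual parser code. It lists every non-leading byte of a multibyte character that falls below 0x80, which is exactly the kind of byte a byte-oriented scanner could mistake for ASCII.

```python
def ascii_range_trail_bytes(text, encoding="gb18030"):
    """Return (char, offset, byte) for each non-leading byte below 0x80."""
    hits = []
    for ch in text:
        encoded = ch.encode(encoding)
        # Offset 0 is the lead byte; only the later bytes of a multibyte
        # sequence can be misread as standalone ASCII characters.
        for offset, byte in enumerate(encoded[1:], start=1):
            if byte < 0x80:
                hits.append((ch, offset, byte))
    return hits


# Two-byte sequence whose trail byte is 0x5C, the ASCII backslash.
two_byte = bytes([0x81, 0x5C]).decode("gb18030")
# Four-byte sequence whose second and fourth bytes are 0x30, the ASCII digit '0'.
four_byte = bytes([0x81, 0x30, 0x81, 0x30]).decode("gb18030")

print(ascii_range_trail_bytes(two_byte))
print(ascii_range_trail_bytes(four_byte))
# The same characters in UTF-8 have no trail bytes in the ASCII range.
print(ascii_range_trail_bytes(two_byte + four_byte, encoding="utf-8"))
```

In UTF-8 every byte of a multibyte sequence has the high bit set, so the last call returns an empty list; that property is what the backend currently relies on.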

--Parker:
I do not have a concrete proposal yet.
Investigating how MySQL handles this might help.

> 1. In this era of big data and mobile computing, in the most populous country, Chinese characters consume 50% more disk space and energy in UTF-8 (usually 3 bytes per Chinese character) than in GB18030 (only 2 bytes), which indeed harms "carbon neutrality" and contributes to polar ice melting.

Really? I thought GB18030 uses up to 4 bytes.
https://en.wikipedia.org/wiki/GB_18030#Encoding

--Parker:
More precisely, GB18030 uses 2 or 4 bytes for Chinese characters.
It is a bit complicated to explain in words alone, but easy with the help of the following graph.

The 20,902 most frequently used Chinese characters and 984 symbols in GBK, which is a subset of GB18030, are encoded with 2 bytes.

The characters and symbols newly added in GB18030, which are less frequent but still in use, take 4 bytes.
[Inline image: graph of GB18030 encoding ranges]
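
The 2-byte versus 4-byte split described above can be checked with a short Python sketch using the built-in `utf-8` and `gb18030` codecs. The helper name `encoded_sizes` and the sample characters are illustrative choices, not part of the original message: "中" (U+4E2D) stands in for the common GBK repertoire, and U+20000 for a character outside it.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of `text` under both encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "gb18030": len(text.encode("gb18030")),
    }


common = "中"          # U+4E2D, in the 2-byte GBK subset of GB18030
rare = "\U00020000"    # U+20000, a supplementary-plane character

for label, sample in (("common", common), ("rare", rare)):
    sizes = encoded_sizes(sample)
    print(label, f"U+{ord(sample):04X}", sizes)
```

For text made up of common characters, the 3-byte UTF-8 form is 50% larger than the 2-byte GB18030 form; for characters that need 4 bytes in GB18030, that saving disappears.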

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

In response to

Re: May "PostgreSQL server side GB18030 character set support" reconsidered? at 2020-10-05 08:41:09 from Tatsuo Ishii

Responses

Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered? at 2020-10-05 12:17:48 from Tatsuo Ishii
Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered? at 2020-10-05 14:30:34 from Tom Lane
