Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?

From: Han Parker <parker(dot)han(at)outlook(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>, "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?
Date: 2020-10-06 01:46:25
Message-ID: ME2PR01MB253270F46FFF2668F396BEEB8A0C0@ME2PR01MB2532.ausprd01.prod.outlook.com
Lists: pgsql-general

Thanks for your reply.

Regards,

Parker Han

________________________________
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Sent: October 5, 2020 14:30
To: Han Parker <parker(dot)han(at)outlook(dot)com>
Cc: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>; pgsql-general(at)postgresql(dot)org <pgsql-general(at)postgresql(dot)org>
Subject: Re: Re: May "PostgreSQL server side GB18030 character set support" reconsidered?

Han Parker <parker(dot)han(at)outlook(dot)com> writes:
> From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
>> Moving GB18030 to server side encoding requires a technical challenge:
>> currently PostgreSQL's SQL parser and perhaps in other parts of
>> backend assume that each byte in a string data is not confused with
>> ASCII byte. Since GB18030's second and fourth byte are in range of
>> 0x40 to 0x7e, backend will be confused. How do you resolve the
>> technical challenge exactly?

> I do not have an exact solution proposal yet.
> Maybe an investigation on MySQL's mechanism would be of help.

TBH, even if you came up with a complete patch, we'd probably
reject it as unmaintainable and a security hazard. The problem
is that code may scan a string looking for certain ASCII characters
such as backslash (\), which up to now it's always been able to do
byte-by-byte without fear that non-ASCII characters could confuse it.
To support GB18030 (or other encodings with the same issue, such as
SJIS), every such loop would have to be modified to advance character
by character, thus roughly "p += pg_mblen(p)" instead of "p++".
Anyplace that neglected to do that would have a bug --- one that
could only be exposed by careful testing using GB18030 encoding.
What's more, such bugs could easily be security problems.
Mis-detecting a backslash, for example, could lead to wrong decisions
about where string literals end, allowing SQL-injection exploits.

--From Parker:
Agreed that this would not be a cheap feature.
1. The pg_mblen() changes would carry a long-term maintenance cost on top of the short-term refactoring cost.
2. 0x5C (backslash) can appear as a trailing byte of a multi-byte character, which increases exposure to SQL-injection risk.
--

> The 20902 most frequently used Chinese characters and 984 symbols in GBK are encoded in 2 bytes, and GBK is a subset of GB18030.
> The newly added, less frequently but still genuinely used characters and symbols in GB18030 take 4 bytes.

Any efficiency argument has to consider processing costs not just
storage costs. As I showed above, catering for GB18030 would make
certain loops substantially slower, so that you might pay in CPU
cycles what you saved on disk space. It doesn't help any that the
extra processing costs would be paid by every Postgres user on the
planet, whether they used GB18030 or not.

--From Parker:
It depends on the use scenario, so leaving this decision to application developers via a GUC may be an option.
OLTP workloads may pay more in processing cost than they save in storage, while OLAP workloads may save more in storage than they pay in processing.
---

In short, I think this is very unlikely to happen.

regards, tom lane
