
BUG #17611: SJIS conversion rule about duplicated characters differ from Windows

From:	PG Bug reporting form <noreply(at)postgresql(dot)org>
To:	pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc:	egashira(dot)yusuke(at)fujitsu(dot)com
Subject:	BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
Date:	2022-09-08 11:33:17
Message-ID:	17611-472d27cf395135b7@postgresql.org
Lists:	pgsql-bugs

The following bug has been logged on the website:

Bug reference: 17611
Logged by: yusuke egashira
Email address: egashira(dot)yusuke(at)fujitsu(dot)com
PostgreSQL version: 12.11
Operating system: RHEL7(Server) and Windows10(Client)
Description:

SJIS (Windows-31J) defines several characters that share the same
glyph but have different code points. The SJIS conversion rules used
for PostgreSQL's client_encoding seem to differ slightly from the
rules in the Windows OS.

In some cases this causes problems for Windows users.
For example, some text editors can't display these characters, and
.NET applications raise exceptions when converting SJIS byte
sequences to UTF-16 (String type). This can happen when using Npgsql[1].

.NET code:
----
using System.Text;

// Strict fallbacks: throw instead of substituting a replacement character.
Encoding e = Encoding.GetEncoding("shift_jis",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);
var utfString = e.GetString(sjis_byte_sequence);
----

Exception:
----
Exception thrown: 'System.Text.DecoderFallbackException' in mscorlib.dll
An unhandled exception of type 'System.Text.DecoderFallbackException'
occurred in mscorlib.dll
Unable to translate bytes [FA][4A] at index 1632 from specified code page to
Unicode.
----

My customers have difficulty dealing with SJIS code in Windows
applications because of this difference in conversion rules.
They are migrating from Oracle and many of the applications are
written for the SJIS environment.

The rules for converting from Unicode to characters that are
duplicated in SJIS seem to be as follows in Windows[2]:

1. If the character is in both JIS X 0208 and NEC special characters,
use the code point of JIS X 0208.
2. If the character is in both NEC special characters and IBM selected
characters, use the code point of NEC special characters.
3. If the character is in both IBM selected characters and
NEC selected-IBM extended characters, use the code point of
IBM selected characters.
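
The three rules amount to a fixed priority order among the four Windows-31J regions. A minimal Python sketch of that selection follows. The function names are hypothetical, and classifying a code point by its lead byte (0x87 for NEC special characters, 0xED/0xEE for NEC selected-IBM extended characters, 0xFA-0xFC for IBM selected characters) is my assumption about the CP932 layout, not something stated in this report.

```python
# Priority of the four Windows-31J regions when one Unicode character has
# several SJIS code points (lower number wins), following rules 1-3 above.
PRIORITY = {
    "JIS X 0208": 0,
    "NEC special": 1,
    "IBM selected": 2,
    "NEC selected-IBM extended": 3,
}


def region(code: int) -> str:
    """Classify a two-byte SJIS code by its lead byte (assumed CP932 layout)."""
    lead = code >> 8
    if lead == 0x87:
        return "NEC special"
    if lead in (0xED, 0xEE):
        return "NEC selected-IBM extended"
    if 0xFA <= lead <= 0xFC:
        return "IBM selected"
    return "JIS X 0208"


def preferred(candidates):
    """Pick the code point with the highest-priority region among duplicates."""
    return min(candidates, key=lambda code: PRIORITY[region(code)])


# U+2160 (ROMAN NUMERAL ONE) is defined at both 0x8754 and 0xFA4A.
print(hex(preferred([0xFA4A, 0x8754])))  # 0x8754
```

The order of the candidates does not matter; only the region of each code point decides the result.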

However, the rules for converting from Unicode to SJIS in PostgreSQL
seem to differ with respect to the second rule above.
The SJIS code points affected by the second rule are listed below:
- "NEC special characters" : 0x8754 - 0x875D, 0x8782, 0x8784, 0x878A
- "IBM selected characters": 0xFA4A - 0xFA53, 0xFA59, 0xFA5A, 0xFA58
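
The two lists pair up position by position (0x8754 with 0xFA4A, and so on up to 0x878A with 0xFA58). As a rough check of the Windows-side behaviour, the sketch below round-trips each IBM code through Python's built-in cp932 codec. Treating that codec as equivalent to the Windows conversion is an assumption on my part, and the helper name is hypothetical.

```python
# Pairs of (NEC special, IBM selected) code points, in the order listed above.
nec = list(range(0x8754, 0x875D + 1)) + [0x8782, 0x8784, 0x878A]
ibm = list(range(0xFA4A, 0xFA53 + 1)) + [0xFA59, 0xFA5A, 0xFA58]
pairs = list(zip(nec, ibm))


def roundtrip(code: int) -> int:
    """Decode a two-byte code with cp932, re-encode it, return the new code."""
    text = code.to_bytes(2, "big").decode("cp932")
    return int.from_bytes(text.encode("cp932"), "big")


for nec_code, ibm_code in pairs:
    # Under rule 2, the IBM code should come back as its NEC counterpart.
    print(f"{ibm_code:#06x} -> {roundtrip(ibm_code):#06x} (NEC: {nec_code:#06x})")
```

If the codec follows rule 2, every IBM code in the left column comes back as the NEC code shown in parentheses.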

In src/backend/utils/mb/Unicode/UCS_to_SJIS.pl, the @reject_sjis array
defines the code points that are not used when converting Unicode to SJIS.
According to the second rule above, the @reject_sjis array should contain
the "IBM selected characters", but it currently contains the "NEC special
characters".
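
To illustrate why the choice of rejected code points matters, here is a minimal Python sketch of building a Unicode-to-SJIS table from SJIS-to-Unicode pairs while skipping a reject set. It is not the actual logic of UCS_to_SJIS.pl; the function name, the two-entry sample table, and the variable names are hypothetical.

```python
def build_unicode_to_sjis(sjis_to_unicode, reject):
    """Invert a SJIS->Unicode table, dropping every rejected SJIS code."""
    result = {}
    for sjis, ucs in sjis_to_unicode.items():
        if sjis in reject:
            continue  # this code point is never produced when encoding
        result[ucs] = sjis
    return result


# Sample: U+2160 (ROMAN NUMERAL ONE) exists at both code points.
sample = {0x8754: 0x2160, 0xFA4A: 0x2160}

reject_nec = {0x8754}  # the kind of entry the report says the array holds today
reject_ibm = {0xFA4A}  # the kind of entry rule 2 would call for

print(hex(build_unicode_to_sjis(sample, reject_nec)[0x2160]))  # 0xfa4a
print(hex(build_unicode_to_sjis(sample, reject_ibm)[0x2160]))  # 0x8754
```

Decoding is unaffected in this sketch: both SJIS codes still map to U+2160, and only the encoding direction changes with the reject set.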

The current PostgreSQL rules for converting characters with duplicate
definitions seem to have been introduced by
5735c4cf3d059914e2b9d294203aa06fb2c4ac75, back in 2001, but I could not
find the reason for them in past mailing list logs.
I think this conversion difference is a bug,
but is there a clear reason for the current rule?

[1] https://www.npgsql.org/
[2] https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-cp932.html

Responses

Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows at 2022-09-09 02:42:16 from Kyotaro Horiguchi
