Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: egashira(dot)yusuke(at)fujitsu(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
Date: 2022-09-09 02:42:16
Message-ID: 20220909.114216.2263659117945873025.horikyota.ntt@gmail.com
Lists: pgsql-bugs

This is not a bug but the designed behavior. That said, we could change
the conversion table if a convincing rationale is presented.

At Thu, 08 Sep 2022 11:33:17 +0000, PG Bug reporting form <noreply(at)postgresql(dot)org> wrote in
> SJIS(Windows-31J) has several defined characters that has the
> same glyph but a different code point for it. The SJIS conversion
> rules in PostgreSQL's client_encoding seem to be slightly different
> from the rules in the Windows OS.

PostgreSQL follows CP932, and no public standard defines the precedence
among duplicate characters. According to [2], the rule is published as
Microsoft's recommended convention.

> In some cases, it causes a bad thing for Windows users.
> For example, some text editors can't display these characters, and
> .NET applications raise exceptions when converting SJIS byte
> sequences to UTF16 (String type). This can happen when using Npgsql[1].
>
> .NET code:
> ----
> Encoding e = Encoding.GetEncoding("shift_jis",

AFAIK Shift_JIS and CP932 generally have different character sets. I
don't know .NET, but doesn't CP932 work in that case?
Specifically, "Encoding.GetEncoding(932)". There must be a way to deal
with those characters, since they are in CP932.
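
As a rough illustration of the Shift_JIS versus CP932 distinction, here
is a minimal Python sketch. It is Python rather than .NET, so it only
approximates what Encoding.GetEncoding would do, and it assumes Python's
built-in "shift_jis" and "cp932" codecs. The helper name try_decode is
hypothetical.

```python
def try_decode(raw, encoding):
    """Return the decoded string, or None if the codec rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xFA4A is one of the "IBM selected characters" code points from the report.
raw = bytes.fromhex("fa4a")

for encoding in ("shift_jis", "cp932"):
    result = try_decode(raw, encoding)
    if result is None:
        print(f"{encoding}: cannot decode {raw.hex()}")
    else:
        print(f"{encoding}: {raw.hex()} -> U+{ord(result):04X}")
```

The strict Shift_JIS codec has no entry for the extension area, while
the CP932 codec does, which is the kind of difference that could explain
an exception on the client side.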

> My customers have difficulty dealing with SJIS code in Windows
> applications because of this difference in conversion rules.
> They are migrating from Oracle and many of the applications are
> written for the SJIS environment.
>
> The rules for converting from Unicode to characters that are
> duplicated in SJIS seem to be as follows in Windows[2]:
>
> 1. If the character is in both JIS X 0208 and NEC special characters,
> use the code point of JIS X 0208.
> 2. If the character is in both NEC special characters and IBM selected
> characters, use the code point of NEC special characters.
> 3. If the character is in both IBM selected characters and
> NEC selected-IBM extended characters, use the code point of
> IBM selected characters.
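
The three quoted rules amount to a fixed priority order among the code
blocks. A minimal Python sketch of that ordering follows. The block
names, the function names, and the byte ranges are assumptions made for
illustration (anything outside the three extension ranges is simply
treated as JIS X 0208), not values taken from a specification.

```python
# Lower number = preferred when one character has several SJIS codes.
# The order follows the three quoted rules.
PRIORITY = {
    "jis_x_0208": 0,
    "nec_special": 1,
    "ibm_selected": 2,
    "nec_selected_ibm": 3,
}


def block_of(code):
    """Classify a two-byte SJIS code by (approximate, assumed) range."""
    if 0x8740 <= code <= 0x879C:
        return "nec_special"
    if 0xED40 <= code <= 0xEEFC:
        return "nec_selected_ibm"
    if 0xFA40 <= code <= 0xFCFC:
        return "ibm_selected"
    return "jis_x_0208"  # simplification: everything else


def preferred(candidates):
    """Pick the code the quoted rules would choose among duplicates."""
    return min(candidates, key=lambda code: PRIORITY[block_of(code)])


print(hex(preferred([0x8790, 0x81E0])))  # rule 1: 0x81e0
print(hex(preferred([0xFA4A, 0x8754])))  # rule 2: 0x8754
print(hex(preferred([0xED40, 0xFA5C])))  # rule 3: 0xfa5c
```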

Mmm. I can't reach the original Microsoft document that [2] points
to. Could you tell me an alternative URL? (Google didn't offer usable
info for kb170559 or something like that.)

> However, the rules for converting from Unicode to SJIS in PostgreSQL
> seem to differ from the above second rule.
> SJIS codepoints corresponding to the second rule are listed below:
> - "NEC special characters" : 0x8754 - 0x875D, 0x8782, 0x8784, 0x878A
> - "IBM selected characters": 0xFA4A - 0xFA53, 0xFA59, 0xFA5A, 0xFA58
>
> In src/backend/utils/mb/Unicode/UCS_to_SJIS.pl, @reject_sjis array
> defines the not used code points when converting Unicode to SJIS.
> According to the second rule above, the @reject_sjis array must contain
> "IBM selected characters", but it currently contains "NEC special
> characters".

Anyway, it is not a public standard, and that "rule" is at most a
recommendation. So it is not the case that we "must" change the
conversion table to follow the "rule".

FYI, the following ranges of SJIS character codes are *excluded* during
Unicode->SJIS conversion. They are not limited to NEC/IBM extension
characters.

ed40 - eefc : so-called "NEC extension"
              uses fa40 - fc40 (IBM extension) instead.
8754 - 875d : numbers with circle, and upper roman numbers
              uses fa4a - fa53 instead.
878a, 8782, 8784, fa5b, fa54: some Japanese combined characters "No." "(株)"...
              uses fa58, fa59, fa5a, 81e6, 879a, 81ca
8790 - 8792 : math symbols, uses 81e0, 81df, 81e7
8795 - 8797 : ditto, 81e3, 81db, 81da
879a - 879c : ditto, 879a, 81bf, 81be
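
To make the effect of such an exclusion list concrete, here is a minimal
Python sketch of inverting a many-to-one SJIS->Unicode table while
skipping rejected codes. It is not the actual logic of UCS_to_SJIS.pl;
the function name build_ucs_to_sjis and the two-entry table are
hypothetical, and the table holds only one pair of duplicates
(0x8754 and 0xFA4A, both U+2160).

```python
def build_ucs_to_sjis(sjis_to_ucs, reject_sjis):
    """Invert an SJIS->Unicode table, ignoring SJIS codes in reject_sjis."""
    ucs_to_sjis = {}
    for sjis, ucs in sorted(sjis_to_ucs.items()):
        if sjis in reject_sjis:
            continue
        # Keep the first surviving candidate for each Unicode code point.
        ucs_to_sjis.setdefault(ucs, sjis)
    return ucs_to_sjis


# Tiny excerpt: two SJIS codes for ROMAN NUMERAL ONE (U+2160).
SJIS_TO_UCS = {0x8754: 0x2160, 0xFA4A: 0x2160}

# Excluding 8754 - 875d, as in the list above.
current = build_ucs_to_sjis(SJIS_TO_UCS, set(range(0x8754, 0x875E)))
# Excluding fa4a - fa53 instead, as the report suggests.
proposed = build_ucs_to_sjis(SJIS_TO_UCS, set(range(0xFA4A, 0xFA54)))

print(hex(current[0x2160]))   # 0xfa4a
print(hex(proposed[0x2160]))  # 0x8754
```

Only the Unicode->SJIS direction changes between the two variants; the
SJIS->Unicode table is the same input in both cases.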

> The current PostgreSQL rules for converting duplicate definition characters
> seems to be introduced by 5735c4cf3d059914e2b9d294203aa06fb2c4ac75,
> back in 2001, but I could not be found reason for it in past mailing list
> logs.
> I think this conversion difference is a bug,
> but is it a rule with some clear reason?

I don't know of a clear reason for the current conversion, but the fact
that we have had no complaints about it for more than ten years is
itself a reason *not* to change the conversion table, because changing
those tables could cause problems elsewhere.

> [1] https://www.npgsql.org/
> [2] https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-cp932.html

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
