Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows

From:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To:	egashira(dot)yusuke(at)fujitsu(dot)com
Cc:	tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject:	Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
Date:	2022-09-13 03:47:44
Message-ID:	20220913.124744.990154441593340559.horikyota.ntt@gmail.com
Lists:	pgsql-bugs

At Fri, 9 Sep 2022 12:22:33 +0000, "egashira(dot)yusuke(at)fujitsu(dot)com" <egashira(dot)yusuke(at)fujitsu(dot)com> wrote in
> However, I still think it is problem that PostgreSQL returns some characters
> which not able to be used in some Windows environment.
> Would it be a reasonable solution to this problem to have the user create
> a map file with the conversion rules changed and add the conversion by
> CREATE CONVERSION ?

The best option nowadays would be to move the entire system to Unicode.
Failing that, wouldn't it work to have the .NET application convert
UTF-8 into SJIS locally?

> > AFAIK generally Shift_jis and CP932 have different character sets. I
> > don't know about .Net but doesn't CP932 work in that case?
> > Specifically, "Encoding.GetEncoding(932)". There must a way to deal
> > with that characters since they are in CP932.
>
> Unfortunately, "shift_jis" is the name of "CP932" in .NET[1], so the same
> exception occurs for "Encoding.GetEncoding(932)".

Wow. MS uses shift_jis as a mere alias for its own variant of CP932 [1]
(MS932?). It's neither Shift_JIS nor even CP932 (at least in the
decoding direction).
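
To illustrate that Shift_JIS and CP932 are different character sets,
here is a minimal sketch using Python's codecs as stand-ins. This says
nothing about how .NET or PostgreSQL behave; the helper name try_decode
is made up for the example. The byte pair 0xFA5C lies in the IBM
extension area of CP932 (U+7E8A in CP932.TXT) and is outside plain
Shift_JIS.

```python
def try_decode(data: bytes, codec: str):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


# 0xFA5C: IBM extension area of CP932, not part of JIS X 0208.
ibm_ext = b"\xfa\x5c"

for codec in ("cp932", "shift_jis"):
    result = try_decode(ibm_ext, codec)
    status = "rejected" if result is None else f"U+{ord(result):04X}"
    print(f"{codec}: {status}")
```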

> I think the behavior of Windows obscures the problem of different
> conversion rules. I attached the text file extracted 2-byte characters from
> CP932.TXT[3]. When this is displayed using notepad.exe or
> type command on cmd.exe, all characters are displayed in readable form.
> However, when we save the duplicate definition characters displayed
> on notepad to a file, they are implicitly converted to
> "Microsoft Recommended Code Points". So, my problem is probably
> a corner case.

.NET seems less robust than notepad.exe. I couldn't find a way to
create a custom encoding on .NET Framework. (But I don't think that
is the way to go anyway.)

> My customer used a third-party text editor instead of notepad and claimed
> that some duplicate definition characters could not be displayed.
> Npgsql works with the client_encoding=utf8 setting by default, however,
> there was a customer who wanted to use client_encoding=sjis, and
> the encoding problem came to light.

Ah. That's seldom seen nowadays, especially when it involves level-3 or
even more rarely used characters. Anyway, I don't see why utf8 couldn't
be used as the wire encoding. If the combination of the .NET Unicode
decoder and its sjis encoder works, wouldn't the problem go away? I
believe the .NET sjis encoder should yield the desired result.
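
The idea, sketched in Python rather than .NET (so only the shape of the
conversion carries over, not the exact bytes a given platform emits):
keep client_encoding at utf8, decode on the client, and encode to SJIS
only at the edge where it is really needed. The function name
utf8_wire_to_sjis is hypothetical.

```python
def utf8_wire_to_sjis(payload: bytes, target: str = "cp932") -> bytes:
    """Decode UTF-8 bytes received from the server and re-encode locally.

    Which of several duplicate code points the encoder picks is decided
    by the local encoder, not by the server-side conversion.
    Raises UnicodeEncodeError if a character has no mapping in `target`.
    """
    text = payload.decode("utf-8")
    return text.encode(target)


wire = "日本語".encode("utf-8")   # what the server would send as utf8
print(utf8_wire_to_sjis(wire).hex(" "))
```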

Google showed me some complaints about Npgsql and SJIS, but many of
them stem from the default encoding not being utf8 at that time, and
some of them are about error messages garbled by arriving in an
unexpected encoding.

> Of course, both cases can be treat as third-party editors or .NET issues.
> However, I thought that this might be a bug because those problems would
> not have occurred if PostgreSQL convert the characters via the Microsoft's
> recommended conversion rules, and the reason of the current PostgreSQL
> conversion rules was not clear.
> At least if the reason of the PostgreSQL's current conversion rule is clear,
> it will help us to explain to the users.

The cause of the trouble is that .NET's particular implementation of the
CP932 decoder does not actually follow CP932; it rejects some valid
characters. The editor does the same. So the correct measure for this
situation seems to be to convert the text (in SJIS or UTF-8, as
mentioned above) into the special encoding that follows MS's
recommendation, which is no longer available for some reason.
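
A rough sketch of such a conversion on SJIS bytes, in Python. The
three-entry table is illustrative only and reflects my understanding of
the recommended code points (JIS X 0208 over NEC row 13, NEC row 13 over
the IBM extension, the IBM extension over the NEC-selected IBM
extension); a real table would have to be built from the full list, and
the names PREFERRED and normalize_duplicates are made up here.

```python
# Illustrative subset only; verify against the full recommended list.
PREFERRED = {
    b"\x87\x90": b"\x81\xe0",  # U+2252: NEC row 13 -> JIS X 0208
    b"\xfa\x4a": b"\x87\x54",  # U+2160: IBM extension -> NEC row 13
    b"\xed\x40": b"\xfa\x5c",  # U+7E8A: NEC-selected IBM ext -> IBM ext
}


def is_lead_byte(b: int) -> bool:
    """True if b starts a two-byte character in CP932."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def normalize_duplicates(data: bytes) -> bytes:
    """Walk the byte string character by character and remap duplicates.

    Walking (instead of a blind bytes.replace) avoids matching a pair
    that straddles two characters.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]) and i + 1 < len(data):
            pair = data[i:i + 2]
            out += PREFERRED.get(pair, pair)
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


print(normalize_duplicates(b"A\x87\x90B").hex(" "))
```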

As mentioned before, one of the reasons for PostgreSQL's current
SJIS mapping is that the precedence among duplicate characters is
not defined in the standard; in other words, it is implementation
dependent. Thus it is valid to define the mapping arbitrarily as long
as it covers all characters. It was decided more than a decade ago, so
I don't know the principle behind the mapping. But it seems to give
precedence to characters in the IBM extension area.
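
To make "duplicate characters" concrete, the following sketch lists the
code points that more than one two-byte sequence decodes to. It uses
Python's cp932 codec as an assumed stand-in for the CP932 table, so it
shows Python's tables and not PostgreSQL's; find_duplicates is a
hypothetical helper. Any encoder has to pick one sequence per such code
point, and that choice is the implementation-dependent part.

```python
from collections import defaultdict


def find_duplicates(codec: str = "cp932") -> dict:
    """Map each character to its byte sequences, keeping only duplicates."""
    by_char = defaultdict(list)
    leads = [*range(0x81, 0xA0), *range(0xE0, 0xFD)]
    trails = [*range(0x40, 0x7F), *range(0x80, 0xFD)]
    for lead in leads:
        for trail in trails:
            raw = bytes([lead, trail])
            try:
                ch = raw.decode(codec)
            except UnicodeDecodeError:
                continue  # unassigned in this codec's table
            if len(ch) == 1:
                by_char[ch].append(raw)
    return {ch: seqs for ch, seqs in by_char.items() if len(seqs) > 1}


dups = find_duplicates()
for ch, seqs in sorted(dups.items())[:5]:
    chosen = ch.encode("cp932")  # the one this particular encoder prefers
    print(f"U+{ord(ch):04X}", [s.hex() for s in seqs], "->", chosen.hex())
```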

The reason we don't change it is that it is now sufficiently legacy, and
that there were no complaints before it became legacy despite (I
believe) a certain number of use cases. Its being legacy suggests there
may be use cases where that conversion is expected.

[1] (Japanese doc) https://docs.microsoft.com/ja-jp/dotnet/api/system.text.encoding?view=net-6.0

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
