From: | "egashira(dot)yusuke(at)fujitsu(dot)com" <egashira(dot)yusuke(at)fujitsu(dot)com> |
---|---|
To: | 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>, "tgl(at)sss(dot)pgh(dot)pa(dot)us" <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org> |
Subject: | RE: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows |
Date: | 2022-09-09 12:22:33 |
Message-ID: | TYWPR01MB72020716C9B9B1EBF406C9ADFF439@TYWPR01MB7202.jpnprd01.prod.outlook.com |
Lists: | pgsql-bugs |
Hi,
Thank you for replying.
I understand that the difference between these conversion rules is not a bug.
Unfortunately, given the many variations of SJIS, we also realized that
matching the conversion table to Microsoft's "recommended" rule would
likely cause other problems.
I have been looking for the original documents, such as kb170559, whose
URLs Microsoft changed or which it stopped publishing, but I could not
find them either.
Therefore, I agree with you that PostgreSQL should not suddenly change
its conversion rules.
However, I still think it is a problem that PostgreSQL returns some
characters that cannot be used in some Windows environments.
Would it be a reasonable solution to have the user create a map file with
modified conversion rules and add the conversion with CREATE CONVERSION?
> AFAIK generally Shift_jis and CP932 have different character sets. I
> don't know about .Net but doesn't CP932 work in that case?
> Specifically, "Encoding.GetEncoding(932)". There must a way to deal
> with that characters since they are in CP932.
Unfortunately, "shift_jis" is the name of "CP932" in .NET[1], so the same
exception occurs for "Encoding.GetEncoding(932)".
> FYI, the following range of SJIS character codes are *excluded* while
> unicode->sjis conversion. They are not only NEC/IBM extension
> characters.
>
> ed40 - eefc : so-called "NEC extension"
> uses fa40 - fc40 (IBM extension) instead.
> 8754 - 875d : numbers with circle, and upper roman numbers
> uses fa4a - fa53 instead.
> 878a, 8782, 8784, fa5b, fa54: some japanese combined characters "No." "(株)"...
> uses fa58, fa59, fa5a, 81e6, 879a, 81ca
> 8790 - 8792 : math symbols, uses 81e0, 81df, 81e7
> 8795 - 8797 : ditto, 81e3, 81db, 81da
> 879a - 879c : ditto, 879a, 81bf, 81be
Yes, I understand that these exclusion rules describe how PostgreSQL
converts the duplicated characters in SJIS. As far as I understand, the
SJIS ranges that contain duplicated characters are as follows[2].
- NEC special characters (Row 13) : 8740 - 879c
- NEC selected-IBM extended characters (Row 89 - 92) : ed40 - eefc
- IBM selected characters (Row 115 - 119) : fa40 - fc4b
PostgreSQL's exclusion rules seem to match Microsoft's recommended rules,
except for the rules for "NEC special characters (Row 13)" and
"IBM selected characters (Row 115 - 119)".
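
As a rough illustration of the three ranges listed above, here is a minimal
Python sketch that reports which range a 2-byte SJIS code falls into. The
range boundaries are copied from the list; the names
DUPLICATE_RELATED_RANGES and classify_sjis_code are made up for this
example and are not part of PostgreSQL, Windows or .NET. It is only a
coarse numeric check and does not validate trail bytes.

```python
from typing import Optional

# Range boundaries copied from the list above. The labels and names here are
# illustrative assumptions, not identifiers from PostgreSQL or Windows.
DUPLICATE_RELATED_RANGES = [
    (0x8740, 0x879C, "NEC special characters (Row 13)"),
    (0xED40, 0xEEFC, "NEC selected-IBM extended characters (Row 89 - 92)"),
    (0xFA40, 0xFC4B, "IBM selected characters (Row 115 - 119)"),
]


def classify_sjis_code(code: int) -> Optional[str]:
    """Return the label of the range containing a 2-byte SJIS code, or None.

    This is a plain numeric comparison; it does not check that the code is
    actually assigned in CP932.
    """
    for low, high, label in DUPLICATE_RELATED_RANGES:
        if low <= code <= high:
            return label
    return None


if __name__ == "__main__":
    for code in (0x8790, 0xED40, 0xFA4A, 0x81E0):
        print(f"{code:04x}: {classify_sjis_code(code)}")
```

A code outside all three ranges, such as 81e0, yields None.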
> At Thu, 08 Sep 2022 22:58:31 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in
> > Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> writes:
> > > This is not a bug, but the designed behavior. But we could change that
> > > conversion table if a plausible reasoning is raised.
> >
> > Given how long our current behavior has stood, I think it'd have to be
> > a pretty convincing argument. As you say, there'd be some serious
> > compatibility costs to changing that behavior.
> >
> > IIUC, SJIS<->Unicode conversions have always been a squishy thing
> > because of inconsistencies between the various versions of "SJIS".
> > I'm not seeing a good reason we should regard Windows' behavior as
> > authoritative here.
> >
> > I'm not saying I can't be convinced, but "Microsoft does it that
> > way" isn't enough to convince me.
>
> Yeah, it is more or less I meant. And I suspect that the problem that
> his customers are complaining is not caused by our specific conversion
> table.
I think the behavior of Windows obscures the problem of differing
conversion rules. I have attached a text file containing the 2-byte
characters extracted from CP932.TXT[3]. When it is displayed with
notepad.exe or with the type command in cmd.exe, all characters are
displayed in readable form.
However, when the duplicate-definition characters displayed in Notepad
are saved to a file, they are implicitly converted to the
"Microsoft Recommended Code Points". So, my problem is probably
a corner case.
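
The same kind of implicit conversion can be reproduced outside Notepad. The
following is a minimal sketch that uses Python's built-in "cp932" codec as a
stand-in (an assumption for illustration; it is neither Notepad's nor
PostgreSQL's conversion table). Each pair of byte sequences below is defined
in CP932.TXT[3] as mapping to the same Unicode code point, so a
decode-then-encode round trip can return only one of the two. Which one is
chosen depends on the codec, so no particular output is claimed here.

```python
# Pairs of CP932 byte sequences that CP932.TXT maps to the same Unicode
# code point. The pairs are taken from the ranges discussed in this thread.
DUPLICATE_PAIRS = [
    (b"\x87\x90", b"\x81\xe0"),  # U+2252
    (b"\xed\x40", b"\xfa\x5c"),  # U+7E8A
    (b"\x87\x54", b"\xfa\x4a"),  # U+2160
]


def round_trip(raw: bytes) -> bytes:
    """Decode CP932 bytes to Unicode, then encode back with the same codec."""
    return raw.decode("cp932").encode("cp932")


if __name__ == "__main__":
    for first, second in DUPLICATE_PAIRS:
        char = first.decode("cp932")
        same = char == second.decode("cp932")
        print(f"U+{ord(char):04X}: both codes decode to it: {same}")
        # At most one of the two inputs can survive the round trip unchanged.
        print(f"  {first.hex()} -> {round_trip(first).hex()}")
        print(f"  {second.hex()} -> {round_trip(second).hex()}")
```

A file that contains the non-preferred code of a pair is therefore changed
by the round trip even though it looks identical on screen.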
My customer used a third-party text editor instead of Notepad and reported
that some duplicate-definition characters could not be displayed.
Npgsql uses client_encoding=utf8 by default; however, a customer wanted
to use client_encoding=sjis, and that is how the encoding problem came
to light.
Of course, both cases can be treated as issues with the third-party editor
or with .NET. However, I thought that this might be a bug because those
problems would not have occurred if PostgreSQL converted the characters
according to Microsoft's recommended conversion rules, and because the
reason for PostgreSQL's current conversion rules was not clear.
At the very least, if the reason for PostgreSQL's current conversion rules
were clear, it would help us explain them to users.
[1] https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=net-6.0#list-of-encodings
[2] https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)#Double-byte_character_differences
[3] http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
Regards.
Yusuke, Egashira
Attachment | Content-Type | Size |
---|---|---|
2byte_CP932_characters.txt | text/plain | 22.6 KB |
2byte_CP932_characters_notepaded.txt | text/plain | 22.6 KB |