From: | Amit Langote <amitlangote09(at)gmail(dot)com> |
---|---|
To: | Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8 |
Date: | 2020-10-30 03:08:51 |
Message-ID: | CA+HiwqGwTDgFBjzVo+TjQMCBWfs-NCZi_FXmgZPypwfMgiE9OQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Oct 30, 2020 at 9:44 AM Ashutosh Sharma <ashu(dot)coek88(at)gmail(dot)com> wrote:
>
> Hi All,
>
> Today while working on some other task related to database encoding, I
> noticed that the MINUS SIGN (with byte sequence a1-dd) in EUC-JP is
> mapped to FULLWIDTH HYPHEN-MINUS (with byte sequence ef-bc-8d) in
> UTF-8. See below:
>
> postgres=# select convert('\xa1dd', 'euc_jp', 'utf8');
> convert
> ----------
> \xefbc8d
> (1 row)
>
> Isn't this a bug? Shouldn't this have been converted to the MINUS SIGN
> (with byte sequence e2-88-92) in UTF-8 instead of FULLWIDTH
> HYPHEN-MINUS SIGN.
>
> When the MINUS SIGN (with byte sequence e2-88-92) in UTF-8 is
> converted to EUC-JP, the convert functions fails with an error saying:
> "character with byte sequence 0xe2 0x88 0x92 in encoding UTF8 has no
> equivalent in encoding EUC_JP". See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'euc_jp');
> ERROR: character with byte sequence 0xe2 0x88 0x92 in encoding "UTF8"
> has no equivalent in encoding "EUC_JP"
>
> However, when the same MINUS SIGN in UTF-8 is converted to SJIS
> encoding, the convert function returns the correct result. See below:
>
> postgres=# select convert('\xe28892', 'utf-8', 'sjis');
> convert
> ---------
> \x817c
> (1 row)
>
> Please note that the byte sequence (81-7c) in SJIS represents MINUS
> SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> MINUS SIGN in SJIS and that is what we expect. Isn't it?
So we have
a1dd in euc_jp,
817c in sjis,
efbc8d in utf-8
that convert between each other just fine.
But when it comes to
e28892 in utf-8
it currently only converts to sjis and that too just one way:
select convert('\xe28892', 'utf-8', 'sjis');
convert
---------
\x817c
(1 row)
select convert('\x817c', 'sjis', 'utf-8');
convert
----------
\xefbc8d
(1 row)
I noticed that the commit a8bd7e1c6e02 from ages ago removed
conversions from and to utf-8's e28892, in favor of efbc8d, and that
change has stuck. (Note though that these maps looked pretty
different back then.)
--- a/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
+++ b/src/backend/utils/mb/Unicode/euc_jp_to_utf8.map
- {0xa1dd, 0xe28892},
+ {0xa1dd, 0xefbc8d},
--- a/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_euc_jp.map
- {0xe28892, 0xa1dd},
+ {0xefbc8d, 0xa1dd},
Can't tell what reason there was to do that, but there must have been
some. Maybe the Japanese character sets prefer full-width hyphen
minus (unicode U+FF0D) over mathematical minus sign (U+2212)?
--
Amit Langote
EDB: http://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Kyotaro Horiguchi | 2020-10-30 03:19:50 | Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8 |
Previous Message | Fujii Masao | 2020-10-30 03:00:27 | Re: Add statistics to pg_stat_wal view for wal related parameter tuning |