From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | amitlangote09(at)gmail(dot)com |
Cc: | ashu(dot)coek88(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: MINUS SIGN (U+2212) in EUC-JP encoding is mapped to FULLWIDTH HYPHEN-MINUS (U+FF0D) in UTF-8 |
Date: | 2020-10-30 07:56:38 |
Message-ID: | 20201030.165638.1664587537743852598.horikyota.ntt@gmail.com |
Lists: | pgsql-hackers |
At Fri, 30 Oct 2020 16:33:01 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At Fri, 30 Oct 2020 14:38:30 +0900, Amit Langote <amitlangote09(at)gmail(dot)com> wrote in
> I'm not sure how we should construct our won mapping, but the
> difference made by we simply moved to JIS0208.TXT based as Ishii-san
> suggested the differences in the mapping would be as the follows.
Mmm..
I'm not sure how we should construct our own mapping, but if we
simply moved to a JIS0208.TXT-based mapping as Ishii-san suggested,
the following differences would be seen in the mappings.
> 1. The following codes (regions) are not defined in JIS0208.
>
> 8ea1 - 8edf (up to 64 characters (I didn't actually counted them.))
> ada1 - adfc (up to 92 characters (ditto))
> 8ff3f3 - 8ff4a8 (up to 182 characters (ditto))
8ea1 - 8edf (64 chars, U+ff61 - U+ff9f) (hankaku-kana)
ada1 - adfc (83 chars, U+2460 - U+33a1) (circled numbers)
8ff3f3 - 8ff4a8 (20 chars, U+2160 - U+2179) (roman numerals)
> a1c0 ff3c: (ff3c: FULLWIDTH REVERSE SOLIDUS)
> 8ff4aa ff07: (ff07: FULLWIDTH APOSTROPHE)
>
> 2. some individual differences
>
> EUC 0208 932
> a1c1 301c ff5e: (301c:WAVE DASH)
> a1c2 2016 2225: (2016:DOUBLE_VERTICAL LINE) : (2225:PARALLEL TO)
> * a1dd 2212 ff0d: (2212: MINUS_SIGN) : (ff0d: FULLWIDTH HYPHEN-MINUS)
> d1f1 a2 ffe0: (00a2: CENT SIGN) : (ffe0: FULLWIDTH CENT SIGN)
> d1f2 a3 ffe1: (00a3: POUND SIGN) : (ffe1: FULLWIDTH POUND SIGN)
> a2cc ac ffe2: (00ac: NOT SIGN) : (ffe2: FULLWIDTH NOT SIGN)
>
>
> *1: https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT
>
> > > > Please note that the byte sequence (81-7c) in SJIS represents MINUS
> > > > SIGN in SJIS which means the MINUS SIGN in UTF8 got converted to the
> > > > MINUS SIGN in SJIS and that is what we expect. Isn't it?
> > >
> > > I think we don't change authoritative mappings, but maybe can add some
> > > one-way conversions for the convenience.
> >
> > Maybe UCS_TO_EUC_JP.pl could do something like the above.
> >
> > Are there other cases that were fixed like this in the past, either
> > for euc_jp or sjis?
>
> Honestly, I don't know how the mapping was decided in 2002, but
> removing the regions in 1 would cause confusion. So what we can do in
> this area would be changing some of 2 to the 0208 mapping. But an arbitrary
> mixture of different mappings would cause new problems.
I forgot to mention adding one-way mappings. I think we can add
several such mappings, say:
U+301c->: EUC:a1c1 <-> U+ff5e
U+2016->: EUC:a1c2 <-> U+2225
U+2212->: EUC:a1dd <-> U+ff0d
U+00a2->: EUC:d1f1 <-> U+ffe0
U+00a3->: EUC:d1f2 <-> U+ffe1
U+00ac->: EUC:a2cc <-> U+ffe2
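
To illustrate the idea, here is a minimal Python sketch of a round-trip table
extended with one-way (UTF-8 to EUC-JP only) entries. It is not how
UCS_TO_EUC_JP.pl or the generated map files actually work; the names
ROUND_TRIP, ONE_WAY_TO_EUC, build_ucs_to_euc, euc_to_ucs and ucs_to_euc are
hypothetical, and only a subset of the pairs above is used (with WAVE DASH
taken as U+301C, as in the comparison table).

```python
# Round-trip pairs: EUC-JP code -> Unicode code point (existing behaviour).
ROUND_TRIP = {
    0xA1C1: 0xFF5E,  # FULLWIDTH TILDE
    0xA1C2: 0x2225,  # PARALLEL TO
    0xA1DD: 0xFF0D,  # FULLWIDTH HYPHEN-MINUS
    0xA2CC: 0xFFE2,  # FULLWIDTH NOT SIGN
}

# Proposed one-way additions: Unicode code point -> EUC-JP code only.
ONE_WAY_TO_EUC = {
    0x301C: 0xA1C1,  # WAVE DASH
    0x2016: 0xA1C2,  # DOUBLE VERTICAL LINE
    0x2212: 0xA1DD,  # MINUS SIGN
    0x00AC: 0xA2CC,  # NOT SIGN
}


def build_ucs_to_euc(round_trip, one_way):
    """Invert the round-trip table, then add the one-way entries.

    Existing round-trip entries are never overridden, so the
    EUC-JP -> Unicode direction stays unchanged.
    """
    table = {ucs: euc for euc, ucs in round_trip.items()}
    for ucs, euc in one_way.items():
        table.setdefault(ucs, euc)
    return table


UCS_TO_EUC = build_ucs_to_euc(ROUND_TRIP, ONE_WAY_TO_EUC)


def euc_to_ucs(code):
    """EUC-JP -> Unicode uses only the round-trip table."""
    return ROUND_TRIP[code]


def ucs_to_euc(cp):
    """Unicode -> EUC-JP accepts both round-trip and one-way sources."""
    return UCS_TO_EUC[cp]
```

With this arrangement both U+2212 and U+ff0d convert to EUC a1dd, while a1dd
still converts back to U+ff0d only, so data converted through EUC-JP does not
round-trip to U+2212.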
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center