Re: Errors in our encoding conversion tables

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors in our encoding conversion tables
Date: 2015-11-27 04:30:53
Message-ID: 25721.1448598653@sss.pgh.pa.us
Lists: pgsql-hackers

Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
> I have started looking into it. I wonder how you created this part
> of your patch:

The code I used is below.

> In the above you seem to disable the conversion from 0x96 of win1250
> to ISO-8859-2 by using the Unicode mapping files in
> src/backend/utils/mb/Unicode. But the corresponding mapping file
> (iso8859_2_to_utf8.map) does include the following entry:

> {0x0096, 0xc296},

> How do you know 0x96 should be removed from the conversion?

Right, but there is no mapping in the win1250-utf8 files that matches
0xC296 (the UTF-8 byte sequence for U+0096, a C1 control character).
The complaint over in the other thread is precisely that we have no
business translating 0x96 in WIN1250 to this character. What WIN1250
0x96 could translate to is 0xE28093, i.e. U+2013 EN DASH (at least,
according to win1250_to_utf8.map), but that Unicode character has no
equivalent in LATIN2.
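
To make the byte values concrete: the map files store the UTF-8 byte
sequence packed into an integer, not the code point, so it can help to
decode 0xc296 and 0xe28093 and see which characters are actually meant.
The following is only a rough sketch in Python; the helper name
utf8_value_to_codepoint is made up for illustration and is not part of
buildmaps.c or the PostgreSQL sources.

```python
def utf8_value_to_codepoint(value: int) -> int:
    """Decode a UTF-8 byte sequence packed into an integer (e.g. 0xc296).

    Hypothetical helper for illustration; it assumes the integer holds
    exactly one UTF-8 encoded character, most significant byte first.
    """
    raw = value.to_bytes((value.bit_length() + 7) // 8, "big")
    text = raw.decode("utf-8")
    if len(text) != 1:
        raise ValueError(f"0x{value:x} is not a single UTF-8 character")
    return ord(text)


# The LATIN2 map entry {0x0096, 0xc296} and the WIN1250 target 0xe28093.
latin2_target = utf8_value_to_codepoint(0xC296)
win1250_target = utf8_value_to_codepoint(0xE28093)

print(f"LATIN2  0x96 -> U+{latin2_target:04X}")
print(f"WIN1250 0x96 -> U+{win1250_target:04X}")
```

The two results differ, which is the point: the same byte 0x96 names
two unrelated Unicode characters in the two encodings.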

AFAICS, whoever made these tables just arbitrarily decided that 0x96
in WIN1250 could be mapped to 0x96 in LATIN2, and likewise for a number
of other codes; but those are false equivalences, as you find out if
you try to perform the same conversion via other encoding conversion
paths, i.e., convert to UTF8 and then to the other encoding.
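
The two-step check described above can be sketched roughly as follows.
This is not the attached buildmaps.c; it is an illustration in Python
that uses Python's built-in cp1250 and iso8859_2 codecs as stand-ins
for our map files, on the assumption (not verified for every code)
that they agree. The function names are hypothetical.

```python
def convert_via_unicode(byte_value, src, dst):
    """Convert one byte from src to dst by going through Unicode.

    Returns the destination byte, or None if the byte is undefined in
    src or the character has no single-byte equivalent in dst.
    """
    try:
        char = bytes([byte_value]).decode(src)
        out = char.encode(dst)
    except UnicodeError:
        return None
    return out[0] if len(out) == 1 else None


def false_equivalences(src, dst, codes):
    """List codes whose identity mapping is not backed by the Unicode path."""
    return [c for c in codes if convert_via_unicode(c, src, dst) != c]


# 0x96 has no route from WIN1250 to LATIN2 through Unicode, or back.
print(convert_via_unicode(0x96, "cp1250", "iso8859_2"))
print(convert_via_unicode(0x96, "iso8859_2", "cp1250"))

# Codes in 0x80..0x9F that a direct table could not map to themselves.
suspects = false_equivalences("cp1250", "iso8859_2", range(0x80, 0xA0))
print([hex(c) for c in suspects])
```

A code such as 0xE1, which is the same letter in both encodings,
passes the check; 0x96 does not, so a direct table that maps it to
itself disagrees with the conversion through UTF8.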

regards, tom lane

Attachment Content-Type Size
buildmaps.c text/x-c 2.5 KB
