From: Albe Laurenz <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "'Tom Lane *EXTERN*'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgreSQL(dot)org>
Cc: Tatsuo Ishii <ishii(at)postgreSQL(dot)org>
Subject: Re: Errors in our encoding conversion tables
Date: 2015-11-27 08:49:37
Message-ID: A737B7A37273E048B164557ADEF4A58B50FECB63@ntex2010i.host.magwien.gv.at
Lists: pgsql-hackers
Tom Lane wrote:
> There's a discussion over at
> http://www.postgresql.org/message-id/flat/2sa(dot)Dhu5(dot)1hk1yrpTNFy(dot)1MLOlb(at)seznam(dot)cz
> of an apparent error in our WIN1250 -> LATIN2 conversion. I looked into this
> and found that indeed, the code will happily translate certain characters
> for which there seems to be no justification. I made up a quick script
> that would recompute the conversion tables in latin2_and_win1250.c from
> the Unicode mapping files in src/backend/utils/mb/Unicode, and what it
> computes is shown in the attached diff. (Zeroes in the tables indicate
> codes with no translation, for which an error should be thrown.)
>
> Having done that, I thought it would be a good idea to see if we had any
> other conversion tables that weren't directly based on the Unicode data.
> The only ones I could find were in cyrillic_and_mic.c, and those seem to
> be absolutely filled with errors, to the point where I wonder if they were
> made from the claimed encodings or some other ones. The attached patch
> recomputes those from the Unicode data, too.
>
> None of this data seems to have been touched since Tatsuo-san's original
> commit 969e0246, so it looks like we simply didn't vet that submission
> closely enough.
>
> I have not attempted to reverify the files in utils/mb/Unicode against the
> original Unicode Consortium data, but maybe we ought to do that before
> taking any further steps here.
>
> Anyway, what are we going to do about this? I'm concerned that simply
> shoving in corrections may cause problems for users. Almost certainly,
> we should not back-patch this kind of change.
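The recomputation Tom describes amounts to composing each encoding's Unicode mapping file with the inverse of the other's. A toy sketch (not the actual script; the maps below contain only two illustrative entries, not the real data):

```python
# Hypothetical sketch of recomputing a single-byte conversion table
# from two encoding -> Unicode maps (as kept in src/backend/utils/mb/Unicode).
# Only the high half (0x80-0xFF) is remapped; 0 marks "no translation",
# for which the converter should throw an error.

def build_table(src_to_uni, dst_to_uni):
    """Build a src->dst byte table for bytes 0x80..0xFF."""
    uni_to_dst = {u: b for b, u in dst_to_uni.items()}  # invert the dst map
    table = []
    for byte in range(0x80, 0x100):
        u = src_to_uni.get(byte)
        # 0 = no mapping exists in the target encoding
        table.append(uni_to_dst.get(u, 0) if u is not None else 0)
    return table

# Toy excerpts of the maps (illustrative, not the full WIN1250/LATIN2 data)
win1250 = {0xA3: 0x0141, 0x96: 0x2013}   # 0xA3 = Ł, 0x96 = EN DASH in WIN1250
latin2  = {0xA3: 0x0141}                 # LATIN2 has Ł but no EN DASH
table = build_table(win1250, latin2)
assert table[0xA3 - 0x80] == 0xA3   # Ł converts cleanly
assert table[0x96 - 0x80] == 0      # EN DASH: no equivalent -> error
```

The 0x96 entry is exactly the byte from the error message quoted below: WIN1250's EN DASH has no LATIN2 equivalent, so the recomputed table correctly refuses to translate it.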
Thanks for picking this up.
I agree with your proposed fix; the only thing that makes me uncomfortable
is that you get error messages like:
ERROR: character with byte sequence 0x96 in encoding "WIN1250" has no equivalent in encoding "MULE_INTERNAL"
which are a bit misleading, since the user asked for LATIN2, not MULE_INTERNAL.
But the main thing is that no corrupt data can be entered.
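The message names MULE_INTERNAL because latin2_and_win1250.c converts via MULE internal code as an intermediate step, so an untranslatable byte fails on the first leg before LATIN2 is ever reached. A hedged sketch of that lookup (toy table and function names, not the actual C code):

```python
# Hypothetical sketch: WIN1250 -> LATIN2 goes through MULE internal code,
# so the error is raised on the WIN1250 -> MULE_INTERNAL leg.
# WIN1250_TO_MIC is an invented toy table; 0 means "no translation".

WIN1250_TO_MIC = {0xA3: 0xA3, 0x96: 0}

def win1250_to_latin2(data):
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(byte)          # ASCII passes through unchanged
            continue
        mic = WIN1250_TO_MIC.get(byte, 0)
        if mic == 0:
            # This is where the (misleading) message originates
            raise ValueError(
                f"character with byte sequence {byte:#04x} in encoding "
                '"WIN1250" has no equivalent in encoding "MULE_INTERNAL"')
        out.append(mic)               # second leg (MIC -> LATIN2) omitted
    return bytes(out)
```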
I can understand the reluctance to back-patch; nobody likes his
application to suddenly fail after a minor database upgrade.
However, the people whose applications would break if this were back-patched
are the same people who will certainly run into trouble when they
a) upgrade to a release where this is fixed, or
b) try to convert their database to, say, UTF8.
The least we should do is stick a fat warning into the release notes
of the first version where this is fixed, along with some guidelines on what
to do (though I am afraid that there is little more helpful to say than
"If your database encoding is X and data have been entered with client_encoding Y,
fix your data in the old system").
But I think that this fix should be applied to 9.6.
PostgreSQL has a strong reputation for being strict about correct encoding
(not saying that everybody appreciates that), and I think we shouldn't mar
that reputation.
Yours,
Laurenz Albe