BUG #13785: Postgresql encoding screw-up

From: ntpt(at)seznam(dot)cz
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #13785: Postgresql encoding screw-up
Date: 2015-11-26 12:15:58
Message-ID: 20151126121558.2760.44255@wrigleys.postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 13785
Logged by: ntpt
Email address: ntpt(at)seznam(dot)cz
PostgreSQL version: 9.4.5
Operating system: linux
Description:

Hi all,

I have been using PostgreSQL for several years, but after a recent hardware
upgrade I ran into a problem.

pg_exec(): Query failed: ERROR: character with byte sequence 0xc2 0x96 in
encoding "UTF8" has no equivalent in encoding "WIN1250"

This is strange. The sequence of events was:

1. There was a database with LATIN2 encoding.
2. An application connected to this database with "set client_encoding to
win1250" and manipulated data.
3. The database was dumped with pg_dump -E UTF8.
4. The dump was restored with pg_restore into a UTF8-encoded database on
another cluster.
5. The application connected to the new database with "set client_encoding
to win1250".
6. Queries now fail with the error above.

How could invalid characters reach the database in this scenario?

The whole thread is here:
http://postgresql.nabble.com/Query-failed-ERROR-character-with-byte-sequence-0xc2-0x96-in-encoding-quot-UTF8-quot-has-no-equivale-td5875048.html

I have examined this situation. I am afraid there is a major design flaw or
bug that can screw up a lot of databases.

Please look at https://cs.wikipedia.org/wiki/Windows-1250 (in Czech only,
but the comparison table "Srovnání s ISO 8859-2" ["Comparison with ISO
8859-2"] is self-explanatory).

In a situation where the database has LATIN2 encoding but the client sets
client_encoding to win1250 and manipulates data: because some characters in
cp1250 (win1250) are not present in the LATIN2 encoding, PostgreSQL stores
the values of those characters "as is", i.e. without charset translation.
For example, 0x96 (EN DASH in win1250) is stored as 0x96 (undefined in
LATIN2) (at least in version 9.0).
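The mismatch between the two encodings can be illustrated with Python's
codec tables (a sketch only; I assume Python's cp1250 and iso-8859-2 maps
match the conversion tables PostgreSQL ships, which it maintains itself):

```python
raw = b"\x96"

# In win1250 (cp1250) this byte is EN DASH:
print(raw.decode("cp1250"))       # '\u2013' (EN DASH)

# iso-8859-2 (latin2) leaves 0x80-0x9f as C1 control codes, so the byte
# has no printable meaning there; it is simply stored "as is":
print(raw.decode("iso-8859-2"))   # '\u0096' (a control character)
```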

If you read from that database with client_encoding set to win1250, these
"as is" characters are returned "as is", and from the point of view of the
client application everything works as expected.

But if you need to migrate and transcode the database to UTF8 in the
recommended way (pg_dump -E UTF8), things go wrong.

Because there is no character 0x96 in LATIN2, the transcoder to UTF8 has no
recipe for how to treat this character, and leaves it "as is", producing the
character U+0096 in the output. But if you look at the table "Mapování do
Unikódu" ["Mapping to Unicode"] on the wiki page mentioned above, the proper
Unicode code point for this character in WIN1250 is U+2013. (And PostgreSQL
cannot know this, because it does not know which code page these stored
"as is" characters came from.)

So if you restore the dump into a new UTF8 database, some of the characters
will carry "unicodized" raw values: characters that were not present in the
original database encoding, but were present in the original client
encoding. The result is the error described in my mailing-list post.

So everyone who needs to migrate to UTF8, and whose original database
encoding has no representation for some characters in the client encodings,
ends up like this: with a working, but screwed-up and non-transferable,
database.

I think the safe practice would be: pg_dump with -E set to the encoding used
by the client application, and then restore into a newly created UTF8
database. This should at least be mentioned in the documentation as the safe
way.

But in an environment where clients use multiple encodings that have more
characters than the database encoding, the database is screwed forever: it
will work, but it cannot be repaired and migrated to another encoding.
Correct me if I am wrong.
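One hypothetical after-the-fact repair (not something from the thread, and
only valid under the strong assumption that every "as is" byte came from a
single known client encoding, here win1250): remap the stray C1 control
characters in the transcoded text back through cp1250. A sketch:

```python
def remap_c1_via_cp1250(text: str) -> str:
    """Replace stray C1 controls (U+0080..U+009F) with the character the
    same byte means in cp1250. Assumes ALL such controls in the data are
    leftovers of un-translated win1250 bytes, which you must verify."""
    out = []
    for ch in text:
        if "\x80" <= ch <= "\x9f":
            # 'replace' covers the few bytes cp1250 leaves undefined
            ch = bytes([ord(ch)]).decode("cp1250", "replace")
        out.append(ch)
    return "".join(out)

print(remap_c1_via_cp1250("foo \x96 bar"))   # 'foo \u2013 bar', EN DASH restored
```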

PS: Excuse my bad English.
