From: | Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> |
---|---|
To: | Stephane Bortzmeyer <bortzmeyer(at)nic(dot)fr> |
Cc: | Ronald Vyhmeister <rvyhmeister(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org |
Subject: | Re: Problem with langage encoding |
Date: | 2008-12-02 00:34:44 |
Message-ID: | 493482A4.8090409@postnewspapers.com.au |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Stephane Bortzmeyer wrote:
> On Thu, Nov 27, 2008 at 02:34:17AM +0900,
> Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> wrote
> a message of 188 lines which said:
>
>>> ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"
>> Which it does not; that character is "???" (HANGUL SYLLABLE SSYEG)
>
> No, I don't think so. I think that 0xc3bd is the binary value, and
> 0xc3bd in UTF-8 is U+00FD (LATIN SMALL LETTER Y WITH ACUTE) which is
> less surprising. It is in Latin-1 but not in WIN1251.
Yes, that would make a lot more sense. From there it's easy, see below:
> In KOI8-R, FD is the Shcha, a common letter in Russian, so I wonder if
> the database was really in Latin-1.
OK, so what we end up with is the following sequence (expressed in
Python, which is always handy for encoding conversions etc):
print "\xc3\xbd".decode("utf-8").encode("latin-1").decode("koi8_r")
to yield: Щ
Assuming that's right, what presumably happened is that the database was
initialized with UTF-8 and loaded with data it was told was in the
latin-1 encoding, but was actually in the koi8_r encoding. Therefore,
utf-8 sequences were generated based on the interpretation of the bytes
for each koi8_r character as the latin-1 character for the same byte
value, so:
Input file: 0xfd (latin-1: ý, koi8_r: Щ)
|
| (input interpreted as latin-1)
v
Database: 0xc3bd (utf-8: ý)
To recover the data you must reverse that process. Thankfully it's going
to be 100% reversible, ie no information has been lost.
To create a tiny test table for the following explanation and fix code I
just ran:
create table ss ( x text) ;
insert into ss (x) values (E'\xc3\xbd');
Now if I:
set client_encoding = "WIN1251";
I get:
test=> select * from ss;
ERROR: character 0xc3bd of encoding "UTF8" has no equivalent in "WIN1251"
just like you.
With client_encoding set as utf-8 (which is what my machine is) I get:
test=> select * from ss;
x
---
ý
(1 row)
PostgreSQL's convert() function:
http://www.postgresql.org/docs/current/static/functions-string.html
may be used now to transform your data. It doesn't assume any encoding
for the input string, unlike convert_from and convert_to, so you can use
a statement like this to convert your data:(where 'x' is the string of
mangled data to be converted):
select (
convert_from(
convert(x::bytea, 'utf-8', 'latin-1'),
'koi8_r')
) from ss;
In other words: "Decode the utf-8 sequence in the input and map each
utf-8 code point to the corresponding latin-1 character, outputting one
byte per latin-1 character. Interpret the sequence of bytes just
produced as a sequence of characters in the koi8_r encoding, and map
them to the same characters in the database's internal encoding."
To copy the converted data to a new table:
CREATE TABLE converted (y text);
INSERT INTO converted(y)
SELECT (
convert_from(
convert(x::bytea, 'utf-8', 'latin-1'),
'koi8_r')
) from ss;
Now if I SELECT from the table of converted data, I get the right(?) output:
test=> select * from converted;
y
---
Щ
(1 row)
You can easily wrap that up in a simple SQL function:
CREATE OR REPLACE FUNCTION fixstr(text) RETURNS text AS $$
SELECT convert_from(convert($1::bytea, 'utf-8', 'latin-1'),'koi8_r')
$$ LANGUAGE 'SQL' IMMUTABLE;
so you can just:
test=> select fixstr(x) from ss;
fixstr
--------
Щ
(1 row)
--
Craig Ringer
From | Date | Subject | |
---|---|---|---|
Next Message | John Smith | 2008-12-02 01:37:09 | Re: slow, long-running 'commit prepared' |
Previous Message | Jason Long | 2008-12-01 23:45:15 | Re: Monty on MySQL 5.1: "Oops, we did it again" |