Re: Mixed UTF8 / Latin1 database

From: Frank Finner <postgresql(at)finner(dot)de>
To: Claudio Cicali <c(dot)cicali(at)mclink(dot)it>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Mixed UTF8 / Latin1 database
Date: 2004-04-18 16:41:34
Message-ID: 20040418184134.16c8931e.postgresql@finner.de
Lists: pgsql-general

On Fri, 16 Apr 2004 14:38:33 +0200 Claudio Cicali <c(dot)cicali(at)mclink(dot)it> sat down, thought long and
then wrote:

> Hi,
>
> I'm trying to restore a pg_dump-backed up database from one
> server to another. The problem is that the db is "mixed encoded"
> in UTF-8 and LATIN1... (weird but, yes it is ! It was ported once
> from a hypersonic db... that screwed up something and now I'm
> fighting with that...).
>
> So, trying to restore that db into a UTF-8 encoded new one, gives
> me errors ("invalid unicode character..."), but importing it
> into a LATIN1 encoded one, gives me weird characters (of course).

Hi,

I had a similar problem some months ago. I did it like this (all in one line):

PGUSER=postgres ssh -C source_server 'PGUSER=postgres pg_dump -c -t table database' | recode latin1..utf8 | psql -a database postgres

I used the well-known UNIX program "recode", which does the job very well. The really nasty thing
about this method is that if you run it on a table that already contains UTF-8 encoded characters,
they get encoded a second time and come out as garbage (double-encoded text) instead of the original
characters. So I first tried the copy without recoding, noted which tables caused errors, and then
recoded only those tables while copying the others as they were. That worked well: all the errors
were gone afterwards.
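
The double-encoding trap described above can be reproduced in a few lines. This is a minimal Python
sketch, not the recode-based procedure itself; the helper name recode_if_needed is made up for
illustration. It assumes the dump is handled as raw bytes and that anything which is not valid UTF-8
is Latin1.

```python
def recode_if_needed(data: bytes) -> bytes:
    """Return data as UTF-8, converting from Latin1 only if it is not valid UTF-8 already."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8 (pure ASCII also ends up here)
    except UnicodeDecodeError:
        # Assumption: whatever is not valid UTF-8 is Latin1.
        return data.decode("latin-1").encode("utf-8")


latin1_dump = "K\u00e4se".encode("latin-1")  # b'K\xe4se'
utf8_dump = "K\u00e4se".encode("utf-8")      # b'K\xc3\xa4se'

# Blind recoding of data that is already UTF-8 encodes it a second time:
double_encoded = utf8_dump.decode("latin-1").encode("utf-8")  # b'K\xc3\x83\xc2\xa4se'

# The guarded version converts the Latin1 dump and leaves the UTF-8 dump alone:
fixed_latin1 = recode_if_needed(latin1_dump)
fixed_utf8 = recode_if_needed(utf8_dump)
```

The check works per chunk of data, so it matches the "recode only the failing tables" approach. It
does not help with a table that mixes both encodings, because such data fails the UTF-8 check as a
whole and would be recoded blindly.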

If you have mixed tables (tables with Latin1 AND UTF-8), I am afraid you have to do the dirty work
by hand, for example with a Perl script that reads the dump and does something like this for every
line:

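use strict;
use warnings;

# Read and write raw bytes so Perl does not reinterpret either encoding.
open(my $in, '<:raw', '/path/to/input/file')      # This would be your pg_dump'ed mixed-up file
    or die "Cannot open input: $!";
open(my $out, '>:raw', '/path/to/output/file')    # This should become a clean dump in UTF-8
    or die "Cannot open output: $!";
while (my $line = <$in>)
{
    # Replace the Latin1 byte for "ä" (\xE4) with the UTF-8 encoding of "ä" (\xC3\xA4).
    # An "ä" that is already UTF-8 contains no \xE4 byte, so it is left alone.
    $line =~ s/\xE4/\xC3\xA4/g;
    print {$out} $line;
}
close $in;
close $out or die "Cannot close output: $!";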

That is, replace each Latin1 character (only "ä" in this example) with its UTF-8 encoding. In
German there are only 7 of them (äöüÄÖÜß), so it's not too hard, but I am afraid your mileage may
vary. You need one substitution line ($line =~ ...) for every Latin1 character that might occur in
your dump. After the substitution you can load the dump into the UTF-8 database.
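
For illustration, the same per-character substitution can be written as one pass over all seven
German characters. This is a minimal Python sketch rather than the Perl approach itself; the names
GERMAN, fix_line and fix_dump are hypothetical. It assumes these seven are the only non-ASCII
characters in the dump: their Latin1 bytes never appear inside the UTF-8 encodings of the same seven
characters, but they can be lead bytes of other UTF-8 characters, which this substitution would
corrupt.

```python
import re

GERMAN = "äöüÄÖÜß"
# Map each single Latin1 byte to its two-byte UTF-8 sequence, e.g. b'\xe4' -> b'\xc3\xa4'.
LATIN1_TO_UTF8 = {ch.encode("latin-1"): ch.encode("utf-8") for ch in GERMAN}
# Byte character class matching any of the seven Latin1 bytes.
LATIN1_BYTES = re.compile(b"[" + b"".join(LATIN1_TO_UTF8) + b"]")


def fix_line(line: bytes) -> bytes:
    """Replace Latin1 bytes of the seven German characters with their UTF-8 encoding."""
    return LATIN1_BYTES.sub(lambda m: LATIN1_TO_UTF8[m.group(0)], line)


def fix_dump(src_path: str, dst_path: str) -> None:
    """Copy a dump line by line in binary mode, fixing each line on the way."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(fix_line(line))


# One line containing a Latin1 "ä" followed by a UTF-8 "ä":
mixed = "Käse".encode("latin-1") + b" " + "Bär".encode("utf-8")
cleaned = fix_line(mixed).decode("utf-8")  # "Käse Bär"
```

Doing all replacements in a single regular-expression pass means bytes written by one substitution
are never looked at again by another.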

Before using the result in production, test whether it is really clean! Well, if you don't get any
more "invalid unicode character..." errors, it should be OK.
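
One way to test the cleaned dump before loading it is to look for lines that are not valid UTF-8.
This is a minimal Python sketch with a hypothetical helper name, invalid_utf8_lines; it is an
offline approximation and not the check PostgreSQL itself performs.

```python
import os
import tempfile


def invalid_utf8_lines(path):
    """Return the 1-based numbers of the lines in a file that are not valid UTF-8."""
    bad = []
    with open(path, "rb") as dump:
        for number, line in enumerate(dump, start=1):
            try:
                line.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(number)
    return bad


# Usage on a throwaway file: line 2 is Latin1, lines 1 and 3 are valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ok line\n")
    tmp.write("K\u00e4se\n".encode("latin-1"))
    tmp.write("K\u00e4se\n".encode("utf-8"))
result = invalid_utf8_lines(tmp.name)  # [2]
os.remove(tmp.name)
```

An empty result only means every line decodes as UTF-8. Double-encoded text is valid UTF-8 too, so
it still needs a look by eye.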

>
> I'm wondering if anyone could have a script or something to help me
> with this situation... :(
>
> thanks.

Hope I could help.

>
>
>
> --
> Claudio Cicali
> c(dot)cicali(at)mclink(dot)it
> http://www.flexer.it
> GPG Key Fingerprint = 2E12 64D5 E5F5 2883 0472 4CFF 3682 E786 555D 25CE
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings

Regards, Frank.
