From: | Frank Finner <postgresql(at)finner(dot)de> |
---|---|
To: | Claudio Cicali <c(dot)cicali(at)mclink(dot)it> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Mixed UTF8 / Latin1 database |
Date: | 2004-04-18 16:41:34 |
Message-ID: | 20040418184134.16c8931e.postgresql@finner.de |
Lists: | pgsql-general |
On Fri, 16 Apr 2004 14:38:33 +0200 Claudio Cicali <c(dot)cicali(at)mclink(dot)it> sat down, thought long and
then wrote:
> Hi,
>
> I'm trying to restore a pg_dump-backed up database from one
> server to another. The problem is that the db is "mixed encoded"
> in UTF-8 and LATIN1... (weird but, yes it is ! It was ported once
> from a hypersonic db... that screwed up something and now I'm
> fighting with that...).
>
> So, trying to restore that db into a UTF-8 encoded new one, gives
> me errors ("invalid unicode character..."), but importing it
> into a LATIN1 encoded one, gives me weird characters (of course).
Hi,
I had a similar problem some months ago. I did it like this (all on one line):
PGUSER=postgres ssh -C source_server 'PGUSER=postgres pg_dump -c -t table database' | recode latin1..utf8 | psql -a database postgres
I used the well-known UNIX program "recode", which does the job very well. The really nasty thing
about this method is that if you run it on a table that already contains UTF-8 encoded characters,
they get encoded a second time and turn into garbage (an "ä" ends up as "Ã¤"). So I first tried it
without recoding to find out which tables caused errors, then recoded only those tables while
copying and copied the others as they were. That worked well: all errors were gone afterwards.
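To find out which tables need recoding without loading them into the database first, a small check
like the following might help. It is only a sketch: it assumes each table has been dumped to its own
file, and the script name and the helper names is_valid_utf8 and invalid_lines are made up for this
example. It uses the strict UTF-8 decoder from Perl's Encode module and reports the line numbers
that are not valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Return true if the byte string is well-formed UTF-8.
# decode() with FB_CROAK dies on malformed input, which eval catches.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return defined eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
}

# Read a dump from the given filehandle and return the numbers of
# all lines that are not valid UTF-8.
sub invalid_lines {
    my ($fh) = @_;
    my @bad;
    while (my $line = <$fh>) {
        push @bad, $. unless is_valid_utf8($line);
    }
    return @bad;
}

# Hypothetical usage: perl check_utf8.pl table1.dump table2.dump
for my $path (@ARGV) {
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my @bad = invalid_lines($fh);
    close $fh;
    print "$path: ", (@bad ? "invalid UTF-8 on lines @bad" : "clean"), "\n";
}
```

A file reported as "clean" can be copied as it is. A file with invalid lines needs recoding, and if
only some of its lines are invalid, it is probably one of the mixed tables.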
If you have mixed tables (tables containing both Latin1 AND UTF-8), I am afraid you have to do the
dirty work by hand, for example with a Perl script that reads the dump and does something like this
for every line:
use strict;
use warnings;

open(my $in,  '<:raw', '/path/to/input/file')  or die "Cannot open input: $!";   # This would be your pg_dump'ed mixed-up file
open(my $out, '>:raw', '/path/to/output/file') or die "Cannot open output: $!";  # This should become a clean dump in UTF-8
while (my $line = <$in>)
{
    # Replace the Latin1 byte for "ä" (\xE4) with its UTF-8 encoding (\xC3\xA4).
    $line =~ s/\xE4/\xC3\xA4/g;
    print {$out} $line;
}
close $in;
close $out or die "Cannot close output: $!";
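That is, substitute Latin1 characters (only "ä" in this example) with their UTF-8 encodings. In
German there are only seven of them (äöüÄÖÜß), so it's not too hard, but your mileage may vary. You
need one substitution line ($line =~ ...) for every Latin1 character that might occur in your dump.
Be careful if the UTF-8 parts of the dump contain characters other than these: a byte that is a
whole character in Latin1 can also be part of a multi-byte UTF-8 sequence, and substituting it there
would break that sequence. After the substitution you can load the dump into the UTF-8 database.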
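If listing every character by hand gets out of hand, another approach would be to decide line by
line. The following is a sketch under one big assumption: every line of the dump is either entirely
valid UTF-8 or entirely Latin1. Lines that pass a strict UTF-8 check are left alone, everything
else is treated as Latin1 and converted. The script name and the helper name line_to_utf8 are made
up for this example; decode and encode come from Perl's Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Convert one line of raw bytes to UTF-8.
# Lines that already decode as strict UTF-8 are returned unchanged;
# everything else is ASSUMED to be Latin1 (this is a guess, not a detection).
sub line_to_utf8 {
    my ($bytes) = @_;
    my $ok = defined eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $bytes if $ok;
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

# Hypothetical usage: perl fix_dump.pl mixed.dump clean.dump
if (@ARGV == 2) {
    my ($in_path, $out_path) = @ARGV;
    open(my $in,  '<:raw', $in_path)  or die "Cannot open $in_path: $!";
    open(my $out, '>:raw', $out_path) or die "Cannot open $out_path: $!";
    while (my $line = <$in>) {
        print {$out} line_to_utf8($line);
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
}
```

This does not help with a line that mixes both encodings: such a line fails the UTF-8 check, is
converted as Latin1 as a whole, and its UTF-8 parts end up encoded twice. Those lines would still
have to be fixed by hand.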
Before using the result in production, test whether it is really clean! If you don't get any more
"invalid unicode character..." errors, it should be OK.
>
> I'm wondering if anyone could have a script or something to help me
> with this situation... :(
>
> thanks.
Hope I could help.
>
>
>
> --
> Claudio Cicali
> c(dot)cicali(at)mclink(dot)it
> http://www.flexer.it
> GPG Key Fingerprint = 2E12 64D5 E5F5 2883 0472 4CFF 3682 E786 555D 25CE
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
Regards, Frank.