From: "Albe Laurenz" <laurenz(dot)albe(at)wien(dot)gv(dot)at>
To: "Phoenix Kiula *EXTERN*" <phoenix(dot)kiula(at)gmail(dot)com>, "PG-General Mailing List" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Best practices for moving UTF8 databases
Date: 2009-07-21 12:33:39
Message-ID: D960CB61B694CF459DCFB4B0128514C203937E4F@exadv11.host.magwien.gv.at
Lists: pgsql-general
Phoenix Kiula wrote:
> > I wonder: why do you spend so much time complaining instead of
> > simply locating the buggy data and fixing them?
>
>
> I'd love to fix them. But if I do a search for
>
> SELECT * FROM xyz WHERE col like '%0x80%'
>
> it doesn't work. How should I search for these characters?
You cannot find such bytes with LIKE; the pattern '%0x80%' searches for the
literal characters "0x80", not for the byte 0x80. Instead, I would get GNU libiconv:
http://www.gnu.org/software/libiconv/
Then take a plain text pg_dump of the database in UTF-8.
Let's call it "db.orig.dmp".
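To get such a dump, something like this should work (the database
name "mydb" is just a placeholder):

pg_dump --encoding=UTF8 --format=plain mydb > db.orig.dmp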
Strip all problem characters:
iconv -f UTF-8 -t UTF-8 -c <db.orig.dmp >db.stripped.sql
Compare both files:
diff -u db.orig.dmp db.stripped.sql
That output will show all lines containing a problem character.
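If the dump is large, you can cut the output down to just the removed
(i.e., problematic) lines, for example:

diff -u db.orig.dmp db.stripped.sql | grep '^-' | grep -v '^---'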
Now the tedious part:
Use "db.orig.dmp" to find out which tables they belong to,
locate the records in the database by primary key and fix them.
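A fix might then look like this (database "mydb", column "col" and
key 42 are made-up examples; "xyz" is the table from your query above):

psql -d mydb -c "UPDATE xyz SET col = 'corrected text' WHERE id = 42;"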
Alternatively, you can use iconv's byte substitution to auto-repair
the dump if you know which bytes are causing the problem.
For example, if you know that the trouble stems only from 0x80 bytes
that should be Euro symbols, you could:
iconv -f UTF-8 -t UTF-8 --byte-subst="<0x%x>" <db.orig.dmp | sed -e 's/<0x80>/EUR/g' >db.fixed.sql
The resulting "db.fixed.sql" could then be loaded into the new database.
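For example, assuming the new database is called "newdb" and should be
created with UTF8 encoding:

createdb --encoding=UTF8 newdb
psql -d newdb -f db.fixed.sql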
Yours,
Laurenz Albe