| From: | Reece Hart <reece(at)harts(dot)net> | 
|---|---|
| To: | SF PostgreSQL <sfpug(at)postgresql(dot)org> | 
| Subject: | Re: locale and encoding advice | 
| Date: | 2006-08-04 18:14:50 | 
| Message-ID: | 1154715290.16476.419.camel@tallac.gene.com | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | sfpug | 
David Fetter wrote:
        I'd go with a C locale and a UTF8 encoding.  The downside of
        this is
        that you may have some trouble importing your old data.
Perhaps you mean this kind of trouble?
        psql:<stdin>:142677171: ERROR:  invalid byte sequence for
        encoding "UTF8": 0x96
This was obtained roughly like this:
$ createdb -E UTF8 newdb
$ pg_dump ... olddb | psql -d newdb -qaf-
The pg_dump prologue has a line:
SET client_encoding = 'SQL_ASCII';
As I said the OP, I need the idiot's guide to encoding... please let me
check my understanding.  olddb data is stored on disk in the
server_encoding (SQL_ASCII).  When I pg_dump, it's written in the
server_encoding and this is indicated by explicitly setting
client_encoding in the dump prologue.  When I load into a UTF8 db,
client_encoding is set to SQL_ASCII (it would otherwise be newdb's UTF8
by default), server_encoding is UTF8, and the backend (?) tries to
translate SQL_ASCII chars to UTF8 chars.  In my case, we've choked on
this translation.  Is that it?
Here's how I addressed this particular problem:
In looking at the log, I can see which table this is in.  It's data I
imported from elsewhere.  To find the offending lines, I did:
=> select * from mint \g | perl -ne 'print if $.<3 or m/\x96/' | less
It turns out that only 7 rows are problematic in this way. I did this:
=> update mint set comments=replace(comments,'\x96','-') where
comments~'\x96';
I'm reloading now and we'll see whether I had other problems.
In the future, could I pipe through recode (or other tool?) to do the
translation outside of postgresql?  Does anyone have a recommendation to
fix this sort of problem on a larger scale or when a more sophisticated
translation is needed (than my update statement above, I mean).
Thanks for your help.
-Reece
-- 
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Josh Berkus | 2006-08-04 23:39:20 | Initial LWE information up | 
| Previous Message | Reece Hart | 2006-08-04 04:43:57 | Re: locale and encoding advice |