From: | "Markus Wollny" <Markus(dot)Wollny(at)computec(dot)de> |
---|---|
To: | "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us> |
Cc: | <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: invalid UTF-8 byte sequence detected |
Date: | 2005-11-16 10:16:42 |
Message-ID: | 28011CD60FB1724DBA4442E38277F626307015@hermes.computec.de |
Lists: | pgsql-general |
Hi!
> -----Original Message-----
> From: Bruce Momjian [mailto:pgman(at)candle(dot)pha(dot)pa(dot)us]
> Sent: Tuesday, 15 November 2005 19:46
> To: Markus Wollny
> Cc: pgsql-general(at)postgresql(dot)org
> Subject: Re: [GENERAL] invalid UTF-8 byte sequence detected
>
>
> I am also confused how invalid UTF8 sequences got into your database.
> It shouldn't have been possible.
>
Our databases were originally encoded in ASCII, though they should have been LATIN9 - or UNICODE; we remedied this a long time ago by running iconv on the dumps: our PostgreSQL version then was 7.4, and we converted the databases and dumps to UTF-8. Maybe the invalid byte sequences were entered sometime during our migration from ODBC to JDBC while our encoding was still a mess - though I would have thought that the dump/iconv/restore cycle should have resolved that.

However, I suspect the real cause of the issue was more or less a bug in PostgreSQL < 8.1, which accepted certain illegal byte sequences. In our case, I found that the re-import of the dump errored out on ISO-8859-1 encoded backticks (´) - certain mournfully misled individuals use this "degu" character instead of the apostrophe even though it's more difficult to type on a German keyboard layout. And quite wrong, too.
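For reference, the original re-encoding step was a plain iconv pass over the text dump. A minimal sketch - the file names are illustrative, and a tiny stand-in "dump" is created here so the commands are self-contained:

```shell
# Illustrative only: re-encode a LATIN9 (ISO-8859-15) dump to UTF-8.
# Create a stand-in "dump" containing a Latin-9 a-umlaut (byte 0xE4):
printf 'INSERT INTO t VALUES (uml\344ut);\n' > dump_latin9.sql

# iconv rewrites each Latin-9 byte into its UTF-8 equivalent
# (0xE4 becomes the two-byte sequence 0xC3 0xA4):
iconv -f LATIN9 -t UTF-8 dump_latin9.sql > dump_utf8.sql
```

The resulting dump_utf8.sql can then be restored into a database created with UNICODE encoding.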
Anyway, I found a reference on the hackers list to the tightening of encoding consistency for Unicode (see http://archives.postgresql.org/pgsql-hackers/2005-10/msg00972.php ). Both a solution and a suggestion were posted in that thread; Christopher Kings-Lynne suggested including a reference to this issue in the 'upgrading/backward compatibility' section of the release notes - I strongly second his suggestion :)
The suggested solution was to feed the plain dump through iconv again; however, at least on my systems (Debian Sarge), iconv didn't like my >5GB dump files. So in order to successfully re-import the dumps, I had to "split --line-bytes=650m" the SQL file, pass the parts through "iconv -c -f UTF8 -t UTF8", and concatenate them back into one file. There were no more errors when feeding the dump back into psql, and I didn't come across any missing data during my tests, so this definitely did the trick for me.
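The split/iconv/concatenate workaround above can be sketched as follows (file and prefix names are illustrative, and a tiny stand-in "dump" with one invalid byte is created so the pipeline is self-contained; "-c" tells iconv to silently drop anything it cannot convert):

```shell
# Stand-in "dump": one valid line plus one line containing byte 0xB4,
# the ISO-8859-1 acute accent, which is invalid as UTF-8 on its own.
printf 'valid line\n' > dump.sql
printf 'bad \264 byte\n' >> dump.sql

# Split on line boundaries so no multi-byte sequence is cut in half:
split --line-bytes=650m dump.sql dump.part.

# "iconv -c -f UTF8 -t UTF8" passes valid UTF-8 through unchanged and
# drops the illegal bytes; the loop concatenates the cleaned parts:
for part in dump.part.*; do
    iconv -c -f UTF8 -t UTF8 "$part"
done > dump.clean.sql
```

dump.clean.sql can then be restored with psql without triggering the "invalid UTF-8 byte sequence" error - at the price that the offending bytes are silently removed rather than transliterated.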
As 8.1 has tightened its encoding-consistency checks for Unicode, I believe the databases should be safe from any illegal byte sequences in text fields from now on.
Kind regards
Markus