Re: ERROR: invalid byte sequence for encoding "UTF8": 0xc35c

From: Jasmin Dizdarevic <jasmin(dot)dizdarevic(at)gmail(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, pgsql-general(at)postgresql(dot)org
Subject: Re: ERROR: invalid byte sequence for encoding "UTF8": 0xc35c
Date: 2011-03-03 01:18:36
Message-ID: AANLkTintDxjykQ8R2eazvXr8Kbqvm=8opjw0YSaYZnbq@mail.gmail.com
Lists: pgsql-general

@ALL: Isn't it possible and wise to include an (optional) encoder in pgsql?

We import a lot of data from text files that are not UTF-8, so we always
have to convert the encoding in another tool before using COPY.
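
A minimal sketch of that conversion step in Python 3, assuming the source
file is known to be entirely Latin-1. The function name and the file paths
are hypothetical; nothing here is part of PostgreSQL itself.

```python
import shutil


def transcode_to_utf8(src_path, dst_path, src_encoding="latin-1"):
    """Rewrite src_path as UTF-8, assuming all of it is in src_encoding."""
    # newline="" on both sides keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # Copies in chunks, so large import files are not read into memory.
        shutil.copyfileobj(src, dst)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    transcode_to_utf8("import_latin1.txt", "import_utf8.txt")
```

Every possible byte is valid Latin-1, so this never fails on a wrong guess;
it only helps when the source encoding is actually known.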

2011/2/28 Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>

> On 27/02/11 20:47, AI Rumman wrote:
> > I am getting error in Postgresql 9.0.1.
> >
> > update import_details_test
> > set data_row = '["4","1 Monor JoÃ\u083ão S. AntÃ\u0083ão
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Because your email client may have transformed the text encoding, I
> can't draw any firm conclusions about what you're actually sending to
> the database, but it's highly likely that you're sending latin-1 encoded
> text to the database while your client_encoding is set to 'utf8'.
>
> The marked text is most likely the problem... but I think there's more
> wrong with it than just being latin-1 encoded. That kind of mangling
> often comes about when utf-8 text has been incorrectly interpreted as
> latin-1 and modified, or when something has incorrectly tried to do
> utf8<->latin-1 conversions more than once. You really need to figure out
> what encoding your input is in, convert it to a known encoding like
> utf-8 *once*, and keep it that way.
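
The kind of mangling described above can be reproduced in a few lines of
Python. This is only an illustration with a made-up sample value, not a
reconstruction of the original poster's data: it shows how one wrong
latin-1 decode of UTF-8 text produces "Ã"-style garbage, and why the byte
pair 0xc3 0x5c from the error message is not valid UTF-8. The helper names
are hypothetical.

```python
original = u"Jo\u00e3o"  # the name "João" as a Unicode string

utf8_bytes = original.encode("utf-8")      # correct UTF-8: b'Jo\xc3\xa3o'
misread = utf8_bytes.decode("latin-1")     # wrong decode, displays as "JoÃ£o"
double_encoded = misread.encode("utf-8")   # b'Jo\xc3\x83\xc2\xa3o'


def undo_one_misread(data):
    """Reverse exactly one utf-8 -> latin-1 misinterpretation."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xc3 opens a two-byte UTF-8 sequence, but 0x5c (a backslash) is not a
# continuation byte, so a latin-1 "Ã" followed by "\" is rejected as UTF-8.
print(is_valid_utf8(b"\xc3\x5c"))
print(undo_one_misread(double_encoded) == original)
```

The repair helper only works when the damage is a single, lossless
misinterpretation; data mangled more than once, or modified in between,
generally cannot be recovered this way.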
>
> If you're using Python, which I suspect you might be, the decode()
> method of byte strings is useful. For example, I can convert a latin-1
> encoded byte string to a Python Unicode string with:
>
> b"somelatin1string".decode("latin-1")
>
> Sometimes you can get away with just "SET client_encoding = 'LATIN1'"
> but in this case your string data looks like it's been mangled by more
> than just a single encoding mis-interpretation, so you'll probably just
> silently insert corrupt data by doing that. Don't. Fix your code so it
> knows what the text encoding of the input is.
>
> If you are, in fact, using Python, it's a really good idea to always
> decode() all your inputs so your internal processing is done on
> Unicode strings rather than encoded bytes (how Python stores them
> internally is an implementation detail). Similarly, Qt programmers
> should convert everything to Unicode QString as soon as possible and use
> that for all internal manipulation. It'll save a lot of pain.
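
A small sketch of that "decode at the edges" advice in Python 3. The three
function names are hypothetical, and the whitespace clean-up merely stands
in for whatever internal processing an application does.

```python
def to_text(raw, encoding):
    """Decode raw input exactly once; fail loudly instead of guessing."""
    if isinstance(raw, str):
        return raw  # already text, so do not decode a second time
    return raw.decode(encoding, errors="strict")


def normalise(text):
    """Placeholder for internal processing, done purely on Unicode text."""
    return " ".join(text.split())


def to_database_bytes(text):
    """Encode once on the way out, matching client_encoding = 'UTF8'."""
    return text.encode("utf-8")


raw_input_bytes = u"Jo\u00e3o   S.".encode("latin-1")  # pretend file input
payload = to_database_bytes(normalise(to_text(raw_input_bytes, "latin-1")))
```

Keeping the input encoding as an explicit argument means the caller has to
state what it is, which is the point of the advice above.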
