
Re: Trouble with UTF-8 data

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Janine Sisk <janine(at)furfly(dot)net>
Cc:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Trouble with UTF-8 data
Date:	2008-01-17 23:38:50
Message-ID:	16915.1200613130@sss.pgh.pa.us
Lists:	pgsql-general

Janine Sisk <janine(at)furfly(dot)net> writes:
> But I'm still getting this error when loading the data into the new
> database:

> ERROR: invalid byte sequence for encoding "UTF8": 0xeda7a1

The reason PG doesn't like this sequence is that it decodes to a
Unicode surrogate code point (U+D9E1, one half of a "surrogate pair"),
which is not supposed to ever appear in UTF-8 representation ---
surrogate pairs are a kluge for UTF-16 to deal with Unicode code points
that don't fit in 16 bits. See

http://en.wikipedia.org/wiki/UTF-16

I think you need a version of iconv that knows how to fold surrogate
pairs into proper UTF-8 form. It might also be that the data is
outright broken --- if this sequence isn't followed by another
surrogate-pair sequence then it isn't valid Unicode by anybody's
interpretation.
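
To make the point above concrete, here is a minimal Python sketch (not
something from the original report, and not how PG or iconv do it
internally). It decodes the three offending bytes by hand to show they
land in the surrogate range, and then folds a high/low surrogate pair
into proper 4-byte UTF-8. The function names decode_3byte and
fold_surrogates are made up for illustration, and the sketch assumes
Python 3.4 or later, where the utf-16 codecs accept "surrogatepass".

```python
def decode_3byte(seq: bytes) -> int:
    """Decode a 3-byte UTF-8-style sequence to a code point, with no validity checks."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(cp: int) -> bool:
    """True for U+D800..U+DFFF, the range reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF


def fold_surrogates(data: bytes) -> bytes:
    """Re-encode surrogate pairs stored as two 3-byte sequences into 4-byte UTF-8.

    Raises UnicodeDecodeError if a surrogate is not part of a proper
    high/low pair, i.e. the data is outright broken.
    """
    # Let the surrogates through as individual code points ...
    text = data.decode("utf-8", errors="surrogatepass")
    # ... write them out as UTF-16 code units ...
    utf16 = text.encode("utf-16-le", errors="surrogatepass")
    # ... and decode strictly: a valid pair becomes one code point,
    # a lone surrogate is rejected here.
    return utf16.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    cp = decode_3byte(b"\xed\xa7\xa1")
    print(hex(cp), is_surrogate(cp))
```

Run on the bytes from the error message alone, fold_surrogates raises
UnicodeDecodeError, which matches the "outright broken" case: the
sequence is only salvageable if a low surrogate (a 3-byte sequence
starting 0xed 0xb0 through 0xed 0xbf) follows it immediately.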

PostgreSQL 7.2.x unfortunately didn't check Unicode data carefully, and
would have let this data pass without comment ...

regards, tom lane

In response to

Trouble with UTF-8 data at 2008-01-17 23:02:22 from Janine Sisk

Responses

Re: Trouble with UTF-8 data at 2008-01-18 08:00:21 from Albe Laurenz
