From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Janine Sisk <janine(at)furfly(dot)net> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Trouble with UTF-8 data |
Date: | 2008-01-17 23:38:50 |
Message-ID: | 16915.1200613130@sss.pgh.pa.us |
Lists: | pgsql-general |
Janine Sisk <janine(at)furfly(dot)net> writes:
> But I'm still getting this error when loading the data into the new
> database:
> ERROR: invalid byte sequence for encoding "UTF8": 0xeda7a1
The reason PG doesn't like this sequence is that it decodes to a
Unicode surrogate code point (U+D9E1, one half of a "surrogate pair"),
which is never supposed to appear in UTF-8 representation. Surrogate
pairs are a kluge for UTF-16 to deal with Unicode code points of more
than 16 bits. See
http://en.wikipedia.org/wiki/UTF-16
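
To see why the bytes in the error message are rejected, here is a minimal Python sketch that unpacks 0xeda7a1 by hand using the standard 3-byte UTF-8 bit layout. The helper names (decode_three_byte_sequence, is_surrogate) are made up for illustration and are not part of PostgreSQL or iconv.

```python
def decode_three_byte_sequence(seq: bytes) -> int:
    """Unpack a 3-byte sequence of the form 1110xxxx 10xxxxxx 10xxxxxx.

    Hypothetical helper: it only applies the bit layout and does not
    check whether the resulting code point is legal in UTF-8.
    """
    if len(seq) != 3 or seq[0] & 0xF0 != 0xE0:
        raise ValueError("not a 3-byte UTF-8 lead byte pattern")
    if any(b & 0xC0 != 0x80 for b in seq[1:]):
        raise ValueError("bad continuation byte")
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def is_surrogate(code_point: int) -> bool:
    """True for U+D800..U+DFFF, the range reserved for UTF-16 surrogates."""
    return 0xD800 <= code_point <= 0xDFFF


bad = bytes.fromhex("eda7a1")
cp = decode_three_byte_sequence(bad)
print(f"U+{cp:04X}", is_surrogate(cp))  # U+D9E1 True

# A strict UTF-8 decoder refuses the same bytes.
try:
    bad.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

U+D9E1 falls in the high-surrogate half of the range (U+D800 to U+DBFF), so on its own it names no character.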
I think you need a version of iconv that knows how to fold surrogate
pairs into proper UTF-8 form. It might also be that the data is
outright broken --- if this sequence isn't followed by another
surrogate-pair sequence then it isn't valid Unicode by anybody's
interpretation.
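
As a rough illustration of what "folding" means here, the sketch below uses Python's built-in codecs rather than iconv. It assumes the dump encodes each surrogate as its own 3-byte sequence; fold_surrogates is a hypothetical helper name. Properly paired surrogates are merged into a single 4-byte UTF-8 sequence, and an unpaired surrogate raises an error, which matches the "outright broken" case.

```python
def fold_surrogates(raw: bytes) -> bytes:
    """Re-encode surrogate pairs found in UTF-8-like input as proper UTF-8.

    Hypothetical helper, not an iconv replacement. Raises
    UnicodeDecodeError if a surrogate is not part of a valid pair.
    """
    # "surrogatepass" lets the 3-byte surrogate encodings through as
    # individual surrogate code points instead of rejecting them.
    text = raw.decode("utf-8", errors="surrogatepass")
    # Round-trip through UTF-16: the strict UTF-16 decoder combines each
    # high/low pair into one code point and rejects lone surrogates.
    utf16 = text.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le").encode("utf-8")


# U+1F600 written as two 3-byte surrogate sequences (D83D, DE00).
paired = bytes.fromhex("eda0bdedb880")
print(fold_surrogates(paired).hex())  # f09f9880

# The sequence from the error message, with nothing paired after it.
try:
    fold_surrogates(bytes.fromhex("eda7a1"))
except UnicodeDecodeError:
    print("lone surrogate: data is broken, not just mis-encoded")
```

Whether the real dump contains valid pairs can only be settled by looking at the bytes that follow 0xeda7a1 in the actual data.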
7.2.x unfortunately didn't check Unicode data carefully, and would
have let this data pass without comment ...
regards, tom lane