| From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
|---|---|
| To: | Janine Sisk <janine(at)furfly(dot)net> |
| Cc: | pgsql-general(at)postgresql(dot)org |
| Subject: | Re: Trouble with UTF-8 data |
| Date: | 2008-01-17 23:38:50 |
| Message-ID: | 16915.1200613130@sss.pgh.pa.us |
| Lists: | pgsql-general |
Janine Sisk <janine(at)furfly(dot)net> writes:
> But I'm still getting this error when loading the data into the new
> database:
> ERROR: invalid byte sequence for encoding "UTF8": 0xeda7a1
The reason PG doesn't like this sequence is that it decodes to a
Unicode surrogate code point (U+D9E1, one half of a "surrogate pair"),
which is not supposed to ever appear in UTF-8 representation ---
surrogate pairs are a kluge for UTF-16 to deal with Unicode code
points of more than 16 bits. See
http://en.wikipedia.org/wiki/UTF-16
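
To see why the sequence is rejected, the three bytes can be unpacked by hand. The following is only an illustrative sketch in Python, not anything PostgreSQL itself runs; the helper names decode_three_byte and is_surrogate are made up for this example.

```python
def decode_three_byte(seq: bytes) -> int:
    """Unpack a 3-byte UTF-8-style sequence (1110xxxx 10xxxxxx 10xxxxxx)
    into its code point, deliberately skipping all validity checks."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(cp: int) -> bool:
    """True if cp lies in the range reserved for UTF-16 surrogates."""
    return 0xD800 <= cp <= 0xDFFF


raw = bytes.fromhex("eda7a1")       # the bytes from the error message
cp = decode_three_byte(raw)
print(f"U+{cp:04X}", is_surrogate(cp))

# A strict UTF-8 decoder refuses the same bytes.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

The code point falls in the high-surrogate half of the range (U+D800 to U+DBFF), so a low surrogate would have to follow it for the data to make sense even as UTF-16.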
I think you need a version of iconv that knows how to fold surrogate
pairs into proper UTF-8 form. It might also be that the data is
outright broken --- if this sequence isn't followed by another
surrogate-pair sequence then it isn't valid Unicode by anybody's
interpretation.
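
If iconv cannot do the folding, one possible workaround is to do it in a small script before loading. This is a rough sketch using Python's "surrogatepass" error handler, under the assumption that the dump is otherwise valid UTF-8; fold_surrogate_pairs is a hypothetical helper name, and the sample bytes are a made-up pair, not taken from the data in question.

```python
def fold_surrogate_pairs(data: bytes) -> bytes:
    """Rewrite UTF-8-encoded surrogate pairs as proper 4-byte UTF-8.

    Raises UnicodeDecodeError if a surrogate has no partner, which
    means the input is broken rather than merely mis-encoded.
    """
    # Let the surrogates through as individual code points.
    text = data.decode("utf-8", errors="surrogatepass")
    # A round trip through UTF-16 joins each high/low pair into one
    # code point; the strict decode rejects unpaired surrogates.
    joined = text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")
    return joined.encode("utf-8")


# U+1F600 written as two 3-byte surrogate sequences (D83D, DE00).
pair = bytes.fromhex("eda0bdedb880")
print(fold_surrogate_pairs(pair).hex())

# The sequence from the error message on its own has no partner.
try:
    fold_surrogate_pairs(bytes.fromhex("eda7a1"))
except UnicodeDecodeError:
    print("lone surrogate: data is outright broken")
```

Running this over the whole dump would also tell you which case you are in: if it raises, the data contains unpaired surrogates and needs manual repair.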
7.2.x unfortunately didn't check Unicode data carefully, and would
have let this data pass without comment ...
regards, tom lane