From: | Mike Lewis <mikelikespie(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: BUG #5532: Valid UTF8 sequence errors as invalid |
Date: | 2010-06-30 18:05:24 |
Message-ID: | AANLkTinajUG0XG6bxYO2cuEKhUN_1cMf0HH_lFdy-ily@mail.gmail.com |
Lists: | pgsql-bugs |
>
>
>
> It is not valid. See http://tools.ietf.org/html/rfc3629 --- a sequence
> beginning with ED must have a second byte in the range 80-9F to be
> legal, and this doesn't. The example you give would decode as U+DF2D,
> ie part of a surrogate pair, which is specifically disallowed in UTF8
> --- you're supposed to code the original character directly, not via a
> surrogate pair. The primary reason for this rule is that otherwise
> there are multiple ways to encode the same character, which can be a
> security hazard.
>
>
Thanks for the explanation. Unicode has always given me a hard time.
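
To make the quoted rule concrete for myself, here is a minimal Python sketch. The sample bytes ED BC AD are my own hypothetical choice (picked because they decode to U+DF2D under the naive 3-byte formula, not copied from the original report), and the helper names `naive_decode_3byte` and `is_legal_ed_sequence` are made up for illustration. It only covers the ED case from RFC 3629 and is not a full UTF-8 validator.

```python
def naive_decode_3byte(seq: bytes) -> int:
    """Apply the 3-byte UTF-8 bit layout with no RFC 3629 range checks."""
    b0, b1, b2 = seq
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_legal_ed_sequence(seq: bytes) -> bool:
    """RFC 3629 rule for lead byte ED: second byte 80-9F, third byte 80-BF."""
    return (
        len(seq) == 3
        and seq[0] == 0xED
        and 0x80 <= seq[1] <= 0x9F
        and 0x80 <= seq[2] <= 0xBF
    )


# Hypothetical sample: ED followed by a second byte outside 80-9F.
sample = bytes([0xED, 0xBC, 0xAD])

code_point = naive_decode_3byte(sample)
is_surrogate = 0xD800 <= code_point <= 0xDFFF

# Python's strict UTF-8 decoder applies the same restriction.
try:
    sample.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(f"U+{code_point:04X} surrogate={is_surrogate} "
      f"legal={is_legal_ed_sequence(sample)} strict_ok={strict_ok}")
```

The second-byte limit of 9F is what keeps ED sequences below U+D800, so the surrogate range can never be reached by a legal sequence.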
>
> You should file bugs against those tools.
>
I certainly will. I apologize for filing the bug against postgres (I
suppose the "voting" method of figuring out which piece of software is the
buggy one has failed me).

I've run into a fair number of Unicode errors when trying to copy in log
files. Would you recommend using bytea or another data type instead of text
or varchar, or at least copying into a staging table with bytea columns and
filtering out invalid rows when moving the data to the main table?
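
For comparison, the filtering step could also happen on the client before the data reaches COPY. The following is only a rough Python sketch of that idea, assuming the log is newline-delimited and that each line can be judged independently; `split_valid_utf8_lines` is a hypothetical helper name, and the sample lines are invented. It is an alternative to the bytea staging table, not something suggested elsewhere in this thread.

```python
from typing import Iterable, List, Tuple


def split_valid_utf8_lines(
    raw_lines: Iterable[bytes],
) -> Tuple[List[str], List[bytes]]:
    """Separate lines that decode as strict UTF-8 from lines that do not."""
    valid: List[str] = []
    rejected: List[bytes] = []
    for line in raw_lines:
        try:
            valid.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            # Keep the raw bytes so the bad rows can be inspected later.
            rejected.append(line)
    return valid, rejected


# Invented sample data: two clean lines and one with an encoded surrogate.
raw = [
    b"GET /index.html 200",
    b"caf\xc3\xa9",
    b"bad \xed\xbc\xad bytes",
]
valid, rejected = split_valid_utf8_lines(raw)
print(len(valid), "valid,", len(rejected), "rejected")
```

Keeping the rejected lines as raw bytes preserves them for later inspection, which is roughly what a bytea staging column would do on the server side.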