From: | "John Hansen" <john(at)geeknet(dot)com(dot)au> |
---|---|
To: | <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Unicode problems on IRC |
Date: | 2005-04-10 21:34:00 |
Message-ID: | 5066E5A966339E42AA04BA10BA706AE5628E@rodrick.geeknet.com.au |
Lists: | pgsql-hackers |
>On 2005-04-10, Tom Lane <tgl ( at ) sss ( dot ) pgh ( dot ) pa ( dot ) us> wrote:
>> Andrew - Supernews <andrew+nonews ( at ) supernews ( dot ) com> writes:
>>> I think you will find that this impression is actually false. Or that at
>>> the very least, _correct_ verification of UTF-8 sequences will still
>>> catch essentially all cases of non-utf-8 input mislabelled as utf-8
>>> while allowing the full range of Unicode codepoints.
>>
>> Yeah? Cool. Does John's proposed patch do it "correctly"?
>>
>> http://candle.pha.pa.us/mhonarc/patches2/msg00076.html
>
>It looks correct to me. The only thing I think that code will let through
>incorrectly are encoded surrogates; those could be fixed by adding one line:
>
> switch (*source) {
> /* no fall-through in this inner switch */
> case 0xE0: if (a < 0xA0) return false; break;
>+ case 0xED: if (a > 0x9F) return false; break;
> case 0xF0: if (a < 0x90) return false; break;
> case 0xF4: if (a > 0x8F) return false; break;
>
That's right. I don't know how I missed that one, but the added line looks
correct to me and is in line with the code in ConvertUTF.c from unicode.org,
on which I based the patch (extended to support 6-byte UTF-8 characters).
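
For illustration, the second-byte check in the quoted switch can be pulled out
as a standalone function. This is only a minimal sketch: the name
utf8_second_byte_ok is hypothetical and does not appear in the patch, and it
covers only the lead bytes listed in the quoted excerpt, not the 5- and 6-byte
extensions.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not taken from the patch: returns true if `a` is an
 * acceptable second byte after the UTF-8 lead byte `lead`.
 */
static bool utf8_second_byte_ok(unsigned char lead, unsigned char a)
{
    /* Every continuation byte must lie in 0x80..0xBF. */
    if (a < 0x80 || a > 0xBF)
        return false;

    switch (lead) {
    case 0xE0: return a >= 0xA0;  /* below 0xA0 is an overlong 3-byte form */
    case 0xED: return a <= 0x9F;  /* above 0x9F encodes a surrogate, 0xD800 - 0xDFFF */
    case 0xF0: return a >= 0x90;  /* below 0x90 is an overlong 4-byte form */
    case 0xF4: return a <= 0x8F;  /* above 0x8F is beyond 0x10FFFF */
    default:   return true;       /* no extra restriction for other lead bytes */
    }
}
```

With the 0xED case in place, ED A0 80 (an encoded 0xD800) is rejected at the
second byte, while ED 9F BF (0xD7FF) still passes.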
>(Accepting encoded surrogates in utf-8 was always forbidden by most
>specifications that used utf-8, though the Unicode specs originally were
>not absolute about it (but forbade generating them). Current Unicode
>specifications define those sequences as malformed. Surrogates are the
>code points from 0xD800 - 0xDFFF, which are used in UTF-16 to encode
>characters 0x10000 - 0x10FFFF as two 16-bit values; UTF-8 requires that
>such characters are encoded directly rather than via surrogate pairs.)
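
To make the distinction above concrete, here is a rough sketch of the two
encodings of a character in the 0x10000 - 0x10FFFF range. The function names
to_surrogates and to_utf8_4 are hypothetical, chosen for illustration only, and
both assume the caller passes a code point inside that range.

```c
#include <stdint.h>

/* Hypothetical helper: split cp (0x10000..0x10FFFF) into a UTF-16 surrogate pair. */
static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    uint32_t v = cp - 0x10000;               /* leaves a 20-bit value */
    *hi = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *lo = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}

/* Hypothetical helper: encode cp (0x10000..0x10FFFF) directly as 4 UTF-8 bytes. */
static void to_utf8_4(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
}
```

For 0x10000 the surrogate pair is D800 DC00, and the direct UTF-8 form is
F0 90 80 80. Encoding each surrogate separately as a 3-byte sequence would
give ED A0 80 ED B0 80, which is the malformed form the extra 0xED check
rejects.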
>
>--
>Andrew, Supernews
>http://www.supernews.com - individual and corporate NNTP services
... John