Re: Unicode string literals versus the world

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Sam Mason <sam(at)samason(dot)me(dot)uk>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unicode string literals versus the world
Date:	2009-04-16 15:34:27
Message-ID:	49E75003.9050003@dunslane.net
Lists:	pgsql-hackers

Tom Lane wrote:
> Sam Mason <sam(at)samason(dot)me(dot)uk> writes:
>
>> I'd never heard of UTF-16 surrogate pairs before this discussion and
>> hence didn't realise that it's valid to have a surrogate pair in place
>> of a single code point. The docs say that <D800 DF02> corresponds to
>> U+10302, Python would appear to follow my intuitions in that:
>>
>> ord(u'\uD800\uDF02')
>>
>> results in an error instead of giving back 66306, as I'd expect. Is
>> this a bug in Python, my understanding, or something else?
>>
>
> I might be wrong, but I think surrogate pairs are expressly forbidden in
> all representations other than UTF16/UCS2. We definitely forbid them
> when validating UTF-8 strings --- that's per an RFC recommendation.
> It sounds like Python is doing the same.

You mustn't encode the surrogate, but it's up to us how we allow people
to designate a given code point.
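
To illustrate the distinction between designating a code point and encoding it, here is a minimal Python 3 sketch. The helper names (combine_surrogates, lone_surrogate_is_rejected) are made up for this example and are not from PostgreSQL or any library. It applies the standard UTF-16 arithmetic to the pair <D800 DF02> from the quoted text, then shows that the resulting code point is what gets encoded, while a bare surrogate is refused by Python's UTF-8 codec.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 surrogate pair to the code point it designates.

    Hypothetical helper for illustration only.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate carries 10 bits of the offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the quoted text designates U+10302.
code_point = combine_surrogates(0xD800, 0xDF02)

# What is encoded is the code point itself, never the two surrogates.
utf8_bytes = chr(code_point).encode("utf-8")


def lone_surrogate_is_rejected() -> bool:
    """Check that Python 3's UTF-8 codec refuses a bare surrogate."""
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False
```

Here code_point comes out as 0x10302 (66306 decimal), the value Sam expected from ord().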

Frankly, I think we shouldn't provide for using surrogates at all. I
would prefer something like \uXXXX for BMP items and \UXXXXXXXX as the
straight 32-bit designation of a higher code point.
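
A rough sketch of what that could look like, in Python 3 and purely for illustration: decode_escapes and its regular expression are hypothetical names, not an existing PostgreSQL function, and the sketch assumes the simplest possible rules (exactly four hex digits after \u, exactly eight after \U, surrogate values rejected outright, no handling of escaped backslashes).

```python
import re

# \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits); hypothetical syntax sketch.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the code points they name."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1) or match.group(2), 16)
        # No provision for surrogates: they never designate a character here.
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"surrogate U+{code_point:04X} is not allowed")
        if code_point > 0x10FFFF:
            raise ValueError(f"U+{code_point:08X} is outside the Unicode range")
        return chr(code_point)

    return _ESCAPE.sub(replace, text)


bmp_char = decode_escapes(r"\u00e9")        # a BMP character
astral_char = decode_escapes(r"\U00010302")  # a character beyond the BMP
```

With these rules, a literal such as \uD800\uDF02 raises an error, and U+10302 has to be written as \U00010302.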

cheers

andrew

In response to

Re: Unicode string literals versus the world at 2009-04-16 14:54:16 from Tom Lane
