Re: Bug in UTF8-Validation Code?

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Mark Dilger <pgsql(at)markdilger(dot)com>
Cc:	Martijn van Oosterhout <kleptog(at)svana(dot)org>, Albe Laurenz <all(at)adv(dot)magwien(dot)gv(dot)at>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Bug in UTF8-Validation Code?
Date:	2007-04-03 17:06:38
Message-ID:	9805.1175619998@sss.pgh.pa.us
Lists:	pgsql-hackers

Mark Dilger <pgsql(at)markdilger(dot)com> writes:
> Martijn van Oosterhout wrote:
>> Just about every multibyte encoding other than Unicode has the problem
>> of not distinguishing between the code point and the encoding of it.

> Thanks for the feedback. Would you say that the way I implemented things in the
> example code would be correct for multibyte non Unicode encodings?

I think it's probably defensible for non-Unicode encodings. To do
otherwise would require (a) figuring out what the equivalent concept to
"code point" is for each encoding, and (b) having a separate code path
for each encoding to perform the mapping. It's not clear that there
even is an answer to (a), and (b) seems like more work than chr() is
worth. But we know what the right way is for Unicode, so we should
special-case that one.

Note the points made that in all cases ascii() and chr() should be
inverses, and that you shouldn't just fall back to the old behavior
in SQL_ASCII encoding. (My vote for SQL_ASCII would be to reject
values > 255.)

regards, tom lane
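
The behavior proposed in the message above can be sketched roughly as follows. This is an illustrative Python model, not the PostgreSQL implementation or Mark Dilger's example code: the names `chr_sketch` and `ascii_sketch` are hypothetical, the rejection of zero and of surrogate code points is an assumption, and treating the integer as the raw big-endian byte sequence for other multibyte encodings is one plausible reading of "not distinguishing between the code point and the encoding of it".

```python
def chr_sketch(value: int, encoding: str) -> bytes:
    """Hypothetical model of chr(): integer -> one encoded character."""
    if value < 1:
        # Assumption: zero and negative values are rejected.
        raise ValueError("value must be positive")
    if encoding == "UTF8":
        # Unicode: the integer is a code point, which is then encoded.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError("not a valid Unicode code point")
        return chr(value).encode("utf-8")
    if encoding == "SQL_ASCII":
        # Suggested in the message: reject values > 255.
        if value > 255:
            raise ValueError("value too large for SQL_ASCII")
        return bytes([value])
    # Assumption for other multibyte encodings: the integer *is* the
    # byte sequence, read as a big-endian number. No validity check here.
    length = (value.bit_length() + 7) // 8
    return value.to_bytes(length, "big")


def ascii_sketch(data: bytes, encoding: str) -> int:
    """Hypothetical model of ascii(): exactly one encoded character -> integer."""
    if not data:
        raise ValueError("expected one encoded character")
    if encoding == "UTF8":
        # ord() raises TypeError if data decodes to more than one character.
        return ord(data.decode("utf-8"))
    if encoding == "SQL_ASCII":
        if len(data) != 1:
            raise ValueError("expected a single byte")
        return data[0]
    # Inverse of the byte-sequence mapping used in chr_sketch.
    return int.from_bytes(data, "big")


# Round trips: ascii_sketch(chr_sketch(n)) == n in each branch.
for n, enc in [(0x20AC, "UTF8"), (65, "SQL_ASCII"), (0xA4A2, "EUC_JP")]:
    assert ascii_sketch(chr_sketch(n, enc), enc) == n
```

Under these assumptions the two functions are inverses in every encoding, and only the UTF8 branch distinguishes the code point (0x20AC) from its encoded bytes (E2 82 AC).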

In response to

Re: Bug in UTF8-Validation Code? at 2007-04-03 15:47:14 from Mark Dilger

Responses

Re: Bug in UTF8-Validation Code? at 2007-04-04 06:01:56 from Martijn van Oosterhout