
Re: The "char" type versus non-ASCII characters

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Chapman Flack <chap(at)anastigmatix(dot)net>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: The "char" type versus non-ASCII characters
Date:	2021-12-03 19:35:03
Message-ID:	c44b31d4-044a-0e45-1a98-995517b47df7@dunslane.net
Lists:	pgsql-hackers

On 12/3/21 14:12, Tom Lane wrote:
> [ breaking off a different new thread ]
>
> Chapman Flack <chap(at)anastigmatix(dot)net> writes:
>> Then there's "char". It's category S, but does not apply the server
>> encoding. You could call it an 8-bit int type, but it's typically used
>> as a character, making it well-defined for ASCII values and not so
>> for others, just like SQL_ASCII encoding. You could as well say that
>> the "char" type has a defined encoding of SQL_ASCII at all times,
>> regardless of the database encoding.
> This reminds me of something I've been intending to bring up, which
> is that the "char" type is not very encoding-safe. charout() for
> example just regurgitates the single byte as-is. I think we deemed
> that okay the last time anyone thought about it, but that was when
> single-byte encodings were the mainstream usage for non-ASCII data.
> If you're using UTF8 or another multi-byte server encoding, it's
> quite easy to get an invalidly-encoded string this way, which at
> minimum is going to break dump/restore scenarios.
>
> I can think of at least three ways we might address this:
>
> * Forbid all non-ASCII values for type "char". This results in
> simple and portable semantics, but it might break usages that
> work okay today.
>
> * Allow such values only in single-byte server encodings. This
> is a bit messy, but it wouldn't break any cases that are not
> problematic already.
>
> * Continue to allow non-ASCII values, but change charin/charout,
> char_text, etc so that the external representation is encoding-safe
> (perhaps make it an octal or decimal number).
>
> Either of the first two ways would have to contemplate what to do
> with disallowed values that snuck into the DB via pg_upgrade.
> That leads me to think that the third way might be the most
> preferable, even though it's not terribly backward-compatible.
>

I don't like #2. Is #3 going to change the external representation only
for non-ASCII values? If so, that seems OK. Changing it for ASCII
values seems ugly. #1 is the simplest to implement and to understand,
and I suspect it would break very little in practice, but others might
disagree with that assessment.
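
To make the variant of #3 asked about above concrete, here is a minimal sketch in C, assuming ASCII bytes keep their current one-character form and only bytes with the high bit set get a numeric form. The function name char_to_external and the backslash-plus-three-octal-digits format are assumptions chosen for illustration (the proposal only says "perhaps make it an octal or decimal number"); this is not the existing charout() code.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper, NOT PostgreSQL's charout().
 *
 * Writes an external text form of one "char" byte into buf:
 *   - bytes below 0x80 are emitted unchanged, as a single character
 *     (a zero byte therefore yields an empty string);
 *   - bytes with the high bit set are emitted as a backslash followed
 *     by three octal digits, so the result is always pure ASCII.
 *
 * The caller must supply a buffer of at least 5 bytes.
 */
static void
char_to_external(unsigned char ch, char *buf, size_t buflen)
{
    if (ch < 0x80)
        snprintf(buf, buflen, "%c", ch);
    else
        snprintf(buf, buflen, "\\%03o", (unsigned int) ch);
}
```

With this sketch, the byte 'a' still comes out as "a", while the byte 0xE9 comes out as the four ASCII characters \351 instead of a lone byte that is not valid UTF8 on its own.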

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

In response to

The "char" type versus non-ASCII characters at 2021-12-03 19:12:10 from Tom Lane

Responses

Re: The "char" type versus non-ASCII characters at 2021-12-03 19:42:11 from Tom Lane
