From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Chapman Flack <chap(at)anastigmatix(dot)net> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: The "char" type versus non-ASCII characters |
Date: | 2021-12-03 19:35:03 |
Message-ID: | c44b31d4-044a-0e45-1a98-995517b47df7@dunslane.net |
Lists: | pgsql-hackers |
On 12/3/21 14:12, Tom Lane wrote:
> [ breaking off a different new thread ]
>
> Chapman Flack <chap(at)anastigmatix(dot)net> writes:
>> Then there's "char". It's category S, but does not apply the server
>> encoding. You could call it an 8-bit int type, but it's typically used
>> as a character, making it well-defined for ASCII values and not so
>> for others, just like SQL_ASCII encoding. You could as well say that
>> the "char" type has a defined encoding of SQL_ASCII at all times,
>> regardless of the database encoding.
> This reminds me of something I've been intending to bring up, which
> is that the "char" type is not very encoding-safe. charout() for
> example just regurgitates the single byte as-is. I think we deemed
> that okay the last time anyone thought about it, but that was when
> single-byte encodings were the mainstream usage for non-ASCII data.
> If you're using UTF8 or another multi-byte server encoding, it's
> quite easy to get an invalidly-encoded string this way, which at
> minimum is going to break dump/restore scenarios.
>
> I can think of at least three ways we might address this:
>
> * Forbid all non-ASCII values for type "char". This results in
> simple and portable semantics, but it might break usages that
> work okay today.
>
> * Allow such values only in single-byte server encodings. This
> is a bit messy, but it wouldn't break any cases that are not
> problematic already.
>
> * Continue to allow non-ASCII values, but change charin/charout,
> char_text, etc so that the external representation is encoding-safe
> (perhaps make it an octal or decimal number).
>
> Either of the first two ways would have to contemplate what to do
> with disallowed values that snuck into the DB via pg_upgrade.
> That leads me to think that the third way might be the most
> preferable, even though it's not terribly backward-compatible.
>
I don't like #2. Is #3 going to change the external representation only
for non-ASCII values? If so, that seems OK. Changing it for ASCII
values seems ugly. #1 is the simplest to implement and to understand,
and I suspect it would break very little in practice, but others might
disagree with that assessment.
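
To make the "only non-ASCII values change" variant of #3 concrete, here is a minimal sketch in Python rather than the backend's C. The names char_out, char_in and is_valid_utf8 are hypothetical, and the backslash-plus-three-octal-digits form is just one of the representations Tom floated (octal or decimal), not a settled design. ASCII bytes are emitted unchanged; bytes with the high bit set are emitted as pure-ASCII escapes, so the output is valid in any server encoding.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def char_out(b: int) -> str:
    """Hypothetical text output for a one-byte "char" value (0..255).

    ASCII values keep their current single-character form; non-ASCII
    values become a backslash followed by three octal digits.
    """
    if not 0 <= b <= 0xFF:
        raise ValueError("value does not fit in one byte")
    if b < 0x80:
        return chr(b)
    return "\\" + format(b, "03o")


def char_in(s: str) -> int:
    """Hypothetical inverse of char_out."""
    if len(s) == 4 and s[0] == "\\" and all(c in "01234567" for c in s[1:]):
        value = int(s[1:], 8)
        if value > 0xFF:
            raise ValueError("octal escape out of range")
        return value
    if len(s) == 1 and ord(s) < 0x80:
        return ord(s)
    raise ValueError("not a valid one-byte value")


# The raw byte 0xE9 alone is not valid UTF-8, which is the dump/restore hazard.
print(is_valid_utf8(bytes([0xE9])))
# The escaped form is plain ASCII, hence valid in any server encoding.
print(char_out(0xE9), is_valid_utf8(char_out(0xE9).encode("ascii")))
```

In this sketch existing ASCII output is untouched, and every byte value round-trips through char_out and char_in.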
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com