From: Chapman Flack <chap(at)anastigmatix(dot)net>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Nico Williams <nico(at)cryptonector(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pre-proposal: unicode normalized text
Date: 2023-10-04 21:32:50
Message-ID: 02d05bc98ca5b2d8ab38fec5fe5b7625@anastigmatix.net
Lists: pgsql-hackers
On 2023-10-04 16:38, Jeff Davis wrote:
> On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
>> The SQL standard would have me able to:
>>
>> CREATE TABLE foo (
>> a CHARACTER VARYING CHARACTER SET UTF8,
>> b CHARACTER VARYING CHARACTER SET LATIN1
>> )
>>
>> and so on
>
> Is there a use case for that? UTF-8 is able to encode any unicode code
> point, it's relatively compact, and it's backwards-compatible with 7-
> bit ASCII. If you have a variety of text data in your system (and in
> many cases even if not), then UTF-8 seems like the right solution.
Well, for what reason does anybody run PG now with the encoding set
to anything besides UTF-8? I don't really have my finger on that pulse.
Could it be that UTF-8 bloats common strings in their local script,
so that with enough of those to store, it matters to use a local
encoding that stores them more economically?
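The bloat intuition is easy to check. A minimal sketch (the sample
string and the EUC_JP comparison are my own illustration, not from
the thread):

```python
# Compare the storage cost of the same Japanese string under UTF-8
# and under the legacy EUC_JP encoding (a common pre-Unicode choice
# for Japanese-locale databases).
s = "日本語のテキスト"  # "Japanese text", 8 characters

utf8_len = len(s.encode("utf-8"))    # 3 bytes per character here -> 24
eucjp_len = len(s.encode("euc_jp"))  # 2 bytes per character     -> 16

print(utf8_len, eucjp_len)
```

So for text dominated by such characters, UTF-8 costs roughly 50%
more than the local encoding, which is at least one plausible reason
to keep running with a non-UTF-8 server encoding.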
Also, while any Unicode transformation format can encode any Unicode
code point, I'm unsure whether it's yet the case that {any Unicode
code point} is a superset of every character repertoire associated
with every non-Unicode encoding.
The cheap glaring counterexample is SQL_ASCII. Half those code points
are *nobody knows what Unicode character* (or even *whether*). I'm not
insisting that's a good thing, but it is a thing.
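To make the SQL_ASCII point concrete: bytes 0x80 through 0xFF are
stored uninterpreted, so the same byte means different characters
depending on what the client "really" meant, and means nothing at
all under strict 7-bit ASCII. A small sketch (the byte value is my
own illustrative pick):

```python
# One high byte, three readings: SQL_ASCII stores 0xE9 without
# assigning it any Unicode identity.
b = bytes([0xE9])

print(b.decode("latin-1"))  # 'é' if the client meant LATIN1
print(b.decode("koi8_r"))   # 'И' if the client meant KOI8-R
try:
    b.decode("ascii")       # no defined mapping under 7-bit ASCII
except UnicodeDecodeError:
    print("nobody knows what Unicode character, or even whether")
```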
It might be a very tidy future to say all text is Unicode and all
server encodings are UTF-8, but I'm not sure it wouldn't still
be a good step on the way to be able to store some things in
their own encodings. We have JSON and XML now, two data types
that are *formally defined* to accept any Unicode content, and
we hedge and mumble and say (well, as long as it goes in the
server encoding) and that makes me sad. Things like that should
be easy to handle even without declaring UTF-8 as a server-wide
encoding ... they already are their own distinct data types, and
could conceivably know their own encodings.
But there again, it's possible that going with unconditional
UTF-8 for JSON or XML documents could, in some regions, bloat them.
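The hedge-and-mumble problem is easy to demonstrate. A sketch of how
a perfectly valid JSON document can be unstorable under a non-UTF-8
server encoding (the document content is my own example):

```python
import json

# JSON is formally defined over the full Unicode repertoire...
doc = json.dumps({"name": "Дмитрий"}, ensure_ascii=False)

doc.encode("utf-8")  # fine: UTF-8 covers all of Unicode
try:
    doc.encode("latin-1")  # Cyrillic is outside the LATIN1 repertoire
except UnicodeEncodeError:
    print("valid JSON, but not storable under a LATIN1 server encoding")
```

A JSON or XML type that knew its own encoding could accept that
document regardless of what the rest of the server uses.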
Regards,
-Chap