Re: Pre-proposal: unicode normalized text

From:	Nico Williams <nico(at)cryptonector(dot)com>
To:	Chapman Flack <chap(at)anastigmatix(dot)net>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Pre-proposal: unicode normalized text
Date:	2023-10-04 22:15:47
Message-ID:	ZR3kE0didxGGfSif@ubby21
Lists:	pgsql-hackers

On Wed, Oct 04, 2023 at 05:32:50PM -0400, Chapman Flack wrote:
> Well, for what reason does anybody run PG now with the encoding set
> to anything besides UTF-8? I don't really have my finger on that pulse.

Because they still have databases that didn't use UTF-8 10 or 20 years
ago that they haven't migrated to UTF-8?

It's harder to think of why one might _want_ to store text in any
encoding other than UTF-8 for _new_ databases.

That said, there's no reason it should be impossible, other than lack
of developer interest: as long as text is tagged with its encoding, it
should be possible to store text in any number of encodings.
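
As a rough illustration of "text tagged with its encoding", here is a
minimal Python sketch. The TaggedText type and its methods are
hypothetical names made up for this example; this is not how PostgreSQL
stores text today, only one way the idea could look.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedText:
    """Hypothetical value type: raw bytes plus the name of their encoding."""
    encoding: str  # a Python codec name, e.g. "utf-8" or "shift_jis"
    data: bytes

    def to_str(self) -> str:
        # Decode using the tag; raises UnicodeDecodeError if the tag is wrong.
        return self.data.decode(self.encoding)

    def transcode(self, target: str) -> "TaggedText":
        # Re-encode into another encoding; raises UnicodeEncodeError if the
        # target cannot represent some character.
        return TaggedText(target, self.to_str().encode(target))


def same_text(a: TaggedText, b: TaggedText) -> bool:
    # Compare decoded characters rather than raw bytes.
    return a.to_str() == b.to_str()


sjis = TaggedText("shift_jis", "日本語".encode("shift_jis"))
utf8 = sjis.transcode("utf-8")
print(same_text(sjis, utf8), sjis.data == utf8.data)
```

The two values hold different bytes but compare equal as text, which is
the property the tag makes possible.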

> Could it be that it bloats common strings in their local script, and
> with enough of those to store, it could matter to use the local
> encoding that stores them more economically?

UTF-8 bloat is not likely worth the trouble. UTF-8 is only clearly
bloaty when compared to encodings with 1-byte code units, like
ISO-8859-*. For CJK, UTF-8 is not much more bloaty than native
non-Unicode encodings like SHIFT_JIS.

UTF-8 is not much bloatier than UTF-16 in general either.
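
One way to put numbers on this is to count encoded bytes directly. The
Python sketch below does that for two short sample strings; the samples
and the encoded_sizes helper are assumptions chosen for illustration,
and real workloads (with ASCII digits, punctuation, or markup mixed in)
will give different ratios.

```python
def encoded_sizes(text, encodings):
    """Return {encoding name: byte length}, or None where text can't be encoded."""
    sizes = {}
    for enc in encodings:
        try:
            sizes[enc] = len(text.encode(enc))
        except UnicodeEncodeError:
            sizes[enc] = None
    return sizes


# Illustrative samples only: a Latin-1 word and three kanji.
latin = encoded_sizes("café", ["iso-8859-1", "utf-8", "utf-16-le"])
japanese = encoded_sizes("日本語", ["shift_jis", "utf-8", "utf-16-le"])

print(latin)     # {'iso-8859-1': 4, 'utf-8': 5, 'utf-16-le': 8}
print(japanese)  # {'shift_jis': 6, 'utf-8': 9, 'utf-16-le': 6}
```

For these samples, UTF-8 costs one extra byte per accented Latin letter
and three bytes per kanji where SHIFT_JIS and UTF-16 use two. ASCII
characters stay at one byte in UTF-8, so mixed text narrows that gap.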

Bloat is not really a good reason to avoid Unicode or any specific
Unicode transformation format.

> Also, while any Unicode transfer format can encode any Unicode code
> point, I'm unsure whether it's yet the case that {any Unicode code
> point} is a superset of every character repertoire associated with
> every non-Unicode encoding.

It hasn't always been the case that Unicode is a strict superset of all
currently-in-use human scripts, but making it one seems to be the
Unicode Consortium's aim.

I think you're asking why not just use UTF-8 for everything, all the
time. It's a fair question. I don't have a reason to answer in the
negative (maybe someone else does). But that doesn't mean that one
couldn't want to store text in many encodings (e.g., for historical
reasons).

Nico
--

In response to

Re: Pre-proposal: unicode normalized text at 2023-10-04 21:32:50 from Chapman Flack
