Re: Update Unicode data to Unicode 16.0.0

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Joe Conway <mail(at)joeconway(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Update Unicode data to Unicode 16.0.0
Date:	2025-03-21 14:45:47
Message-ID:	CA+TgmoahWh6zxBFUygOUwrdkGogp47ZVoa8GVkxEORRc7+EwWA@mail.gmail.com
Lists:	pgsql-hackers

On Fri, Mar 21, 2025 at 2:45 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Thu, 2025-03-20 at 08:45 -0400, Robert Haas wrote:
> > * When the collation/ctype/whatever definitions upon which you are
> > relying change, you can either decide to switch to the new ones
> > without rebuilding your indexes and risk wrong results until you
> > reindex, or you can decide to create new indexes using the new
> > definitions and drop the old ones.
>
> Would newly-created objects pick up the new Unicode version, or stick
> with the old one?

Hmm, I hadn't thought about that. I'm assuming that the Unicode
version would need, in this scheme, to be coupled to the object that
depends on it. For example, an index that uses a Unicode collation
would need to store a Unicode version. But for a new index, how would
that be set? Maybe the Unicode version would be treated as part of the
collation. I'm guessing that an index defaults to the column
collation, and I think the column collation defaults to the database
default collation. We might need a way for ALTER DATABASE to allow the
database default to be adjusted. I'm not quite sure here, but my
general feeling is that Unicode version feels like part of the
collation and that we should avoid introducing a separate mechanism if
possible. What are your thoughts?

> Supporting built-in natural language sort orders would be a much larger
> scope. And I don't think we need that, but that's a larger discussion.

I'm curious why you think this. My own feeling (as I think you
probably know, but just to be clear) is that relatively few people
need extremely precise control over their collation behavior, but
there are some who do. However, I think there are many people for whom
a code-point sort won't be good enough. If you want to leave this
discussion for another time, that's fine.

> What if we were able to tell, for instance, that your database has none
> of the code points affected by the most recent update? Then updating
> would be less risky than not updating: if you don't update Unicode,
> then the code points could end up in the database treated as
> unassigned, and then cause a problem for future updates.

The problem with this is that it requires scanning the whole database.
That's not to say it's useless. Some people can afford to scan the
whole database, and some people might even WANT to scan the whole
database just to give themselves peace of mind. But there are also
plenty of people for whom this is a major downside, or even a deal-breaker. I'd
like to have a solution that is based on metadata.
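
To make the cost of the scanning approach concrete, here is a rough sketch in Python, not PostgreSQL code. It assumes the stored text values are available as an iterable of strings, and it uses Python's bundled Unicode 3.2.0 tables (unicodedata.ucd_3_2_0) and the interpreter's current tables purely as stand-ins for an "old" and a "new" Unicode version. The function name newly_assigned is hypothetical.

```python
import unicodedata

# Stand-ins for the two Unicode versions being compared (an assumption for
# illustration; a real system would compare its own old and new tables).
OLD = unicodedata.ucd_3_2_0
NEW = unicodedata


def newly_assigned(values, old=OLD, new=NEW):
    """Return the code points in `values` that are unassigned in `old`
    (general category "Cn") but assigned in `new`."""
    found = set()
    for value in values:          # every stored value must be visited
        for ch in value:          # and every character in it
            if old.category(ch) == "Cn" and new.category(ch) != "Cn":
                found.add(ch)
    return found


rows = ["plain ascii", "caf\u00e9", "smile \U0001F600"]
affected = newly_assigned(rows)
```

The loop has to touch every character of every value, which is exactly the cost being objected to here: the answer comes from the data rather than from metadata.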

Maybe we should actually move in the direction of having encodings
that are essentially specific versions of Unicode. Instead of just
having UTF-8 that accepts whatever, you could have UTF-8.v16.0.0 or
whatever, which would only accept code points known to that version of
Unicode. Or maybe this shouldn't be entirely new encodings but
something vaguely akin to a typmod, so that you could have columns of
type text[limited_to_unicode_v16_0_0] or whatever. If we actually
exclude unassigned code points, then we know they aren't there, and we
can make deductions about what is safe to do based on that
information. I'm not quite sure how useful that is, but I tend to
think that enforcing rules when the data goes in has a decent shot at
being better than letting anything go in and then having to scan it
later to see how it all turned out.
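
As a loose illustration of enforcing the rule on the way in, the following Python sketch rejects any string containing a code point that is unassigned (general category "Cn") in a chosen set of Unicode tables. It is not a proposal for how PostgreSQL would implement this. The function name check_input is hypothetical, and the tables available are whatever the Python interpreter ships (its current version plus the legacy 3.2.0 tables), not necessarily Unicode 16.0.0.

```python
import unicodedata


def check_input(text, ucd=unicodedata):
    """Return `text` unchanged if every code point is assigned in `ucd`;
    otherwise raise ValueError naming the first offending code point."""
    for ch in text:
        if ucd.category(ch) == "Cn":
            raise ValueError(
                f"U+{ord(ch):04X} is not assigned in Unicode "
                f"{ucd.unidata_version}"
            )
    return text


# Accepted under the interpreter's current tables:
check_input("smile \U0001F600")

# Rejected when the check is pinned to the older 3.2.0 tables:
try:
    check_input("smile \U0001F600", ucd=unicodedata.ucd_3_2_0)
    rejected = False
except ValueError:
    rejected = True
```

Because the check runs at input time, a column restricted this way is known not to contain code points that a later Unicode version might assign, without any later scan.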

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-21 06:45:10 from Jeff Davis

Responses

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-21 20:45:24 from Jeff Davis
