
Re: Update Unicode data to Unicode 16.0.0

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Joe Conway <mail(at)joeconway(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Update Unicode data to Unicode 16.0.0
Date:	2025-03-19 12:46:10
Message-ID:	CA+Tgmoa7m6umcjnode1YO09gEHno1D_-V-+3VWmKyjLnXV7JDQ@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Mar 18, 2025 at 10:33 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> If we compare the following two problems:
>
> A. With glibc or ICU, every text index, including primary keys, is
> highly vulnerable to inconsistencies after an OS upgrade, even if
> there's no Postgres upgrade; vs.
>
> B. With the builtin provider, only expression indexes and a few other
> things are vulnerable, only during a major version upgrade, and mostly
> (but not entirely) when using recently-assigned Cased letters.
>
> To me, problem A seems about 100 times worse than B almost any way I
> can imagine measuring it: number of objects vulnerable, severity of the
> problem when it does happen, likelihood of a vulnerable object having
> an actual problem, etc. If you disagree, I'd like to hear more.

I see your point, but most people don't use the builtin collation
provider. Granted, we could change the default and then more people
would use it, but I'm not sure people would be happy with the
resulting behavior: a lot of people probably want "a" to sort near "á"
even if they don't have strong preferences about the exact details in
every corner case.
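
To make the sorting concern concrete, here is a minimal Python sketch
contrasting plain code point order (which, as I understand it, is how
the builtin provider orders text) with a crude "base letter first"
ordering. The strip_accents_key helper is hypothetical and is only a
rough stand-in for linguistic collation; it is not what ICU or glibc
actually implement.

```python
import unicodedata

# "\u00e1" is the precomposed letter a-with-acute.
words = ["b", "\u00e1", "a", "z"]

# Code point order: the accented letter lands after every ASCII letter.
by_code_point = sorted(words)


def strip_accents_key(s):
    """Hypothetical sort key: base letters first, original string as tiebreak."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, s)


# Approximate "linguistic" order: the accented letter sorts next to "a".
by_base_letter = sorted(words, key=strip_accents_key)

print(by_code_point)
print(by_base_letter)
```

In the first list the accented letter comes after "z"; in the second
it sits right after "a", which is roughly what I'd expect most users
to want.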

Also, and I think rather importantly, many people are less sensitive
to whether anything is actually broken than to whether anything
hypothetically could be broken. When an EDB customer asks "if I do X,
will anything break," it's often the case that answering "maybe" is
the same as answering "yes". The DBA doesn't necessarily know or care
what the application does or know or care what data is in the
database. They want a hard guarantee that the behavior will not
change. From that point of view, your statement that nothing will
change in minor releases when the builtin provider is used is quite
powerful (and a good argument against back-patching Unicode updates as
Tom proposes).

But people will still need to use other collation providers and they
will still need to do major release upgrades and they also want those
things to be guaranteed not to break. Again, I'm not trying to oblige
you to deliver that behavior and I confess to ignorance on how we
could realistically get there. But I do think it's what people want:
to be forced to endure collation updates infrequently, and to be able
to choose the timing of the update when they absolutely must happen,
and to be able to easily know exactly what they need to reindex.

And from that point of view -- and again, I'm not volunteering to
implement it and I'm not telling you to do it either -- Joe's proposal
of supporting multiple versions sounds fantastic. Because then, I can
do a major version upgrade using pg_upgrade and keep everything pinned
to the old Unicode version, or perhaps even the old ICU version if we
had multi-version libicu support. I may be able to go through several
major version upgrades without ever needing to survive a collation
change. Eventually my hand will be forced, because PostgreSQL will
remove support for the Unicode version I care about or that old
version of libicu won't compile any more or will have security
vulnerabilities or something, but I will have the option to deal with
that collation change before or after any PostgreSQL version changes
that I'm doing. I'll be able to change the collation version at a time
when I'm not changing anything else and deal with JUST that fallout on
its own.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-19 02:33:00 from Jeff Davis

Responses

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-19 17:39:29 from Jeff Davis
