
Re: Update Unicode data to Unicode 16.0.0

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Joe Conway <mail(at)joeconway(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Update Unicode data to Unicode 16.0.0
Date:	2025-03-19 18:33:43
Message-ID:	CA+TgmoYmT90FNueeedVJhm9MO_XH-HPgtyZwvQFzhTSHfmTSTQ@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Mar 19, 2025 at 1:39 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Wed, 2025-03-19 at 08:46 -0400, Robert Haas wrote:
> > I see your point, but most people don't use the builtin collation
> > provider.
>
> The other providers aren't affected by us updating Unicode, so I think
> we got off track somehow. I suppose what I meant was:
>
> "If you are concerned about inconsistencies, and you move to the
> builtin provider, then 99% of the inconsistency problem is gone. We can
> remove the last 1% of the problem if we do all the work listed above."

All right. I'm not sure I totally buy the 99% number, but I take your point.

> > When an EDB customer asks "if I do X,
> > will anything break," it's often the case that answering "maybe" is
> > the same as answering "yes".
>
> That's a good point. However, note that "doesn't break primary keys" is
> a nice guarantee, even if there are still some remaining doubts about
> expression indexes, etc.

No argument.

> > They want a hard guarantee that the behavior will not
> > change.
>
> My understanding of this thread so far was that we were mostly
> concerned about internal inconsistencies of stored structures; e.g.
> indexes that could return different results than a seqscan.

I think that is true, but inconsistent indexes can be the worst
problem without being the only one.

> Not changing query results at all between major versions is a valid
> concern, but a fairly strict one that doesn't seem limited to immutable
> functions or collation issues. Surely, at least the results of "SELECT
> version()" should change from release to release ;-)

Maybe we should stop doing releases, and then users won't have to
worry about our releases breaking things!

Slightly more seriously, the use of UPPER() and LOWER() in expression
indexes is not that uncommon. Sometimes, the index exists specifically
to enforce a unique constraint. Yes, plain indexes on columns are more
common, and it makes sense to target that case first, but we shouldn't
too quickly hand-wave away the use of case-folding functions as a
thing that doesn't happen.
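
To illustrate the concern, here is a minimal toy sketch (not PostgreSQL code) of a
unique expression index on lower(name) that is built under one case mapping and
then probed under another. Everything in it is an assumption made for
illustration: the private-use code points U+E000 and U+E001 stand in for a
character whose lowercase mapping changes between two Unicode versions, and the
dict stands in for the index.

```python
# Toy model of a unique expression index on lower(name).
# The "old" and "new" case mappings are hypothetical; U+E000/U+E001 are
# private-use code points chosen so no real Unicode behavior is implied.

def make_lower(overrides):
    """Return a lower() that applies str.lower() plus per-character overrides."""
    def lower(s):
        return "".join(overrides.get(ch, ch.lower()) for ch in s)
    return lower

old_lower = make_lower({})                     # U+E000 is left unchanged
new_lower = make_lower({"\ue000": "\ue001"})   # U+E000 now folds to U+E001

def build_index(rows, lower):
    """Build a unique index keyed by lower(row); reject duplicate keys."""
    index = {}
    for row in rows:
        key = lower(row)
        if key in index:
            raise ValueError(f"duplicate key {key!r}")
        index[key] = row
    return index

def index_scan(index, probe, lower):
    """Look up rows through the index using the current lower()."""
    hit = index.get(lower(probe))
    return [hit] if hit is not None else []

def seq_scan(rows, probe, lower):
    """Evaluate lower(row) == lower(probe) against every row."""
    return [r for r in rows if lower(r) == lower(probe)]

rows = ["X\ue000", "X\ue001"]
index = build_index(rows, old_lower)   # both rows are distinct under the old mapping

# After the mapping changes, the stale index and a sequential scan disagree.
via_index = index_scan(index, "X\ue000", new_lower)
via_seq = seq_scan(rows, "X\ue000", new_lower)
```

In this sketch the index scan finds only one row while the sequential scan finds
both, and rebuilding the index under the new mapping fails the uniqueness check.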

> I certainly don't oppose giving users that choice. But I view it as a
> burden we are placing on the users -- better than breakage, but not
> really great, either. So if we do put in a ton of work, I'd like it if
> we could arrive at a better destination.
>
> If we actually want the BEST user experience possible, they'd not even
> really know that their index was ever inconsistent. Autovacuum would
> come along and just find the few entries in the index that need fixing,
> and reindex just those few tuples. In theory, it should be possible:
> there are a finite number of codepoints that change each Unicode
> version, and we can just search for them in the data and fix up derived
> structures.

I have to disagree with this. I think this is a case where fixing
something automatically is clearly worse. First, it could never fix it
instantly, so you would be stuck with some window where queries might
return wrong results -- or if you prevent that by not using the
indexes any more until they're fixed, then it would instead cause huge
query performance regressions that could easily take down the whole
system. Second, one of the things people like least about autovacuum
is when it unexpectedly does a lot of work all at once. Today, that's
usually a vacuum for wrap-around, but suddenly trying to fix all my
indexes when I wasn't expecting that to happen could easily be just as
bad. I strongly believe users want to control what happens, not have
the system try to fix it for them automatically without their
knowledge.
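
For reference, the "search for the changed codepoints" step from the quoted text
could look roughly like the sketch below. It is only an illustration: the set of
changed code points is hypothetical (a single private-use code point), the table
is a list of (row_id, value) pairs, and the sketch only identifies affected
values. It says nothing about when such a scan runs or who triggers it, which is
the point under dispute above.

```python
# Find stored values that contain any code point whose behavior changed.
# CHANGED is a hypothetical set; a real one would have to be derived by
# comparing the data files of two Unicode versions.

def affected_rows(rows, changed_codepoints):
    """Yield (row_id, value) for values containing a changed code point."""
    changed = set(changed_codepoints)
    for row_id, value in rows:
        if any(ord(ch) in changed for ch in value):
            yield row_id, value

CHANGED = {0xE000}  # hypothetical: one private-use code point

table = [
    (1, "plain ascii"),
    (2, "has \ue000 inside"),
    (3, "ÀÉ"),
]

to_fix = list(affected_rows(table, CHANGED))
```

Only row 2 is reported here; index entries derived from the other rows would be
unaffected under these assumptions.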

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-19 17:39:29 from Jeff Davis

Responses

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-19 21:47:44 from Jeff Davis
