
Re: Update Unicode data to Unicode 16.0.0

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Jeremy Schneider <schneider(at)ardentperf(dot)com>
Cc:	Joe Conway <mail(at)joeconway(dot)com>, Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Update Unicode data to Unicode 16.0.0
Date:	2025-03-18 18:54:46
Message-ID:	b7c9dafa10ba3ad7fa201d9c3a2d8ac5b7aa923d.camel@j-davis.com
Lists:	pgsql-hackers

On Tue, 2025-03-18 at 09:28 -0700, Jeremy Schneider wrote:
> We think case-insensitive indexes are probably uncommon, so as
> long as it's "rare" we can let them break.

Let's define "break" in this context to mean that the constraints are
not enforced, or that the query doesn't return the results that the
user is expecting.

Let's say a user has an index on LOWER(t) in PG17 (Unicode 15.1). Then
Unicode 16.0 comes out, introducing the newly-assigned U+A7DC, which
lowercases to U+019B. The rest of the world moves on and starts using
U+A7DC.

There are only two ways that Postgres can prevent breakage:

1. Update the database to Unicode 16.0 before U+A7DC is encountered, so
that it's properly lowercased to U+019B, and a query on LOWER(t) =
U&'\019B' will correctly return the record containing it.

2. Prevent U+A7DC from going into the database at all.

Continuing on with Unicode 15.1 and accepting the unassigned code point
*cannot* prevent breakage.
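
To make the scenario concrete, here is a minimal Python sketch. It does not use Postgres's real case-mapping tables or the STRICT_UNICODE patch: the names LOWER_15_1, LOWER_16_0, UNASSIGNED_IN_15_1, build_index and check_strict are all hypothetical stand-ins, and the tables model only the single mapping discussed above.

```python
# Hypothetical stand-ins for the case-mapping tables. Only the one
# mapping discussed above is modeled; everything else passes through.
LOWER_15_1 = {}                     # U+A7DC is unassigned, so no mapping
LOWER_16_0 = {"\uA7DC": "\u019B"}   # Unicode 16.0 maps U+A7DC to U+019B
UNASSIGNED_IN_15_1 = {"\uA7DC"}     # assumption: the only code point modeled


def lower(text, table):
    """Lowercase text with a simple per-character table."""
    return "".join(table.get(ch, ch) for ch in text)


def build_index(rows, table):
    """Simulate an expression index on LOWER(t): key -> matching rows."""
    index = {}
    for row in rows:
        index.setdefault(lower(row, table), []).append(row)
    return index


def check_strict(text):
    """Option 2: refuse text containing a code point unknown to 15.1."""
    for ch in text:
        if ch in UNASSIGNED_IN_15_1:
            raise ValueError(f"unassigned code point U+{ord(ch):04X}")
    return text


rows = ["\uA7DC"]  # written by a client that already uses Unicode 16.0

# Staying on 15.1 and accepting the value: the lookup for
# LOWER(t) = U&'\019B' finds nothing.
missed = build_index(rows, LOWER_15_1).get("\u019B", [])

# Option 1: the tables were updated before the value arrived.
found = build_index(rows, LOWER_16_0).get("\u019B", [])

# Option 2: the value is rejected before it reaches the table.
try:
    check_strict("\uA7DC")
    rejected = False
except ValueError:
    rejected = True
```

In this sketch `missed` is empty while `found` contains the row, which is the breakage described above; `rejected` is true because the strict check never lets the value in.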

A truly paranoid user would want a combination of both solutions:
regular Unicode updates, plus something like STRICT_UNICODE
( https://commitfest.postgresql.org/patch/4876/ ) to protect the user
between the time Unicode assigns the code point and the time they can
deploy a version of Postgres that understands it.

You are rightfully concerned that updating Unicode can create its own
inconsistencies, and if nothing is done that can lead to breakage as
well. The upgrade-time check in this thread is one solution to that
problem, but we could do a lot more.

You are also right that we should be more skeptical of an internal
inconsistency (e.g. different results for seqscan vs indexscan) than a
wider definition of inconsistency. But the user created a Unicode-based
case-folded index there for a reason, and we shouldn't lose sight of
that.
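
As a rough illustration of the internal inconsistency, the Python sketch below simulates an index on LOWER(t) that was built under Unicode 15.1 and is then used, without a rebuild, by a server running Unicode 16.0. The tables and names are hypothetical and model only the U+A7DC mapping; the final check is just one way to spot affected rows, not the logic of the upgrade-check patch.

```python
# Hypothetical stand-ins for the two versions of the lowercase table.
LOWER_15_1 = {}
LOWER_16_0 = {"\uA7DC": "\u019B"}


def lower(text, table):
    """Lowercase text with a simple per-character table."""
    return "".join(table.get(ch, ch) for ch in text)


rows = ["\uA7DC", "\u019B"]

# Index on LOWER(t), built while the server still used Unicode 15.1.
stale_index = {}
for row in rows:
    stale_index.setdefault(lower(row, LOWER_15_1), []).append(row)

# The server is now on Unicode 16.0 but the index was not rebuilt.
key = "\u019B"
seqscan = [row for row in rows if lower(row, LOWER_16_0) == key]
indexscan = stale_index.get(key, [])

# The two plans disagree on the same query.
inconsistent = sorted(seqscan) != sorted(indexscan)

# Rows whose index key changes between versions would need a rebuild.
affected = [row for row in rows
            if lower(row, LOWER_15_1) != lower(row, LOWER_16_0)]
```

Here the sequential scan returns both rows and the index scan returns only the row that already contained U+019B, so `inconsistent` is true and `affected` lists the U+A7DC row.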

> I'm not asking for an extreme definition of "IMMUTABLE" but I'd be
> very happy with a GUC "data_safety=radical_like_jeremy" where Postgres
> simply won't start if the control file says it was from a different
> operating system or architecture or ICU/glibc collation version. I can
> disable the GUC (like a maintenance mode) to rebuild my indexes and
> update my collation versions, and ideally this GUC would also mean that
> indexes simply aren't allowed to be created on functions that might
> change within the guarantees that are made. (And range-based partitions
> can't use them, and FDWs can't rely on them for query planning, etc.)

Does the upgrade check patch in this thread accomplish that for you? If
not, what else does it need?

It's an upgrade-time check rather than a GUC, but it basically seems to
match what you want. See:

https://www.postgresql.org/message-id/16c4e37d4c89e63623b009de9ad6fb90e7456ed8.camel@j-davis.com

Regards,
Jeff Davis

In response to

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-18 16:28:39 from Jeremy Schneider

Responses

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-18 18:58:05 from Robert Haas
