Re: Built-in CTYPE provider

From: Noah Misch <noah(at)leadboat(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-07-04 21:26:41
Message-ID: 20240704212641.c4.nmisch@google.com
Lists: pgsql-hackers

On Wed, Jul 03, 2024 at 02:19:07PM -0700, Jeff Davis wrote:
> * Unless I made a mistake, the last three releases of Unicode (14.0,
> 15.0, and 15.1) all have the exact same behavior for UPPER() and
> LOWER() -- even for unassigned code points. It would be silly to
> promise to stay with 15.1 and then realize that moving to 16.0 doesn't
> create any actual problem.

I think you're saying that if some Unicode update changes the results of a
STABLE function but does not change the result of any IMMUTABLE function, we
may as well import that update. Is that about right? If so, I agree.

In addition to the options I listed earlier (error in pg_upgrade or document
that IMMUTABLE stands), I would be okay with a third option. Decide here that
we'll not adopt a Unicode update in a way that changes a v17 IMMUTABLE
function result of the new provider. We don't need to write that in the
documentation, since it's implicit in IMMUTABLE. Delete the "stable within a
<productname>Postgres</productname> major version" documentation text.

> * While someone can pin libc+ICU to particular versions, it's
> impossible when using the official packages, and additionally requires
> using something like [1], which just became available last year. I
> don't think it's reasonable to put it forth as a matter-of-fact
> solution.
>
> * Let's keep some perspective: we've lived for a long time with ALL
> text indexes at serious risk of breakage. In contrast, the concerns you
> are raising now are about certain kinds of expression indexes over data
> containing certain unassigned code points. I am not dismissing that
> concern, but the builtin provider moves us in the right direction and
> let's not lose sight of that.

I see you're trying to help users get less breakage, and that's a good goal.
I agree $SUBJECT eliminates libc+ICU breakage, and libc+ICU breakage has hurt
plenty. However, you proposed to update Unicode data and give REINDEX as the
solution to breakage this causes. Unlike libc+ICU breakage, the packager has
no escape from that. That's a different kind of breakage proposition, and no
new PostgreSQL feature should do that. It's on a different axis from helping
users avoid libc+ICU breakage, and a feature doesn't get to credit helping on
one axis against a regression on the other axis. What am I missing here?

> Given that no code changes for v17 are proposed, I suggest that we
> refrain from making any declarations until the next version of Unicode
> is released. If the pattern holds, that will be around September, which
> still leaves time to make reasonable decisions for v18.

Soon enough, a Unicode release will add one character to regexp [[:alpha:]].
PostgreSQL will then need to decide what IMMUTABLE is going to mean. How does
that get easier in September?

Thanks,
nm

> [1] https://github.com/awslabs/compat-collation-for-glibc
