Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-07-03 21:19:07
Message-ID: db496682c6656ac64433f05f8821e561bbf4d105.camel@j-davis.com
Lists: pgsql-hackers

On Tue, 2024-07-02 at 16:03 -0700, Noah Misch wrote:
> Each packager can choose their dependencies so the v16 providers don't
> have the problem.  With the $SUBJECT provider, a packager won't have
> that option.

While nothing needs to be changed for 17, I agree that we may need to
be careful in future releases not to break things.

Broadly speaking, you are right that we may need to freeze Unicode
updates or be more precise about versioning. But there's a lot of
nuance to the problem, so I don't think we should pre-emptively promise
either of those things right now.

Consider:

* Unless I made a mistake, the last three releases of Unicode (14.0,
15.0, and 15.1) all have the exact same behavior for UPPER() and
LOWER() -- even for unassigned code points. It would be silly to
promise to stay with 15.1 and then realize that moving to 16.0 doesn't
create any actual problem.

* Unicode also offers "case folding", which has even stronger stability
guarantees, and I plan to propose that soon. When implemented, it would
be preferred over LOWER()/UPPER() in index expressions for most use
cases.

* While someone can pin libc+ICU to particular versions, doing so is
impossible when using the official packages, and it additionally
requires something like [1], which only became available last year. I
don't think it's reasonable to put that forth as a matter-of-fact
solution.

* Let's keep some perspective: we've lived for a long time with ALL
text indexes at serious risk of breakage. In contrast, the concerns you
are raising now are about certain kinds of expression indexes over data
containing certain unassigned code points. I am not dismissing that
concern, but the builtin provider moves us in the right direction, and
we should not lose sight of that.
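
To make the narrow failure mode in the points above concrete, here is a
minimal Python sketch of how a key stored for an expression index on
LOWER() could go stale if a previously unassigned code point later gains
a case mapping. Everything in it is an assumption for illustration: the
two mapping tables, the stand-in code points, and the helper names
(lower_with, stale_keys) are hypothetical and are not PostgreSQL code or
real Unicode data.

```python
# Toy model of an expression index on LOWER(col) across two Unicode versions.
# NEW_UPPER / NEW_LOWER_CP are stand-ins for a code point that is unassigned
# in the "old" version and gains a lowercase mapping in the "new" one.
# Their real-world status is irrelevant to the sketch.
NEW_UPPER = "\u0378"
NEW_LOWER_CP = "\u0379"

# Hypothetical lowercase tables; real tables cover far more characters.
OLD_LOWER = {"A": "a", "B": "b"}
NEW_LOWER = {**OLD_LOWER, NEW_UPPER: NEW_LOWER_CP}


def lower_with(table, text):
    """Lowercase text using the given table; unmapped characters map to themselves."""
    return "".join(table.get(ch, ch) for ch in text)


def stale_keys(rows, old_table, new_table):
    """Return the rows whose stored index key would differ after the upgrade."""
    return [r for r in rows if lower_with(old_table, r) != lower_with(new_table, r)]


rows = ["AB", "A" + NEW_UPPER, "plain"]

# Only the row containing the formerly unassigned code point is affected.
affected = stale_keys(rows, OLD_LOWER, NEW_LOWER)
print([r.encode("unicode_escape").decode() for r in affected])

# For contrast, Python's own lowercasing vs. case folding on a real character.
# Python's str methods are not the builtin provider and may differ in detail.
print("ß".lower(), "ß".casefold())
```

In this model, rows made up only of already-assigned characters keep the
same key under both tables, so the exposure is limited to data that
actually contains the newly assigned code point.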

Given that no code changes for v17 are proposed, I suggest that we
refrain from making any declarations until the next version of Unicode
is released. If the pattern holds, that will be around September, which
still leaves time to make reasonable decisions for v18.

Regards,
Jeff Davis

[1] https://github.com/awslabs/compat-collation-for-glibc
