Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-07-01 19:24:15
Message-ID: 1ecfeb4d2b1b7f119fa917d3052a3aecfaf4f425.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sat, 2024-06-29 at 15:08 -0700, Noah Misch wrote:
> lower(), initcap(), upper(), and regexp_matches() are
> PROVOLATILE_IMMUTABLE.
> Until now, we've delegated that responsibility to the user.  The user
> is
> supposed to somehow never update libc or ICU in a way that changes
> outcomes
> from these functions.

To me, "delegated" connotes a clear and organized transfer of
responsibility to the right person to solve it. In that sense, I
disagree that we've delegated it.

What's happened here is evolution of various choices that seemed
reasonable at the time. Unfortunately, the consequences that are hard
for us to manage and even harder for users to manage themselves.

>   Now that postgresql.org is taking that responsibility
> for builtin C.UTF-8, how should we govern it?  I think the above text
> and [1]
> convey that we'll update the Unicode data between major versions,
> making
> functions like lower() effectively STABLE.  Is that right?

Marking them STABLE is not a viable option, that would break a lot of
valid use cases, e.g. an index on LOWER().

Unicode already has its own governance, including a stability policy
that includes case mapping:

https://www.unicode.org/policies/stability_policy.html#Case_Pair

Granted, that policy does not guarantee that the results will never
change. In particular, the results can change if using unassinged code
poitns that are later assigned to Cased characters.

That's not terribly common though; for instance, there are zero changes
in uppercase/lowercase behavior between Unicode 14.0 (2021) and 15.1
(current) -- even for code points that were unassigned in 14.0 and
later assigned. I checked this by modifying case_test.c to look at
unassigned code points as well.

There's a greater chance that character properties can change (e.g.
whether a character is "alphabetic" or not) in new releases of Unicode.
Such properties can affect regex character classifications, and in some
cases the results of initcap (because it uses the "alphanumeric"
classification to determine word boundaries).

I don't think we need code changes for 17. Some documentation changes
might be helpful, though. Should we have a note around LOWER()/UPPER()
that users should REINDEX any dependent indexes when the provider is
updated?

> (This thread had some discussion[2] that datcollversion/collversion
> won't
> necessarily change when a major versions changes lower() behavior.)

datcollversion/collversion track the vertsion of the collation
specifically (text ordering only), not the ctype (character semantics).
When using the libc provider, get_collation_actual_version() completely
ignores the ctype.

It would be interesting to consider tracking the versions separately,
though.

Regards,
Jeff Davis

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-07-01 19:46:56 Re: optimizing pg_upgrade's once-in-each-database steps
Previous Message Daniel Gustafsson 2024-07-01 19:19:59 Re: Avoid incomplete copy string (src/backend/access/transam/xlog.c)