Re: [18] Policy on IMMUTABLE functions and Unicode updates

From: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeremy Schneider <schneider(at)ardentperf(dot)com>
Subject: Re: [18] Policy on IMMUTABLE functions and Unicode updates
Date: 2024-07-19 19:06:57
Message-ID: 5fcd0b4ae0d90c4df3e05266f0859f9751defe46.camel@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 2024-07-16 at 10:42 -0700, Jeff Davis wrote:
> The IMMUTABLE marker for functions is quite simple on the surface, but
> could be interpreted a few different ways, and there's some historical
> baggage that makes it complicated.
>
> There are a number of ways in which IMMUTABLE functions can change
> behavior:
>
> 1. Updating or moving to a different OS affects all collations that use
> the libc provider (other than "C" and "POSIX", which don't actually use
> libc). LOWER(), INITCAP(), UPPER() and pattern matching are also
> affected.
>
> 2. Updating ICU affects the collations that use the ICU provider.
> ICU_UNICODE_VERSION(), LOWER(), INITCAP(), UPPER() and pattern matching
> are also affected.
>
> 3. Moving to a different database encoding may affect collations that
> use the "C" or "POSIX" locales in the libc provider (NB: those locales
> don't actually use libc).
>
> 4. A PG Unicode update may change the results of functions that depend
> on Unicode. For instance, NORMALIZE(), UNICODE_ASSIGNED(), and
> UNICODE_VERSION(). Or, if using the new builtin provider's "C.UTF-8"
> locale in version 17, LOWER(), INITCAP(), UPPER(), and pattern matching
> (NB: collation itself is not affected -- always code point order).
>
> 5. If a well-defined IMMUTABLE function produces the wrong results, we
> may fix the bug in the next major release.
>
> 6. The GUC extra_float_digits can change the results of floating point
> text output.
>
> 7. A UDF may be improperly marked IMMUTABLE. A particularly common
> variant is a UDF without search_path specified, which is probably not
> truly IMMUTABLE.
>
> Noah seemed particularly concerned[1] about #4, so I'll start off by
> discussing that.
>
> Unicode updates do not affect collation itself, they
> affect affect NORMALIZE(), UNICODE_VERSION(), and UNICODE_ASSIGNED().
> If using the builtin "C.UTF-8" locale, they also affect LOWER(),
> INITCAP(), UPPER(), and pattern matching. (NB: the builtin collation
> provider hasn't yet gone through any Unicode update.)
>
> There are two alternative philosophies:
>
> A. By choosing to use a Unicode-based function, the user has opted in
> to the Unicode stability guarantees[2], and it's fine to update Unicode
> occasionally in new major versions as long as we are transparent with
> the user.
>
> B. IMMUTABLE implies some very strict definition of stability, and we
> should never again update Unicode because it changes the results of
> IMMUTABLE functions.
>
> We've been following (A), and that's the defacto policy today[3][4].
> Noah and Laurenz argued[5] that the policy starting in version 18
> should be (B). Given that it's a policy decision that affects more than
> just the builtin collation provider, I'd like to discuss it more
> broadly outside of that subthread.
>
> [1] 
> https://www.postgresql.org/message-id/20240629220857.fb.nmisch@google.com
>
> [2]
> https://www.unicode.org/policies/stability_policy.html
>
> [3] 
> https://www.postgresql.org/message-id/1d178eb1bbd61da1bcfe4a11d6545e9cdcede1d1.camel%40j-davis.com
>
> [4]
> https://www.postgresql.org/message-id/564325.1720297161%40sss.pgh.pa.us
>
> [5]
> https://www.postgresql.org/message-id/af82b292f13dd234790bc701933e9992ee07d4fa.camel%40cybertec.at

Concerning #4, the new built-in locale, my hope (and, in my opinion, its only
value) is to get out of the problems #1 and #2 that are not under our control.

If changes in major PostgreSQL versions force users of the built-in
locale provider to rebuild indexes, that would invalidate it. I think that
users care more about data corruption than about exact Unicode-compliant
behavior. Anybody who does can use ICU.

People routinely create indexes that involve upper() or lower(), so I'd
say changing their behavior would be a problem.

Perhaps I should moderate my statement: if a change affects only a newly
introduced code point (which is unlikely to be used in a database), and we
think that the change is very important, we could consider applying it.
But that should be carefully considered; I am against blindly following the
changes in Unicode.

Yours,
Laurenz Albe

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2024-07-19 19:16:46 Re: Add privileges test for pg_stat_statements to improve coverage
Previous Message Nathan Bossart 2024-07-19 18:45:49 Re: Remove dependence on integer wrapping