Re: [18] Unintentional behavior change in commit e9931bfb75

From: Noah Misch <noah(at)leadboat(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [18] Unintentional behavior change in commit e9931bfb75
Date: 2025-04-14 20:44:08
Message-ID: 20250414204408.5d.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 14, 2025 at 12:59:50PM -0700, Jeff Davis wrote:
> On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote:
> > I think the code for (2) and for "I/i in Turkish" haven't returned. 
> > Given
> > commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain
> > lower(),
> > the regex code likely needs a similar restoration.  If not, the regex
> > comments
> > would need to change to match the code.
>
> Great find, thank you! I'm curious how you came about this difference,
> was it through testing or code inspection?

Code inspection. I saw commit e9931bf remove "per comments above" references
without editing the referenced "comments above".

> --- a/src/backend/regex/regc_pg_locale.c
> +++ b/src/backend/regex/regc_pg_locale.c
> @@ -21,9 +21,10 @@
> #include "utils/pg_locale.h"
>
> /*
> - * To provide as much functionality as possible on a variety of platforms,
> - * without going so far as to implement everything from scratch, we use
> - * several implementation strategies depending on the situation:
> + * For the libc provider, to provide as much functionality as possible on a
> + * variety of platforms without going so far as to implement everything from
> + * scratch, we use several implementation strategies depending on the
> + * situation:
> *
> * 1. In C/POSIX collations, we use hard-wired code. We can't depend on
> * the <ctype.h> functions since those will obey LC_CTYPE. Note that these
> @@ -33,8 +34,9 @@
> *
> * 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
> * This assumes that every platform uses Unicode codepoints directly
> - * as the wchar_t representation of Unicode. On some platforms
> - * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
> + * as the wchar_t representation of Unicode. (XXX: This could be a problem
> + * for ICU in non-UTF8 encodings.) On some platforms wchar_t is only 16 bits
> + * wide, so we have to punt for codepoints > 0xFFFF.
> *
> * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
> * values up to 255, and punt for values above that. This is 100% correct

This leaves outdated material in the big comment edited here. Most prominent:

* 2. In the "default" collation (which is supposed to obey LC_CTYPE):
...
* 3. Here, we use the locale_t-extended forms of the <wctype.h> and <ctype.h>
* functions, under exactly the same cases as #2.

v18 regc_pg_locale.c uses only the locale_t-extended forms, and it aims not to
depend on LC_CTYPE. v17 used both tolower() and tolower_l(), but v18 uses the
latter only.

> @@ -562,10 +564,16 @@ pg_wc_toupper(pg_wchar c)
> case PG_REGEX_STRATEGY_BUILTIN:
> return unicode_uppercase_simple(c);
> case PG_REGEX_STRATEGY_LIBC_WIDE:
> + /* force C behavior for ASCII characters, per comments above */
> + if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
> + return pg_ascii_toupper((unsigned char) c);

v17 has distinct symbols PG_REGEX_LOCALE_WIDE (default-locale case) and
PG_REGEX_LOCALE_WIDE_L (locale_t) case. Consolidating them looked attractive
when this is_default case was going away. Now that is_default will remain a
factor, I don't think overloading PG_REGEX_STRATEGY_LIBC_WIDE is working out
well. The "_L" suffix no longer captures the distinction, since there's no
case that uses tolower() as opposed to tolower_l(). However, I think v17's
concept of separate PG_REGEX_ symbols for the default-locale case is still the
right thing for v18. In other words, this code should change to look more
like v17, with the removal of non-locale_t calls being the main change.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2025-04-14 20:49:39 Re: Fix a resource leak (src/backend/utils/adt/rowtypes.c)
Previous Message Jacob Champion 2025-04-14 20:39:52 Re: Adding support for SSLKEYLOGFILE in the frontend