Re: encoding affects ICU regex character classification

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: encoding affects ICU regex character classification
Date: 2023-11-29 23:56:04
Message-ID: 360857.1701302164@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> The problem seems to be confusion between pg_wchar and a unicode code
> point in pg_wc_isalpha() and related functions.

Yeah, that's an ancient sore spot: we don't really know what the
representation of wchar_t is. We assume it holds Unicode code points
in UTF8 locales, but libc isn't required to do that AFAIK. See the
comment block starting about line 20 in regc_pg_locale.c.
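
To make the assumption concrete, here is a minimal sketch in C. It is
not PostgreSQL code: the helper names wchar_is_unicode() and
codepoint_isalpha() are hypothetical and exist only for illustration.
The sketch passes a Unicode code point to the <wctype.h> functions only
when the platform advertises, via the standard __STDC_ISO_10646__
macro, that wchar_t values are Unicode code points. The result of
iswalpha() still depends on the LC_CTYPE locale in effect, so the
caller is assumed to have set that already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <wctype.h>

/*
 * Hypothetical helper: report whether this platform promises that wchar_t
 * values are ISO 10646 (Unicode) code points.  The C standard lets an
 * implementation define __STDC_ISO_10646__ to say so, but does not require it.
 */
static bool
wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

/*
 * Hypothetical helper: classify a Unicode code point as alphabetic using
 * iswalpha().  Returns 1 for alphabetic, 0 for not alphabetic, and -1 when
 * the answer cannot be trusted because wchar_t is not known to hold Unicode
 * code points, or cannot represent this one.
 */
static int
codepoint_isalpha(uint32_t cp)
{
    assert(cp <= 0x10FFFF);         /* outside the Unicode code space */

    if (!wchar_is_unicode())
        return -1;                  /* representation of wchar_t unknown */
    if (cp > (uint32_t) WCHAR_MAX)
        return -1;                  /* e.g. a 16-bit wchar_t */

    /* Classification follows the current LC_CTYPE locale. */
    return iswalpha((wint_t) cp) != 0;
}
```

A caller that gets -1 back has no trustworthy answer from libc and
would need some other source of character classification.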

I doubt that ICU has much to do with this directly.

We'd have to find an alternate source of knowledge to replace the
<wctype.h> functions if we wanted to fix it fully ... can ICU do that?

regards, tom lane