Re: Support regular expressions with nondeterministic collations

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support regular expressions with nondeterministic collations
Date: 2024-12-18 18:36:05
Message-ID: 4a1e185b7442e9f9c89be3d13aa4be148ce27b98.camel@j-davis.com
Lists: pgsql-hackers

On Mon, 2024-12-16 at 17:16 -0500, Tom Lane wrote:
> Yeah, there is some set of collations for which that would work.
> But I think it requires nontrivial assumptions both about how
> comparison works in the collation, and whether the available
> case-folding logic matches that.  An important point here is
> that the results depend on which direction you choose to smash
> case, which is at best a bit uncomfortable-making.  For instance,
> I believe in German "ß" upcases to "SS" and would therefore match
> "ss" if you choose to fold to upper, but not so much if you choose
> to fold to lower.  (Possibly Peter will correct me on that, but the
> point is there are some weird rules out there.)

Unicode specifies case folding separately from case conversion
(lower/title/upper) to deal with these kinds of issues: "ß", "Ss",
"SS", and "ss" all fold to "ss".
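
As a rough illustration of that distinction, here is a small sketch using
Python's str.casefold(), which applies Unicode full case folding. It is only
a stand-in for the idea and has nothing to do with the patches themselves;
the variable names are made up for the example.

```python
# The four spellings mentioned above.
variants = ["ß", "Ss", "SS", "ss"]

# Case folding maps every spelling to the same string.
folded = [s.casefold() for s in variants]

# Case conversion depends on the direction chosen.
uppered = [s.upper() for s in variants]
lowered = [s.lower() for s in variants]

print("folded: ", folded)
print("upper:  ", uppered)
print("lower:  ", lowered)
```

Folding and upcasing both collapse all four spellings to one form here, while
downcasing leaves "ß" distinct from "ss", which is the asymmetry Tom
describes.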

I have a couple of patches that create that infrastructure:

https://www.postgresql.org/message-id/flat/a1886ddfcd8f60cb3e905c93009b646b4cfb74c5(dot)camel(at)j-davis(dot)com
https://www.postgresql.org/message-id/flat/ddfd67928818f138f51635712529bc5e1d25e4e7(dot)camel(at)j-davis(dot)com

After that's in place, we can even discuss adding a builtin
case-insensitive collation that does memcmp() on the case-folded strings.
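
To make the idea concrete, a minimal sketch of such a comparison might look
like the following. This is Python rather than backend C, and casefold_key()
and ci_compare() are hypothetical names invented for the example, not
functions from the patches. Comparing the UTF-8 bytes of the folded strings
plays the role of memcmp().

```python
def casefold_key(s: str) -> bytes:
    # Hypothetical helper: fold the string, then encode it so that the
    # comparison below is a plain bytewise one, in the spirit of memcmp().
    return s.casefold().encode("utf-8")


def ci_compare(a: str, b: str) -> int:
    # Returns a negative, zero, or positive value like a C comparator.
    ka, kb = casefold_key(a), casefold_key(b)
    return (ka > kb) - (ka < kb)


print(ci_compare("Straße", "STRASSE"))
print(ci_compare("abc", "ABD"))
```

The ordering this produces is just byte order of the folded text, so it would
give case-insensitive equality but not a linguistically meaningful sort.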

> The existing logic in the regex engine for case-insensitive matching
> is to convert every letter to a bracket expression containing all
> its case variants.  For example, "a" becomes "[aA]" and "[xY1]"
> becomes "[xXyY1]".  This fails on "ß", so a better way would be
> nice...
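
For readers who have not looked at that code, the per-letter expansion can be
sketched roughly as below. This is a simplified Python model that handles
literal characters only, not the actual regex engine logic;
expand_case_variants() is a hypothetical name. It shows why "ß" is a problem:
its uppercase form is two characters and cannot go inside a bracket
expression.

```python
import re


def expand_case_variants(literal: str) -> str:
    parts = []
    for ch in literal:
        # Only single-character variants can live inside a bracket
        # expression, so multi-character case mappings are dropped.
        variants = [v for v in (ch.lower(), ch.upper()) if len(v) == 1]
        unique = list(dict.fromkeys(variants))  # de-duplicate, keep order
        if len(unique) > 1:
            parts.append("[" + "".join(re.escape(v) for v in unique) + "]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


print(expand_case_variants("a"))
print(expand_case_variants("ß"))
print(re.fullmatch(expand_case_variants("ß"), "SS"))
```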

We have a couple of options:

* create more complex regexes like "(ß|[sS][sS])"
* case fold the pattern first, and then lazily case fold the string as
we match against it

The former sounds faster but the latter sounds simpler.
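
A rough sketch of both options, using Python's re module as a stand-in for
the regex engine. Two simplifications are assumptions of the example: the
second option only handles a literal pattern (folding arbitrary regex syntax
would mangle escapes such as \S), and it folds the whole subject string up
front rather than lazily. folded_search() is a hypothetical name.

```python
import re

# Option 1: expand the literal into an alternation of its case variants.
expanded = re.compile(r"(ß|[sS][sS])")


# Option 2: case fold the (literal) pattern, then case fold the subject.
def folded_search(literal: str, text: str) -> bool:
    pattern = re.compile(re.escape(literal.casefold()))
    return pattern.search(text.casefold()) is not None


for s in ["ß", "SS", "Ss", "ss"]:
    print(s, bool(expanded.fullmatch(s)), folded_search("ß", s))
```

One thing the second option has to deal with is that folding can change the
string length ("ß" becomes two characters), so match offsets in the folded
text do not map directly back to the original.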

Regards,
Jeff Davis
