From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support regular expressions with nondeterministic collations |
Date: | 2024-12-18 18:36:05 |
Message-ID: | 4a1e185b7442e9f9c89be3d13aa4be148ce27b98.camel@j-davis.com |
Lists: | pgsql-hackers |
On Mon, 2024-12-16 at 17:16 -0500, Tom Lane wrote:
> Yeah, there is some set of collations for which that would work.
> But I think it requires nontrivial assumptions both about how
> comparison works in the collation, and whether the available
> case-folding logic matches that. An important point here is
> that the results depend on which direction you choose to smash
> case, which is at best a bit uncomfortable-making. For instance,
> I believe in German "ß" upcases to "SS" and would therefore match
> "ss" if you choose to fold to upper, but not so much if you choose
> to fold to lower. (Possibly Peter will correct me on that, but the
> point is there are some weird rules out there.)
Unicode specifies case folding separately from case conversion
(lower/title/upper) to deal with these kinds of issues: "ß", "Ss",
"SS", and "ss" all fold to "ss".
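To make the difference concrete, here is a small Python sketch. It assumes Python's built-in str.casefold() as a stand-in for Unicode full case folding; it is not the PostgreSQL infrastructure from the patches, only an illustration of the behavior.

```python
# Illustration only: Python's str.casefold() is used here as a stand-in
# for Unicode full case folding. It is not PostgreSQL code.
samples = ["ß", "Ss", "SS", "ss"]

# Case folding sends all four spellings to the same string.
folded = {s: s.casefold() for s in samples}

# Case conversion depends on the direction chosen:
upper_equal = "ß".upper() == "ss".upper()   # both become "SS"
lower_equal = "ß".lower() == "ss".lower()   # "ß" stays "ß", so they differ

for s, f in folded.items():
    print(f"{s!r} folds to {f!r}")
print("equal after upper():", upper_equal)
print("equal after lower():", lower_equal)
```

Folding avoids having to pick a direction at all, which is the uncomfortable part of the upper/lower approach.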
I have a couple patches that create that infrastructure:
https://www.postgresql.org/message-id/flat/a1886ddfcd8f60cb3e905c93009b646b4cfb74c5(dot)camel(at)j-davis(dot)com
https://www.postgresql.org/message-id/flat/ddfd67928818f138f51635712529bc5e1d25e4e7(dot)camel(at)j-davis(dot)com
After that's in place, we can even discuss adding a builtin
case-insensitive collation that does memcmp() on the case-folded strings.
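A rough sketch of what such a comparison could look like, in Python rather than C. The names casefold_key and ci_compare are hypothetical, str.casefold() stands in for the proposed case-folding infrastructure, and a bytewise comparison of UTF-8 encodings stands in for memcmp(). Normalization is deliberately ignored here.

```python
# Hypothetical sketch: compare strings bytewise after case folding.
# casefold_key() and ci_compare() are illustrative names, not PostgreSQL APIs.

def casefold_key(s: str) -> bytes:
    """Return the UTF-8 bytes of the case-folded string."""
    return s.casefold().encode("utf-8")

def ci_compare(a: str, b: str) -> int:
    """Return -1, 0, or 1, like the sign of memcmp() on the folded bytes."""
    ka, kb = casefold_key(a), casefold_key(b)
    return (ka > kb) - (ka < kb)

print(ci_compare("Straße", "STRASSE"))
print(ci_compare("abc", "ABD"))
```

Python compares bytes objects lexicographically, with a shorter prefix sorting first, which is the usual convention for memcmp() plus a length tiebreak.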
> The existing logic in the regex engine for case-insensitive matching
> is to convert every letter to a bracket expression containing all
> its case variants. For example, "a" becomes "[aA]" and "[xY1]"
> becomes "[xXyY1]". This fails on "ß", so a better way would be
> nice...
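For illustration, the expansion described in the quoted text can be sketched in Python. This is not the regex engine's C code: expand_case_insensitive and case_variants are hypothetical names, and the sketch handles only literal characters and simple bracket expressions (no escapes, ranges, or negation).

```python
import re

def case_variants(ch: str) -> list[str]:
    """Single-character case variants of ch; multi-character results are dropped."""
    variants = []
    for v in (ch.lower(), ch.upper()):
        # Only a single character fits as one member of a bracket expression.
        if len(v) == 1 and v not in variants:
            variants.append(v)
    if ch not in variants:
        variants.insert(0, ch)
    return variants

def expand_case_insensitive(pattern: str) -> str:
    """Rewrite each letter as a bracket expression of its case variants."""
    out = []
    in_bracket = False
    for ch in pattern:
        if ch == "[" and not in_bracket:
            in_bracket = True
            out.append(ch)
        elif ch == "]" and in_bracket:
            in_bracket = False
            out.append(ch)
        else:
            variants = case_variants(ch)
            if len(variants) == 1:
                out.append(ch)                      # no case variants
            elif in_bracket:
                out.append("".join(variants))       # already inside [...]
            else:
                out.append("[" + "".join(variants) + "]")
    return "".join(out)

print(expand_case_insensitive("a"))
print(expand_case_insensitive("[xY1]"))
print(expand_case_insensitive("ß"))
print(re.fullmatch(expand_case_insensitive("ß"), "SS"))
```

Because the uppercase form of "ß" is two characters, it cannot be a member of a bracket expression, so the expanded pattern for "ß" never matches "SS".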
We have a couple options:
* create more complex regexes like "(ß|[sS][sS])"
* case fold the pattern first, and then lazily case fold the string as
we match against it
The former sounds faster but the latter sounds simpler.
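Both options can be sketched in Python for a literal pattern. This is only an illustration under assumptions: str.casefold() stands in for the case-folding infrastructure, and lazily_folded and literal_match_folded are hypothetical names. A real regex engine would need to do the second option inside its matching loop, not just for literals.

```python
import re

# Option 1: spell out the alternatives in the regex itself.
option1 = re.compile("(ß|[sS][sS])")

# Option 2: fold the pattern up front, then fold the subject lazily,
# one character at a time, only as far as the match needs to look.
def lazily_folded(s: str):
    """Yield the case-folded characters of s on demand."""
    for ch in s:
        yield from ch.casefold()   # one input char may fold to several

def literal_match_folded(pattern: str, subject: str) -> bool:
    """Full match of a literal pattern against subject, comparing folded forms."""
    stream = lazily_folded(subject)
    for expected in pattern.casefold():
        if next(stream, None) != expected:
            return False           # stop early; the rest is never folded
    return next(stream, None) is None

for s in ["ß", "ss", "SS", "Ss", "sS", "s"]:
    print(s, bool(option1.fullmatch(s)), literal_match_folded("ß", s))
```

The first option keeps the matching loop unchanged but makes pattern compilation more involved; the second keeps the pattern simple at the cost of folding work during matching.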
Regards,
Jeff Davis