From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Support regular expressions with nondeterministic collations |
Date: | 2024-12-16 22:16:11 |
Message-ID: | 2029316.1734387371@sss.pgh.pa.us |
Lists: | pgsql-hackers |
Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> On Tue, 2024-10-22 at 10:40 -0400, Tom Lane wrote:
>> I understand and agree with your conclusion
>> that it's pretty much impossible to do what the SQL standard suggests
>> should happen --- but maybe we're both missing something that would
>> make it feasible.
> It sounds feasible for case-insensitive collations, right? We just
> casefold the pattern and the string, and then check for a match.

Yeah, there is some set of collations for which that would work.
But I think it requires nontrivial assumptions both about how
comparison works in the collation, and whether the available
case-folding logic matches that. An important point here is
that the results depend on which direction you choose to smash
case, which is at best a bit uncomfortable-making. For instance,
I believe in German "ß" upcases to "SS" and would therefore match
"ss" if you choose to fold to upper, but not so much if you choose
to fold to lower. (Possibly Peter will correct me on that, but the
point is there are some weird rules out there.)
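
To make the direction dependence concrete, here is a minimal Python sketch. It is only an illustration: it relies on Python's built-in Unicode case mappings, which are an assumption standing in for whatever case-folding logic a given collation provider actually offers, and `matches_after_fold` is a hypothetical helper name, not anything in PostgreSQL.

```python
def matches_after_fold(a, b, fold):
    """Compare two strings after applying the same folding function to both."""
    return fold(a) == fold(b)


# Folding to upper case: "ß" becomes "SS", so it matches "ss".
print(matches_after_fold("ß", "ss", str.upper))     # True

# Folding to lower case: "ß" is already lower case and stays "ß".
print(matches_after_fold("ß", "ss", str.lower))     # False

# Python's Unicode case folding maps "ß" to "ss", a third possible behavior.
print(matches_after_fold("ß", "ss", str.casefold))  # True
```

The same pair of strings matches or not depending purely on which folding function is chosen, which is the uncomfortable part.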

The existing logic in the regex engine for case-insensitive matching
is to convert every letter to a bracket expression containing all
its case variants. For example, "a" becomes "[aA]" and "[xY1]"
becomes "[xXyY1]". This fails on "ß", so a better way would be
nice...
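
The bracket-expression rewrite described above can be sketched roughly as follows. This is a toy Python model, not the regex engine's actual code: `case_variants` and `expand_case` are hypothetical names, and the sketch assumes patterns made only of literal characters and simple bracket expressions (no escapes, ranges, or other metacharacters).

```python
import re


def case_variants(ch):
    """Return the single-character case variants of ch, lower case first."""
    variants = []
    for v in (ch.lower(), ch.upper()):
        # A multi-character mapping such as "ß" -> "SS" cannot be a
        # member of a bracket expression, so it is dropped here.
        if len(v) == 1 and v not in variants:
            variants.append(v)
    return "".join(variants) or ch


def expand_case(pattern):
    """Rewrite each letter as a bracket expression of its case variants."""
    out = []
    in_bracket = False
    for ch in pattern:
        if ch == "[" and not in_bracket:
            in_bracket = True
            out.append(ch)
        elif ch == "]" and in_bracket:
            in_bracket = False
            out.append(ch)
        else:
            v = case_variants(ch)
            if in_bracket or len(v) == 1:
                # Inside brackets, just add the variants as extra members;
                # caseless characters are emitted unchanged.
                out.append(v)
            else:
                out.append("[" + v + "]")
    return "".join(out)


print(expand_case("a"))       # [aA]
print(expand_case("[xY1]"))   # [xXyY1]
print(expand_case("ß"))       # ß
print(re.fullmatch(expand_case("ß"), "SS"))  # None
```

Because the upper-case form of "ß" is two characters long, the per-character rewrite has nothing to add for it, and the expanded pattern cannot match "SS".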

regards, tom lane