From: | Peter Eisentraut <peter(at)eisentraut(dot)org> |
---|---|
To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Support regular expressions with nondeterministic collations |
Date: | 2024-10-22 08:16:47 |
Message-ID: | 899e7b5f-b54a-4e1b-9218-bb23534fc2c4@eisentraut.org |
Lists: | pgsql-hackers |
This patch allows using regular expression functions and operators with
nondeterministic collations.
This complements the patches "Support LIKE with nondeterministic
collations" and "Support POSITION with nondeterministic collations" but
is independent. These three together fix most of the places where
nondeterministic collations are currently not allowed.
I had to decide here what the semantics should be. The SQL standard
doesn't say anything; it just refers to XQuery. XQuery has no knowledge
of SQL collations. I also studied the relevant Unicode standard (UTS
#18), and it makes no mention of collations. So my conclusion is that
regular expressions should pay no attention to collations. That makes
it easy.
To clarify a bit more: Regular expressions don't pay attention to the
collate part of collations. So if you have an accent-insensitive
collation, that doesn't make regular expression matching
accent-insensitive. But they do, and will continue to, pay attention to
the ctype part of collations. The latter is a PostgreSQL extension.
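As a rough illustration of the collate/ctype distinction, here is a minimal Python sketch. It is an analogy only: it uses Python's `re` module rather than PostgreSQL's regex engine, and `ai_equal` is a hypothetical stand-in for equality under an accent-insensitive collation, not anything from the patch.

```python
import re
import unicodedata


def ai_equal(a: str, b: str) -> bool:
    """Hypothetical stand-in for equality under an accent-insensitive collation."""
    def strip_accents(s: str) -> str:
        # Decompose, then drop combining marks such as U+0301 (acute accent).
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return strip_accents(a) == strip_accents(b)


# "Collate part": the toy collation treats the two strings as equal.
collation_says_equal = ai_equal("résumé", "resume")

# Regex matching works on characters, so the accent still matters.
regex_matches_plain = re.fullmatch(r"resume", "résumé") is not None

# "Ctype part": character classification still applies, so the accented
# letters count as word characters.
regex_matches_class = re.fullmatch(r"\w+", "résumé") is not None
```

Under these assumptions the equality test succeeds while the literal pattern does not match, which is the behavior described above: the collation's comparison rules are ignored, but its character classification is not.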
Note that UTS #18 has "retracted" support for tailoring in regular
expressions, which supports the idea that regular expressions should be
independent of things like language settings.
I think this is sensible. Regular expressions are inherently based on
sequences of characters, and trying to marry that with nondeterministic
collations just doesn't fit.
But: We also convert SIMILAR TO patterns to standard regular
expressions, and SIMILAR TO is covered in the SQL standard. And the
definition there does take the collation into account. But that
definition is pretty much impossible to implement for
nondeterministic collations: It basically says that the predicate is
true if the string to be matched is equal, using the applicable
collation, to any of the strings in the set of strings described by the
regular expression. That is a nice computer-sciency way to define it,
but it doesn't work in practice.
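To make that set-based definition concrete, here is a small Python sketch for a toy pattern whose language is finite. Everything in it is an assumption for illustration: the pattern is given as a list of alternatives per position (roughly `(caf|cof)(e|fee)`), and `ai_equal` is a hypothetical accent-insensitive equality, not PostgreSQL's collation machinery.

```python
import itertools
import unicodedata


def ai_equal(a: str, b: str) -> bool:
    """Hypothetical accent-insensitive equality (stand-in for a collation)."""
    def strip_accents(s: str) -> str:
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return strip_accents(a) == strip_accents(b)


def language(alternatives_per_position):
    """Enumerate every string described by the toy pattern."""
    for parts in itertools.product(*alternatives_per_position):
        yield "".join(parts)


def similar_to_by_definition(subject, alternatives_per_position, equal):
    """True if subject equals, under `equal`, any string in the pattern's language."""
    return any(equal(subject, candidate)
               for candidate in language(alternatives_per_position))


# Toy pattern, roughly '(caf|cof)(e|fee)': four strings in total.
pattern = [("caf", "cof"), ("e", "fee")]

accented_match = similar_to_by_definition("café", pattern, ai_equal)
no_match = similar_to_by_definition("tea", pattern, ai_equal)
```

This only works because the toy language has four members. A pattern containing `%` or `*` describes an infinite set of strings, so the enumeration never finishes for a non-matching input, which is one way to see why the definition is hard to implement directly.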
So I need a way to remember whether a regular expression was originally
a SIMILAR TO pattern and then error out if the collation is
nondeterministic. I figured out a way to do that: Regular expressions
support prefixes like "***X", where X is some character. I added a new
prefix "***S". This is not externally visible; it is only used
internally, and it doesn't conflict with real regular expressions.
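The marker idea can be sketched as follows. This is a Python illustration under stated assumptions, not the patch's C code: the function names, the exception type, and the choice to strip the marker before compiling are all hypothetical, and where exactly the real check happens is not described here.

```python
SIMILAR_TO_MARKER = "***S"  # prefix named in this message; handling below is assumed


def tag_pattern(regex: str, from_similar_to: bool) -> str:
    """Hypothetical helper: remember that a regex came from a SIMILAR TO pattern."""
    return SIMILAR_TO_MARKER + regex if from_similar_to else regex


def check_regex_allowed(tagged: str, collation_is_deterministic: bool) -> str:
    """Hypothetical check: reject SIMILAR TO under a nondeterministic collation.

    Returns the regex with the internal marker removed.
    """
    came_from_similar_to = tagged.startswith(SIMILAR_TO_MARKER)
    if came_from_similar_to and not collation_is_deterministic:
        raise ValueError(
            "nondeterministic collations are not supported for SIMILAR TO")
    return tagged[len(SIMILAR_TO_MARKER):] if came_from_similar_to else tagged


# A plain regular expression runs regardless of the collation.
plain = check_regex_allowed(tag_pattern("ab+", False),
                            collation_is_deterministic=False)

# A regex derived from SIMILAR TO is fine with a deterministic collation.
converted = check_regex_allowed(tag_pattern("^(?:ab+)$", True),
                                collation_is_deterministic=True)
```

With a nondeterministic collation, the second call would raise instead, while the first kind of call keeps working.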
In summary, this patch doesn't change any functionality that currently
works. It only removes one error message and lets regular expressions
run, regardless of whether the collation is nondeterministic.
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Support-regular-expressions-with-nondeterministic.patch | text/plain | 8.9 KB |