Support regular expressions with nondeterministic collations

From: Peter Eisentraut <peter(at)eisentraut(dot)org>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Support regular expressions with nondeterministic collations
Date: 2024-10-22 08:16:47
Message-ID: 899e7b5f-b54a-4e1b-9218-bb23534fc2c4@eisentraut.org
Lists: pgsql-hackers

This patch allows using regular expression functions and operators with
nondeterministic collations.

This complements the patches "Support LIKE with nondeterministic
collations" and "Support POSITION with nondeterministic collations" but
is independent. These three together fix most of the places where
nondeterministic collations are currently not allowed.

I had to decide here what the semantics should be. The SQL standard
doesn't say anything; it just refers to XQuery, and XQuery has no
knowledge of SQL collations. I also studied the relevant Unicode
standard (UTS #18), and it makes no mention of collations. So my
conclusion is that regular expressions should pay no attention to
collations. That makes it easy.

To clarify a bit more: Regular expressions don't pay attention to the
collate part of collations. So if you have an accent-insensitive
collation, that doesn't make the regular expression match
accent-insensitive. But they do, and will continue to, pay attention to
the ctype part of collations. The latter is a PostgreSQL extension.
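
As a rough illustration of the collate/ctype distinction (this is not
code from the patch and does not call PostgreSQL), the Python sketch
below stands in for an accent-insensitive collation by stripping
combining marks before comparing. The helper names strip_accents,
collation_equal and regex_matches are made up for this example, and
Python's re module only approximates the behavior described above.

```python
import re
import unicodedata


def strip_accents(s: str) -> str:
    # Decompose, then drop combining marks: a crude accent-insensitive key.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collation_equal(a: str, b: str) -> bool:
    # Stand-in (assumption) for "=" under an accent-insensitive collation.
    return strip_accents(a) == strip_accents(b)


def regex_matches(pattern: str, s: str) -> bool:
    # Regex matching looks at the characters themselves, not collation keys.
    return re.fullmatch(pattern, s) is not None


word = "r\u00e9sum\u00e9"  # "resume" spelled with precomposed accented e

print(collation_equal("resume", word))  # collate part: the strings compare equal
print(regex_matches("resume", word))    # regex: no accent-insensitive match
print(regex_matches(r"\w+", word))      # ctype part: accented letters are still letters
```

The first comparison follows the (simulated) collation, the second
shows that the regular expression does not, and the third shows that
character classification still applies.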

Note that UTS #18 has "retracted" support for tailoring in regular
expressions, which supports the idea that regular expressions should be
independent of things like language settings.

I think this is sensible. Regular expressions are inherently based on
sequences of characters, and trying to marry that with nondeterministic
collations just doesn't fit.

But: We also convert SIMILAR TO patterns to standard regular
expressions, and SIMILAR TO is covered in the SQL standard. And the
definition there does take the collation into account. But the
definition there is pretty much impossible to implement for
nondeterministic collations: It basically says, the predicate is true
if the string to be matched is equal, using the applicable collation, to
any of the strings in the set of strings described by the regular
expression. Which is a nice computer-sciency way to define it, but it
doesn't work in practice.
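
To make that definition concrete, here is a minimal Python sketch of it
taken literally, under loud assumptions: the set of strings described
by the pattern is written out by hand for a tiny pattern '(a|b)c', and
a case-insensitive comparison stands in for a nondeterministic
collation. The function names are hypothetical and nothing here comes
from the patch.

```python
import operator
from typing import Callable, Iterable


def case_insensitive_equal(a: str, b: str) -> bool:
    # Stand-in (assumption) for equality under a nondeterministic collation.
    return a.casefold() == b.casefold()


def similar_to_by_definition(
    subject: str,
    language: Iterable[str],
    equal: Callable[[str, str], bool],
) -> bool:
    # True if the subject equals, under `equal`, any string in the language.
    return any(equal(subject, candidate) for candidate in language)


# Language of the hypothetical pattern '(a|b)c', enumerated by hand.
language = [first + "c" for first in "ab"]  # ["ac", "bc"]

print(similar_to_by_definition("AC", language, case_insensitive_equal))
print(similar_to_by_definition("AC", language, operator.eq))
```

This brute-force approach only works because the language here is
finite and tiny. A pattern such as 'a%' describes an unbounded set of
strings, so enumerating candidates like this cannot be used as-is.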

So I need a way to remember whether a regular expression was originally
a SIMILAR TO pattern and then error out if the collation is
nondeterministic. I figured out a way to do that: Regular expressions
support prefixes like "***X", where X is some character. I added a new
prefix "***S". This is not externally visible; it just gets used
internally, and it doesn't conflict with real regular expressions.
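
The marker idea can be sketched as follows. This is a Python
illustration only, not the C code in the patch: the names
mark_similar_to, check_pattern and NondeterministicCollationError are
hypothetical, and whether the real implementation strips the marker at
this point is an assumption.

```python
SIMILAR_TO_MARKER = "***S"  # internal prefix named in this message


class NondeterministicCollationError(Exception):
    """Hypothetical error for SIMILAR TO under a nondeterministic collation."""


def mark_similar_to(converted_regex: str) -> str:
    # Tag a regex that was produced by converting a SIMILAR TO pattern.
    return SIMILAR_TO_MARKER + converted_regex


def check_pattern(pattern: str, collation_is_deterministic: bool) -> str:
    # Reject SIMILAR TO-derived patterns under nondeterministic collations;
    # ordinary regular expressions pass through unchanged.
    if pattern.startswith(SIMILAR_TO_MARKER):
        if not collation_is_deterministic:
            raise NondeterministicCollationError(
                "SIMILAR TO is not supported with nondeterministic collations"
            )
        # Assumption: the marker is removed before normal regex processing.
        return pattern[len(SIMILAR_TO_MARKER):]
    return pattern


print(check_pattern("^ab.*$", collation_is_deterministic=False))
print(check_pattern(mark_similar_to("^(?:ab.*)$"), collation_is_deterministic=True))
```

A plain regular expression runs regardless of the collation, while a
marked pattern raises the error when the collation is nondeterministic.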

In summary, this patch doesn't change any functionality that currently
works. It removes one error message and lets regular expressions simply
run, regardless of whether the collation is nondeterministic.

Attachment: v1-0001-Support-regular-expressions-with-nondeterministic.patch (text/plain, 8.9 KB)
