Re: Support LIKE with nondeterministic collations

From: Peter Eisentraut <peter(at)eisentraut(dot)org>
To: Paul A Jungwirth <pj(at)illuminatedcomputing(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Subject: Re: Support LIKE with nondeterministic collations
Date: 2024-07-30 19:46:47
Message-ID: ba4dc9f5-fb14-433f-8d53-3dd291585161@eisentraut.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 27.07.24 00:32, Paul A Jungwirth wrote:
> On Thu, Jun 27, 2024 at 11:31 PM Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
>> Here is an updated patch for this.
>
> I took a look at this. I added some tests and found a few that give
> the wrong result (I believe). The new tests are included in the
> attached patch, along with the results I expect. Here are the
> failures:

Thanks, these are great test cases.

>
> -- inner %% matches b then zero:
> SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
> ?column?
> ----------
> - t
> + f
> (1 row)
>
> -- trailing _ matches two codepoints that form one char:
> SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
> ?column?
> ----------
> - t
> + f
> (1 row)
>
> -- leading % matches zero:
> SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
> ?column?
> ----------
> - t
> + f
> (1 row)
>
> -- leading % matches zero (with later %):
> SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
> ?column?
> ----------
> - t
> + f
> (1 row)
>
> I think the 1st, 3rd, and 4th failures are all from % not backtracking
> to match zero chars.

These are all because of this in like_match.c:

* Otherwise, scan for a text position at which we can match the
* rest of the pattern. The first remaining pattern char is known
* to be a regular or escaped literal character, so we can compare
* the first pattern byte to each text byte to avoid recursing
* more than we have to. [...]

This shortcut doesn't work with nondeterministic collations, so we have
to recurse in any case here. I have fixed that in the new patch; these
test cases work now.

> The 2nd failure I'm not sure about. Maybe my expectation is wrong, but
> then why does the same test pass with __ leading not trailing? Surely
> they should be consistent.

The question is why is

SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents; -- false

but

SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents; -- true

The second one matches because

SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;

So the accent character will be ignored if it's adjacent to another
fixed substring in the pattern.

Attachment Content-Type Size
v4-0001-Support-LIKE-with-nondeterministic-collations.patch text/plain 25.5 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2024-07-30 19:47:16 Re: meson vs windows perl
Previous Message Laurenz Albe 2024-07-30 19:46:46 Re: proposal: schema variables