Quick Links

Re: Support LIKE with nondeterministic collations

From:	Peter Eisentraut <peter(at)eisentraut(dot)org>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Cc:	Daniel Verite <daniel(at)manitou-mail(dot)org>, Paul A Jungwirth <pj(at)illuminatedcomputing(dot)com>
Subject:	Re: Support LIKE with nondeterministic collations
Date:	2024-09-16 06:26:37
Message-ID:	3104729e-1fbf-4368-ac21-1f670062de28@eisentraut.org
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Here is an updated patch. It is rebased over the various recent changes
in the locale APIs. No other changes.

On 30.07.24 21:46, Peter Eisentraut wrote:
> On 27.07.24 00:32, Paul A Jungwirth wrote:
>> On Thu, Jun 27, 2024 at 11:31 PM Peter Eisentraut
>> <peter(at)eisentraut(dot)org> wrote:
>>> Here is an updated patch for this.
>>
>> I took a look at this. I added some tests and found a few that give
>> the wrong result (I believe). The new tests are included in the
>> attached patch, along with the results I expect. Here are the
>> failures:
>
> Thanks, these are great test cases.
>
>>
>> -- inner %% matches b then zero:
>> SELECT U&'cb\0061\0308' LIKE U&'c%%\00E4' COLLATE ignore_accents;
>>    ?column?
>> ----------
>> - t
>> + f
>> (1 row)
>>
>> -- trailing _ matches two codepoints that form one char:
>> SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents;
>>    ?column?
>> ----------
>> - t
>> + f
>> (1 row)
>>
>> -- leading % matches zero:
>> SELECT U&'\0061\0308bc' LIKE U&'%\00E4bc' COLLATE ignore_accents;
>>    ?column?
>> ----------
>> - t
>> + f
>> (1 row)
>>
>> -- leading % matches zero (with later %):
>> SELECT U&'\0061\0308bc' LIKE U&'%\00E4%c' COLLATE ignore_accents;
>>    ?column?
>> ----------
>> - t
>> + f
>> (1 row)
>>
>> I think the 1st, 3rd, and 4th failures are all from % not backtracking
>> to match zero chars.
>
> These are all because of this in like_match.c:
>
>        * Otherwise, scan for a text position at which we can match the
>        * rest of the pattern. The first remaining pattern char is known
>        * to be a regular or escaped literal character, so we can compare
>        * the first pattern byte to each text byte to avoid recursing
>        * more than we have to. [...]
>
> This shortcut doesn't work with nondeterministic collations, so we have
> to recurse in any case here. I have fixed that in the new patch; these
> test cases work now.
>
>> The 2nd failure I'm not sure about. Maybe my expectation is wrong, but
>> then why does the same test pass with __ leading not trailing? Surely
>> they should be consistent.
>
> The question is why is
>
> SELECT U&'cb\0061\0308' LIKE U&'cb_' COLLATE ignore_accents; -- false
>
> but
>
> SELECT U&'\0061\0308bc' LIKE U&'_bc' COLLATE ignore_accents; -- true
>
> The second one matches because
>
>     SELECT U&'\0308bc' = 'bc' COLLATE ignore_accents;
>
> So the accent character will be ignored if it's adjacent to another
> fixed substring in the pattern.

Attachment	Content-Type	Size
v5-0001-Support-LIKE-with-nondeterministic-collations.patch	text/plain	25.3 KB

In response to

Re: Support LIKE with nondeterministic collations at 2024-07-30 19:46:47 from Peter Eisentraut

Responses

Re: Support LIKE with nondeterministic collations at 2024-10-29 17:15:20 from Jacob Champion

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tony Wayne	2024-09-16 06:49:27	Re: A starter task
Previous Message	Alexander Lakhin	2024-09-16 06:00:00	Re: Robocopy might be not robust enough for never-ending testing on Windows