Re: Support LIKE with nondeterministic collations

From: Peter Eisentraut <peter(at)eisentraut(dot)org>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pgsql-Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Support LIKE with nondeterministic collations
Date: 2024-05-03 18:53:52
Message-ID: b32cefe2-b9e2-499e-b919-fe8f21c5bc22@eisentraut.org
Lists: pgsql-hackers

On 03.05.24 16:58, Daniel Verite wrote:
> * Generating bounds for a sort key (prefix matching)
>
> Having sort keys for strings allows for easy creation of bounds -
> sort keys that are guaranteed to be smaller or larger than any sort
> key from a given range. For example, if bounds are produced for a
> sortkey of string “smith”, strings between upper and lower bounds
> with one level would include “Smith”, “SMITH”, “sMiTh”. Two kinds
> of upper bounds can be generated - the first one will match only
> strings of equal length, while the second one will match all the
> strings with the same initial prefix.
>
> CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
> the maximum primary weight, so that for example the string
> “smith\uFFFF” can be used as the upper bound rather than modifying
> the sort key for “smith”.
>
> In other words it says that
>
> col LIKE 'smith%' collate "nd"
>
> is equivalent to:
>
> col >= 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"
>
> which could be obtained from an index scan, assuming a btree
> index on "col" collate "nd".
>
> U+FFFF is a valid code point but a "non-character" [1] so it's
> not supposed to be present in normal strings.

Thanks, this could be very useful!
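
For illustration, the equivalence quoted above can be checked with a small
sketch. This is not ICU and not PostgreSQL code: str.casefold() is only an
assumed stand-in for a one-level (case-insensitive) sort key, and the
function names are hypothetical. In this stand-in, keys compare by code
point, so U+FFFF sorts above every BMP character but not above
supplementary-plane characters, whereas ICU gives U+FFFF the maximum
primary weight.

```python
def sort_key(s: str) -> str:
    # Assumed stand-in for a one-level ICU sort key: case differences vanish.
    return s.casefold()


def matches_prefix(value: str, prefix: str) -> bool:
    # Intended meaning of: value LIKE prefix || '%' under the stand-in collation.
    return sort_key(value).startswith(sort_key(prefix))


def in_bounds(value: str, prefix: str) -> bool:
    # Range form: value >= prefix AND value < prefix followed by U+FFFF.
    lower = sort_key(prefix)
    upper = sort_key(prefix + "\uffff")
    return lower <= sort_key(value) < upper


names = ["Smith", "SMITH", "sMiTh", "smithson", "Smithers",
         "smit", "smiti", "Smyth", "jones"]

by_like = [n for n in names if matches_prefix(n, "smith")]
by_range = [n for n in names if in_bounds(n, "smith")]

print(by_like)
print(by_range)
```

Both lists should come out the same for this sample, which is the property
that would let a btree index on the column answer the prefix match as a
range scan.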
