Re: BUG #18523: String compare not consistent

From: Peter Eisentraut <peter(at)eisentraut(dot)org>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>, patrick(dot)van(dot)dijk(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #18523: String compare not consistent
Date: 2024-06-26 18:30:56
Message-ID: 30c9e733-9573-4500-b191-5a8a67b3795f@eisentraut.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 26.06.24 17:59, Laurenz Albe wrote:
> Interestingly, glibc and ICU disagree about that:
>
> SELECT '|1' < '01' COLLATE "de_DE.utf8" AS glibc,
> '|1' < '01' COLLATE "de-AT-x-icu" AS icu;
>
> glibc │ icu
> ═══════╪═════
> f │ t
> (1 row)

Let's work this out.

The collation weights of the characters involved are:

0030 ; [.209F.0020.0002] # DIGIT ZERO
0031 ; [.20A0.0020.0002] # DIGIT ONE
007C ; [*06AF.0020.0002] # VERTICAL LINE

So under the ICU default "noignore" behavior, the sort keys are:

'|1' => 06AF 20A0 0000 0020 0020 0000 0002 0002
'01' => 209F 20A0 0000 0020 0020 0000 0002 0002

Therefore, '|1' < '01'.

Under "shifted", the effective collation weights are:

0030 ; [.209F.0020.0002.FFFF] # DIGIT ZERO
0031 ; [.20A0.0020.0002.FFFF] # DIGIT ONE
007C ; [.0000.0000.0000.06AF] # VERTICAL LINE

and the sort keys are:

'|1' => 20A0 0000 0020 0000 0002 0000 06AF FFFF
'01' => 209F 20A0 0000 0020 0020 0000 0002 0002 0000 FFFF FFFF

Therefore, '|1' > '01'. That is what you get in glibc. (Older glibc
use "shift-trimmed" behavior, which trims the trailing FFFF's, which
doesn't affect the result in this case.)

Under ICU (or rather CLDR), the '|' character is not actually a variable
collation element by default, so you actually get the same result if you
ask for "noignore" or "shifted" behavior. But you can make symbols
variable with 'und-u-ka-shifted-kv-symbol', and then you will also get
the same result as glibc.

Further reading:
-
https://peter.eisentraut.org/blog/2023/04/12/how-collation-of-punctuation-and-whitespace-works
-
https://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Heikki Linnakangas 2024-06-26 20:58:13 Re: BUG #18146: Rows reappearing in Tables after Auto-Vacuum Failure in PostgreSQL on Windows
Previous Message Tom Lane 2024-06-26 16:04:04 Re: BUG #18523: String compare not consistent