| From: | Dominique Devienne <ddevienne(at)gmail(dot)com> |
|---|---|
| To: | Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> |
| Cc: | pgsql-general(at)postgresql(dot)org |
| Subject: | Re: Losing my latin on Ordering... |
| Date: | 2023-02-14 11:17:47 |
| Message-ID: | CAFCRh-_K09U+Zr0YY-nV7ts++WZ-XAR2kG3osNC03t6_LwA-xA@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-general |
On Tue, Feb 14, 2023 at 11:23 AM Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
wrote:
> On Tue, 2023-02-14 at 10:31 +0100, Dominique Devienne wrote:
> > Hi. Porting a unit test to PostgreSQL, we got a failure related to
> ordering.
> >
> > We've distilled it to the below. The DB is en_US.UTF-8, and the sorting
> we get
> > does not make sense to me. The same prefix can be sorted differently
> based on
> > the suffix apprently, which doesn't make any sense to me.
> >
> > Surely sorting should be "constant left-to-right", no? What are we
> missing?
>
> No, it isn't. That's not how natural language collations work.
>
Honestly, who expects the same prefix to sort differently based on what
comes after, in left-to-right languages?
How does one even find out what the (capricious?) rules for sorting in a
given collation are?
I'm aware of sorting taking numerical numbers in text influencing sort, so
"Foo10" comes after "Foo9",
but that's not what we are discussing here. "Foo*" and "Foo " have no
logical relatioship, like 9 and 10 do.
> > I'm already surprised (star) comes before (space), when the latter "comes
> > before" the former in both ASCII and UTF-8, but that the two "Foo*" and
> "Foo "
> > prefixed pairs are not clustered after sorting is just mistifying to me.
> So how come?
>
> Because they compare identical on the first three levels. Any difference
> in
> letters, accents or case weighs stronger, even if it occurs to the right
> of these substrings.
>
That's completely unintuitive...
> > For now we can work-around this by explicitly adding the `collate "C"` on
> > the queries underlying that particular test, but that would be wrong in
> the
> > general case of international strings to sort, so I'd really like to
> understand
> > what's going on.
>
> Yes, it soulds like the "C" collation may be best for you. That is, if
> you don't
> mind that "Z" < "a".
>
I would mind if I asked for case-insensitive comparisons.
So the "C" collation is fine with general UTF-8 encoding?
I.e. it will be codepoint ordered OK?
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Alvaro Herrera | 2023-02-14 11:35:16 | Re: Losing my latin on Ordering... |
| Previous Message | Amit Kapila | 2023-02-14 11:02:13 | Re: Support logical replication of DDLs |