Re: Upgrading locale issues

From: rihad <rihad(at)mail(dot)ru>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, pgsql-general General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Upgrading locale issues
Date: 2019-05-02 17:45:25
Message-ID: 352ac37a-39e6-5afe-d822-b4e031b9b314@mail.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 05/02/2019 05:36 PM, Daniel Verite wrote:
> rihad wrote:
>
>> Thanks for the reply. Do you know what would a "decent" ICU collation be
>> to bind to a field's schema definition so it would mimic a UTF-8
>> encoding for a multilingual column? Maybe und-x-icu? We aren't as much
>> concerned about their sortability in most cases, we just want indexes to
>> better handle future PG/ICU upgrades. But what does und(efined) even
>> mean with respect to collations?
> "undefined" in this context means unspecified language and
> unspecified country or region. It implies that no language-specific
> nor regional rule will be applied to compare strings.
>
> Using C.UTF-8 as the collation for text fields to index may be the
> best trade-off in your case. It should be immune to libc and ICU
> upgrades.
>
> With C.UTF-8, a string like 'BC' will sort before 'ab', and punctuation
> and accents will also sort differently than with a linguistic-aware
> collation.
Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
things: for speed and for skipping the ORDER BY step (since btree
indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
still work correctly for table lookups? And can the existing 
en_US.UTF-8 fields' definition be altered in place, without a
dump+restore? en_US.UTF-8 is the default encoding+locale+collation, it
isn't set explicitly for any of our string columns. I assume there's
some "catch-all" ordering taking place even for the C locale, so there
won't be any bizarre things like b coming before a, or generally for any
language, the second letter of its alphabet coming before the first?

> If your applications care about that, it can be fixed by simply
> adding COLLATE "default" to the ORDER BY clause of the queries that
> are meant to present data to users.
> COLLATE "default" means the collation of the database, which
> presumably would be something like "language_REGION.UTF-8" in your
> case. If you never specified it explicitly, it came from initdb which
> itself got it from the environment of the server.
>
>
> Best regards,

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Ray Cote 2019-05-02 18:22:46 Debugging Failed Startup
Previous Message Sabit Nepal 2019-05-02 17:14:31 Postgres using inefficient query plan when using OR filter, uses correct indices when using IN clause