From: | rihad <rihad(at)mail(dot)ru> |
---|---|
To: | Daniel Verite <daniel(at)manitou-mail(dot)org> |
Cc: | Peter Geoghegan <pg(at)bowt(dot)ie>, pgsql-general General <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Upgrading locale issues |
Date: | 2019-05-02 17:45:25 |
Message-ID: | 352ac37a-39e6-5afe-d822-b4e031b9b314@mail.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
On 05/02/2019 05:36 PM, Daniel Verite wrote:
> rihad wrote:
>
>> Thanks for the reply. Do you know what would a "decent" ICU collation be
>> to bind to a field's schema definition so it would mimic a UTF-8
>> encoding for a multilingual column? Maybe und-x-icu? We aren't as much
>> concerned about their sortability in most cases, we just want indexes to
>> better handle future PG/ICU upgrades. But what does und(efined) even
>> mean with respect to collations?
> "undefined" in this context means unspecified language and
> unspecified country or region. It implies that no language-specific
> nor regional rule will be applied to compare strings.
>
> Using C.UTF-8 as the collation for text fields to index may be the
> best trade-off in your case. It should be immune to libc and ICU
> upgrades.
>
> With C.UTF-8, a string like 'BC' will sort before 'ab', and punctuation
> and accents will also sort differently than with a linguistic-aware
> collation.
Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
things: for speed and for skipping the ORDER BY step (since btree
indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
still work correctly for table lookups? And can the existingÂ
en_US.UTF-8 fields' definition be altered in place, without a
dump+restore? en_US.UTF-8 is the default encoding+locale+collation, it
isn't set explicitly for any of our string columns. I assume there's
some "catch-all" ordering taking place even for the C locale, so there
won't be any bizarre things like b coming before a, or generally for any
language, the second letter of its alphabet coming before the first?
> If your applications care about that, it can be fixed by simply
> adding COLLATE "default" to the ORDER BY clause of the queries that
> are meant to present data to users.
> COLLATE "default" means the collation of the database, which
> presumably would be something like "language_REGION.UTF-8" in your
> case. If you never specified it explicitly, it came from initdb which
> itself got it from the environment of the server.
>
>
> Best regards,
From | Date | Subject | |
---|---|---|---|
Next Message | Ray Cote | 2019-05-02 18:22:46 | Debugging Failed Startup |
Previous Message | Sabit Nepal | 2019-05-02 17:14:31 | Postgres using inefficient query plan when using OR filter, uses correct indices when using IN clause |