Re: Upgrading locale issues

From: "Daniel Verite" <daniel(at)manitou-mail(dot)org>
To: "rihad" <rihad(at)mail(dot)ru>
Cc: "Peter Geoghegan" <pg(at)bowt(dot)ie>,"pgsql-general General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Upgrading locale issues
Date: 2019-05-03 13:35:47
Message-ID: fce33d4f-458a-4208-8a4e-4207de86eb4d@manitou-mail.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

rihad wrote:

> Thanks, I'm a bit confused here. AFAIK indexes are used for at least two
> things: for speed and for skipping the ORDER BY step (since btree
> indexes are already sorted). Will such an "upgrade-immune" C.UTF-8 index
> still work correctly for table lookups?

If the lookup is based on a equality test or a pattern match with LIKE
or a regexp, it makes no difference. But the locale makes a
difference with inequality tests, such as < > or BETWEEN.

Around version 9.1 and in the pre-ICU days, Robert Haas wrote
this post that explained it pretty well, I think:
http://rhaas.blogspot.com/2012/03/perils-of-collation-aware-comparisons.html

Quote:

If you happen to need the particular sorting behavior that
collation-aware sorting and comparisons provide, then you may find
this price worth paying, but I suspect there are a lot of people out
there who are paying it more or less accidentally and don't really
care very much about the underlying sorting behavior. If, for
example, all of your queries are based on equality, and you don't
use greater-than or less-than tests, then it doesn't matter what
collation is in use. You might as well use "C" instead of whatever
your local default may be, because it's faster.

For non-English text, I would recommend C.UTF-8 over "C" because of
its better handling of Unicode characters. For instance:

=# select upper('été' collate "C"), upper('été' collate "C.UTF-8");
upper | upper
-------+-------
éTé | ÉTÉ

The "price" of linguistic comparisons that Robert mentioned was about
performance, but the troubles we have with the lack of long-term
immutability of these locales are worth being added to that.

> And can the existing en_US.UTF-8 fields' definition be altered in
> place, without a dump+restore?

Changing the COLLATE clause of a column with ALTER TABLE does
implicitly rebuild an index on this column if there is one,
A dump+restore is not needed, nor an explicit REINDEX.

The dump+restore is needed in another scenario, where you would
decide to change the LC_COLLATE and LC_CTYPE of the database,
instead of doing it only for some columns.
This scenario makes perfect sense if the locale of the database
has been set implicitly and it uses linguistic sorts without
really needing them ("accidentally" as said in the post).

> en_US.UTF-8 is the default encoding+locale+collation, it
> isn't set explicitly for any of our string columns. I assume there's
> some "catch-all" ordering taking place even for the C locale, so there
> won't be any bizarre things like b coming before a, or generally for any
> language, the second letter of its alphabet coming before the first?

'b' < 'a' is never true for any locale AFAIK, it is 'B' < 'a' that is
true for some locales such as C or C.UTF-8.

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2019-05-03 13:36:51 Re: Migrating database(s) from Sybase ASE 15.7 to PostgreSQL 10.6 on Linux
Previous Message Adrian Klaver 2019-05-03 13:33:58 Re: Migrating database(s) from Sybase ASE 15.7 to PostgreSQL 10.6 on Linux