Re: proposal: change behavior on collation version mismatch

From: Jeremy Schneider <schnjere(at)amazon(dot)com>
To: Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: proposal: change behavior on collation version mismatch
Date: 2023-11-30 01:03:45
Message-ID: 88ab1cc5-2d6f-4d67-8f3b-3e10f2e52b3d@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/28/23 2:12 AM, Daniel Verite wrote:
> Jeremy Schneider wrote:
>> 1) "collation changes are uncommon" (which is relatively correct)
>> 2) "most users would rather have ease-of-use than 100% safety, since
>> it's uncommon"
>>
>> And I think this led to the current behavior of issuing a warning rather
>> than an error,
> There's a technical reason for this being a warning.
> If it was an error, any attempt to do anything with the collation
> would fail, which includes REINDEX on indexes using
> that collation.
> And yet that's precisely what you're supposed to do in that
> situation.

Indexes are the most obvious and impactful corruption, so the focus is
understandable, but there's a bit of a myth in the general public that
REINDEX means you fixed your database.  I'm concerned that too many
people believe this falsehood, and don't realize that things like
constraints and partitions can also be affected by a major OS update
when leaving PG data files in place.  Also there's a tendancy to use
amcheck and validate btree indexes, but skip other index types.  And of
course none of this is possible when people mistakenly use a different
major OS for the hot standby (but Postgres willingly sends incorrect
query results to users).

This is why my original proposal included an update to the ALTER ...
REFRESH/COLLATION docs.  Today's conventional wisdom suggests this is a
safe command.  It's really not, if you're using unicode (which everyone
is). Fifteen years ago, you needed to buy a french keyboard to type
french accented characters.  Today it's a quick tap on your phone to get
chinese, russian, tibetan, emojis, and any other character you can dream
of.  All of those surprising characters eventually get stored in Postres
databases, often to the surprise of devs and admins, after they discover
corruption from an OS upgrade.

And to recap some data about historical ICU versions from the torture test:

ICU Version | OS Version | en-US characters changed collation |
zh-Hans-CN characters changed collation | fr-FR characters changed collation
55.1-7ubuntu0.5 | Ubuntu 16.04.7 LTS | 286,654 | 286,654 | 286,654
60.2-3ubuntu3.1 | Ubuntu 18.04.6 LTS | 23,741 | 24,415 | 23,741
63.1-6 | Ubuntu 19.04 | 688 | 688 | 688
66.1-2ubuntu2 | Ubuntu 20.04.3 LTS | 6,497 | 6,531 | 6,497
70.1-2 | Ubuntu 22.04 LTS | 879 | 887 | 879

The very clear trend here is that most changes are made in the root
collation rules, affecting all locales.  This means that worrying about
specific collation versions of different locales is really focusing on
an irrelevant edge case.  In ICU development, all the locales tend to
change.

If anyone thinks the Collation Apocalypse is bad now, I predict the
Kubernetes wave will be mayhem.  Fifteen years ago it was rare to
physically move PG datafiles to a new major OS.  Most people would dump
and load their databases, sized in GBs.  Today's multi-TB Postgres
databases have meant an increase of in-place OS upgrades in recent
years.  People started to either detach/attach their storage, or they
used a hot standby. Kubernetes will make these moves across major OS's a
daily, effortless occurrence.

ICU doesn't fix anything directly.  We do need ICU - only because it
finally enables us to compile that old version of ICU forever on every
new OS we move to going forward. This was simply impossible with glibc.
Over the past couple decades, not even Oracle or IBM has managed to
deprecate a single version of ICU from a relational database, and not
for lack of desire.

-Jeremy

--
Jeremy Schneider
Performance Engineer
Amazon Web Services

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2023-11-30 01:06:48 Re: pg_upgrade and logical replication
Previous Message Jeff Davis 2023-11-30 00:23:22 Re: encoding affects ICU regex character classification