Quick Links

Re: Collation versioning

From:	Douglas Doole <dougdoole(at)gmail(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Christoph Berg <myon(at)debian(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Collation versioning
Date:	2018-09-16 18:12:51
Message-ID:	CADE5jYKn6FA0a35P+_HJuSb_abq-D5Z-nE6rO9brjpu8jBTMgA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sun, Sep 16, 2018 at 1:20 AM Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> 3. Fix the tracking of when reindexes need to be rebuilt, so that you
> can't get it wrong (as you're alluding to above).
>

I've mentioned this in the past, but didn't seem to get any traction, so
I'll try it again ;-)

The focus on indexes when a collation changes is, in my opinion, the least
of the problems. You definitely have to worry about indexes, but they can
be easily rebuilt. What about other places where collation is hardened into
the system, such as constraints?

For example, in ICU 4.6 the handling of accents changed for French.
Previously accents were considered right-to-left but ICU 4.6 reversed this.
So consider a constraint like CHECK COL < 'coté' (last letter is U+00E9,
small letter e with acute). Prior to ICU 4.6 the value 'côte' (second
letter is U+00F4, small letter o with circumflex) would have passed this
constraint. With 4.6 or later it would be rejected because of the accent
ordering change. As soon as the collation changes, this table becomes
inconsistent and a reindex isn't going to help it. This becomes a data
cleansing problem at this point (which sucks for the user because their
data was clean immediately prior to the "upgrade").

There have similarly been cases where ICU changes have caused equal
characters to become unequal and vice versa. (Unfortunately all my ICU
notes with examples are at my previous employer.) Consider the effect on RI
constraints. The primary key can be fixed with a reindex (although dealing
with two existing values becoming equal is a pain). But then the user also
has to deal with the foreign keys since they may now have foreign keys
which have no match in the primary key.

And constraints problems are even easier than triggers. Consider a database
with complex BI rules that are implemented through triggers that fire when
values are/are not equal. If the equality of strings change, there could be
bad data throughout the tables. (At least with constraints the inter-column
dependencies are explicit in the catalogs. With triggers anything goes.)

All this collation stuff is great, and I know users want it, but it feels
like were pushing them out of an airplane with a ripped parachute every
time the collation libraries change. Maybe they'll land safely or maybe
things will get very messy.

- Doug
Salesforce

In response to

Re: Collation versioning at 2018-09-16 08:19:37 from Thomas Munro

Responses

Re: Collation versioning at 2018-09-16 18:39:12 from Andrew Gierth
Re: Collation versioning at 2018-09-16 20:13:34 from Thomas Munro
Re: Collation versioning at 2018-09-27 21:29:52 from Peter Eisentraut

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andrew Gierth	2018-09-16 18:39:12	Re: Collation versioning
Previous Message	Tom Lane	2018-09-16 18:05:27	Re: XMLNAMESPACES (was Re: Clarification of nodeToString() use cases)