Quick Links

Re: Update Unicode data to Unicode 16.0.0

From:	vignesh C <vignesh21(at)gmail(dot)com>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Nathan Bossart <nathandbossart(at)gmail(dot)com>, Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Update Unicode data to Unicode 16.0.0
Date:	2025-03-16 13:40:53
Message-ID:	CALDaNm3V7-pW_=Fb-DChdKS_mDiYCmxmvepfNQHP+SiVr3KaYg@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sat, 8 Mar 2025 at 02:41, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Wed, 2025-03-05 at 20:43 -0600, Nathan Bossart wrote:
> > I see. Do we provide any suggested next steps for users to assess
> > the
> > potentially-affected relations?
>
> I don't know exactly where we should document it, but I've attached a
> SQL file that demonstrates what can happen for a PG17->PG18 upgrade,
> assuming that we've updated Unicode to 16.0.0 in PG18.
>
> The change in Unicode that I'm focusing on is the addition of U+A7DC,
> which is unassigned in Unicode 15.1 and assigned in Unicode 16, which
> lowercases to U+019B. The examples assume that the user is using
> unassigned code points in PG17/Unicode15.1 and the PG_C_UTF8 collation.
>
> The exmaple table 'test' has a single row with U+A7DC and an index
> test_idx on LOWER(t). Because the codepoint is unassigned in PG17,
> LOWER(U&'\A7DC') returns itself, while in PG18 it returns U&'\019B'.
>
> The first thing the user should do if they see the warnings during the
> --check phase is to rerun with "--check --retain" so they can keep the
> file. Maybe that's not great, and we should keep the file regardless
> for these particular kinds of warnings.
>
> Then go ahead and upgrade. After upgrade, the user should:
>
> REINDEX INDEX CONCURRENTLY test_idx;
>
> each potentially-affected index. If they have amcheck installed, they
> can do:
>
> select bt_index_check(index=>'test_idx', heapallindexed=>true);
>
> first, and only REINDEX if there's an error to fix. Do the same for
> other indexes listed in the warning file.
>
> The next example is tuniq, which is similar except that the index is a
> unique index, and the table has two values: U&'\019B' and U&'\A7DC'. In
> PG17, LOWER() returns U&'\019B' and U&'\A7DC', respectively (the former
> because it's already lowercase, and the latter because it's unassigned
> and returns itself). In PG18, LOWER() returns U&'\019B' and U&'\019B'.
> So, a REINDEX will encounter a uniqueness violation, and the user will
> need to decide how to handle the duplicate values.
>
> The next example is tcheck, which does CHECK(LOWER(t) = t), which is
> the most plausible check constraint I can think of that would be
> affected. In theory, the user might have to reason through what should
> be done about a tuple that violates it, but in practice it might be as
> simple as:
>
> UPDATE tcheck SET t = LOWER(t) WHERE t <> LOWER(t);
>
> The last example is partitioning where the table is range partitioned
> on LOWER(t), and the tuple ends up in the wrong partition on PG18.
> After fixing the indexes with:
>
> -- tpart0_lower_idx is fine, no REINDEX needed
> select bt_index_check(index=>'tpart0_lower_idx',
> heapallindexed=>true);
> select bt_index_check(index=>'tpart1_lower_idx',
> heapallindexed=>true);
> REINDEX INDEX CONCURRENTLY tpart1_lower_idx;
>
> See the problem with:
>
> SET enable_partition_pruning=off;
> -- correctly finds one tuple
> SELECT * FROM tpart WHERE LOWER(t) = U&'\019B';
>
> SET enable_partition_pruning TO DEFAULT;
> -- incorrectly finds zero tuples
> SELECT * FROM tpart WHERE LOWER(t) = U&'\019B';
>
> Now move the tuples to the right partitions:
>
> BEGIN;
> WITH d AS (
> DELETE FROM tpart0
> WHERE NOT (LOWER(t) >= U&'\0001' AND LOWER(t) < U&'\8000')
> RETURNING *
> ) INSERT INTO tpart SELECT * FROM d;
> WITH d AS (
> DELETE FROM tpart1
> WHERE NOT (LOWER(t) >= U&'\8000' AND LOWER(t) < U&'\FFFF')
> RETURNING *
> ) INSERT INTO tpart SELECT * FROM d;
> COMMIT;
>
> The order of operations should be to fix indexes, unique constraints,
> and check constraints first; and then to fix partitioned tables. That
> way the above partitioned table queries get correct results using the
> indexes.
>
> Granted, this is not a great user experience, and we could do better in
> the future (e.g. add some new utility commands). But this is an
> improvement in the sense that it's a small number of affected objects,
> unlike a collation version change, which affects anything related to
> text.

We currently have two Commitfest entries for the same thread at [1]
and [2]. Are both still necessary, or can we consolidate tracking into
a single entry?
[1] - https://commitfest.postgresql.org/patch/5472/
[2] - https://commitfest.postgresql.org/patch/5563/

Regards,
Vignesh

In response to

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-07 21:11:18 from Jeff Davis

Responses

Re: Update Unicode data to Unicode 16.0.0 at 2025-03-17 17:33:31 from Jeff Davis

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Daniel Gustafsson	2025-03-16 13:48:58	Re: Adding support for SSLKEYLOGFILE in the frontend
Previous Message	vignesh C	2025-03-16 13:32:11	Re: Implement waiting for wal lsn replay: reloaded