Re: Update Unicode data to Unicode 16.0.0

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Peter Eisentraut <peter(at)eisentraut(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Update Unicode data to Unicode 16.0.0
Date: 2025-03-07 21:11:18
Message-ID: 16c4e37d4c89e63623b009de9ad6fb90e7456ed8.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 2025-03-05 at 20:43 -0600, Nathan Bossart wrote:
> I see.  Do we provide any suggested next steps for users to assess
> the
> potentially-affected relations?

I don't know exactly where we should document it, but I've attached a
SQL file that demonstrates what can happen for a PG17->PG18 upgrade,
assuming that we've updated Unicode to 16.0.0 in PG18.

The change in Unicode that I'm focusing on is the addition of U+A7DC,
which is unassigned in Unicode 15.1 and assigned in Unicode 16, which
lowercases to U+019B. The examples assume that the user is using
unassigned code points in PG17/Unicode15.1 and the PG_C_UTF8 collation.

The exmaple table 'test' has a single row with U+A7DC and an index
test_idx on LOWER(t). Because the codepoint is unassigned in PG17,
LOWER(U&'\A7DC') returns itself, while in PG18 it returns U&'\019B'.

The first thing the user should do if they see the warnings during the
--check phase is to rerun with "--check --retain" so they can keep the
file. Maybe that's not great, and we should keep the file regardless
for these particular kinds of warnings.

Then go ahead and upgrade. After upgrade, the user should:

REINDEX INDEX CONCURRENTLY test_idx;

each potentially-affected index. If they have amcheck installed, they
can do:

select bt_index_check(index=>'test_idx', heapallindexed=>true);

first, and only REINDEX if there's an error to fix. Do the same for
other indexes listed in the warning file.

The next example is tuniq, which is similar except that the index is a
unique index, and the table has two values: U&'\019B' and U&'\A7DC'. In
PG17, LOWER() returns U&'\019B' and U&'\A7DC', respectively (the former
because it's already lowercase, and the latter because it's unassigned
and returns itself). In PG18, LOWER() returns U&'\019B' and U&'\019B'.
So, a REINDEX will encounter a uniqueness violation, and the user will
need to decide how to handle the duplicate values.

The next example is tcheck, which does CHECK(LOWER(t) = t), which is
the most plausible check constraint I can think of that would be
affected. In theory, the user might have to reason through what should
be done about a tuple that violates it, but in practice it might be as
simple as:

UPDATE tcheck SET t = LOWER(t) WHERE t <> LOWER(t);

The last example is partitioning where the table is range partitioned
on LOWER(t), and the tuple ends up in the wrong partition on PG18.
After fixing the indexes with:

-- tpart0_lower_idx is fine, no REINDEX needed
select bt_index_check(index=>'tpart0_lower_idx',
heapallindexed=>true);
select bt_index_check(index=>'tpart1_lower_idx',
heapallindexed=>true);
REINDEX INDEX CONCURRENTLY tpart1_lower_idx;

See the problem with:

SET enable_partition_pruning=off;
-- correctly finds one tuple
SELECT * FROM tpart WHERE LOWER(t) = U&'\019B';

SET enable_partition_pruning TO DEFAULT;
-- incorrectly finds zero tuples
SELECT * FROM tpart WHERE LOWER(t) = U&'\019B';

Now move the tuples to the right partitions:

BEGIN;
WITH d AS (
DELETE FROM tpart0
WHERE NOT (LOWER(t) >= U&'\0001' AND LOWER(t) < U&'\8000')
RETURNING *
) INSERT INTO tpart SELECT * FROM d;
WITH d AS (
DELETE FROM tpart1
WHERE NOT (LOWER(t) >= U&'\8000' AND LOWER(t) < U&'\FFFF')
RETURNING *
) INSERT INTO tpart SELECT * FROM d;
COMMIT;

The order of operations should be to fix indexes, unique constraints,
and check constraints first; and then to fix partitioned tables. That
way the above partitioned table queries get correct results using the
indexes.

Granted, this is not a great user experience, and we could do better in
the future (e.g. add some new utility commands). But this is an
improvement in the sense that it's a small number of affected objects,
unlike a collation version change, which affects anything related to
text.

Regards,
Jeff Davis

Attachment Content-Type Size
unicode-upgrade.sql application/sql 1.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2025-03-07 21:23:51 Re: AIO v2.5
Previous Message Robert Treat 2025-03-07 20:46:19 Re: Statistics Import and Export