Re: Built-in CTYPE provider

From: Jeremy Schneider <schneider(at)ardentperf(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2023-12-16 00:30:39
Message-ID: 578cf85a-1c27-434c-b89e-27a64d8e023a@ardentperf.com
Lists: pgsql-hackers

On 12/13/23 5:28 AM, Jeff Davis wrote:
> On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
>> My biggest concern is around maintenance. Every year Unicode is
>> assigning new characters to existing code points, and those existing
>> code points can of course already be stored in old databases before
>> libs
>> are updated.
>
> Is the concern only about unassigned code points?
>
> I already committed a function "unicode_assigned()" to test whether a
> string contains only assigned code points, which can be used in a
> CHECK() constraint. I also posted[5] an idea about a per-database
> option that could reject the storage of any unassigned code point,
> which would make it easier for users highly concerned about
> compatibility.

I didn't know about this. Did a few smoke tests against today's head on
git and it's nice to see the function working as expected. :)

test=# select unicode_version();
unicode_version
-----------------
15.1

test=# select chr(3212),unicode_assigned(chr(3212));
chr | unicode_assigned
-----+------------------
ಌ | t

-- unassigned code point inside assigned block
test=# select chr(3213),unicode_assigned(chr(3213));
chr | unicode_assigned
-----+------------------
಍ | f

test=# select chr(3214),unicode_assigned(chr(3214));
chr | unicode_assigned
-----+------------------
ಎ | t

-- unassigned block
test=# select chr(67024),unicode_assigned(chr(67024));
chr | unicode_assigned
-----+------------------
𐗐 | f

test=# select chr(67072),unicode_assigned(chr(67072));
chr | unicode_assigned
-----+------------------
𐘀 | t
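
For cross-checking outside the database, Python's unicodedata module exposes
roughly the same assignment information; this is only an approximate analogue
of unicode_assigned(), and the Unicode version it reflects depends on the
Python build:

```python
import unicodedata

def is_assigned(cp: int) -> bool:
    # A code point with general category "Cn" is unassigned; this only
    # approximates unicode_assigned(), and results depend on the Unicode
    # version shipped with this Python build.
    return unicodedata.category(chr(cp)) != "Cn"

print(unicodedata.unidata_version)  # Unicode version of this Python build
print(is_assigned(0x0C8C))  # U+0C8C KANNADA LETTER VOCALIC L -> True
print(is_assigned(0x0C8D))  # unassigned gap inside the Kannada block -> False
```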

Looking closer, patches 3 and 4 look like an incremental extension of
this earlier idea: the perl scripts download data from unicode.org (we've
pinned Unicode version 15.1), and they turn the data tables inside-out
into C data structures optimized for lookup. That C code is then checked
in to the PostgreSQL source code files unicode_category.h and
unicode_case_table.h - right?
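
To make sure I understand the transformation, here's a drastically
simplified sketch of that generation step in Python (the real scripts are
perl and handle far more; the struct and enum names below are invented for
illustration, and only the semicolon-delimited field layout matches the
published UnicodeData.txt format):

```python
# Parse UnicodeData.txt-style lines and emit a static C array.
# Sample lines embedded here rather than downloaded from unicode.org.
SAMPLE = """\
0C8C;KANNADA LETTER VOCALIC L;Lo;0;L;;;;;N;;;;;
0C8E;KANNADA LETTER E;Lo;0;L;;;;;N;;;;;
"""

entries = []
for line in SAMPLE.splitlines():
    fields = line.split(";")
    # Field 0 is the code point (hex); field 2 is the general category.
    entries.append((int(fields[0], 16), fields[2]))

# Emit a C table; the type and enum names are made up, not PostgreSQL's.
print("static const pg_unicode_category_entry entries[] = {")
for cp, cat in entries:
    print(f"\t{{0x{cp:06X}, PG_U_{cat}}},")
print("};")
```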

Am I reading correctly that these two patches add C functions
pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
don't yet reference these functions anywhere? So this is just getting
some plumbing in place?
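
If I'm reading the generated headers right, the lookup side is essentially
a binary search over sorted code-point ranges. A toy sketch of the idea (the
ranges below are invented for illustration, not the real table contents):

```python
from bisect import bisect_right

# Tiny illustrative (start, end, category) range table, kept sorted by start.
RANGES = [
    (0x0041, 0x005A, "Lu"),  # A-Z
    (0x0061, 0x007A, "Ll"),  # a-z
    (0x0C8C, 0x0C8C, "Lo"),  # KANNADA LETTER VOCALIC L
]
STARTS = [r[0] for r in RANGES]

def category(cp: int) -> str:
    # Find the last range starting at or before cp, then check containment.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and RANGES[i][0] <= cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "Cn"  # anything outside the table is treated as unassigned

print(category(ord("Q")))  # Lu
print(category(0x0C8D))    # Cn: falls in the gap after U+0C8C
```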

>> And we may end up with
>> something like the timezone database where we need to periodically
>> add a
>> more current ruleset - albeit alongside as a new version in this
>> case.
>
> There's a build target "update-unicode" which is run to pull in new
> Unicode data files and parse them into static C arrays (we already do
> this for the Unicode normalization tables). So I agree that the tables
> should be updated but I don't understand why that's a problem.

I don't want to get stuck on this. I agree with the general approach of
beginning to add a provider for locale functions inside the database. We
have a while before Unicode 16 comes out - plenty of time for bikeshedding.

My prediction is that updating this built-in provider eventually won't
be any different from ICU or glibc. It depends a bit on how we
specifically build on this plumbing - but when Unicode 16 comes out,
I'll try to come up with a simple repro on a default DB config where
changing the Unicode version causes corruption (it was pretty easy to
demonstrate for ICU collation, if you knew where to look)... but I don't
think that discussion should derail this commit, because for now we're
just starting the process of getting Unicode 15.1 into the PostgreSQL
code base. We can cross the "update" bridge when we come to it.
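
The failure mode I have in mind looks roughly like a functional index whose
keys were computed under one Unicode version and probed under another. A toy
model in Python (the changed mapping for "x" is invented for illustration;
it is not a real Unicode change):

```python
# Hypothetical case-mapping tables before and after a Unicode update.
upper_v15 = {"a": "A"}             # old table: "x" has no uppercase mapping
upper_v16 = {"a": "A", "x": "Y"}   # new table: "x" now maps to "Y" (invented)

def fold(s, table):
    # Stand-in for a provider's upper() used by a functional index.
    return "".join(table.get(ch, ch) for ch in s)

rows = ["apple", "xylem"]
index = {fold(r, upper_v15): r for r in rows}  # keys built before the update

probe = fold("xylem", upper_v16)               # lookup after the update
print(probe in index)  # False: the row exists but the index probe misses it
```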

Later on down the road, from a user perspective, I think we should be
careful about confusion where providers are used inconsistently. It's
not great if one function follows built-in Unicode 15.1 rules but another
function uses Unicode 13 rules because it happened to call an ICU
function or a glibc function. We could easily end up with multiple
providers processing different parts of a single SQL statement, which
could lead to strange results in some cases.

Ideally a user just specifies a default provider for their database, and the
rules for that version of Unicode are used as consistently as possible -
unless a user explicitly overrides their choice in a table/column
definition, query, etc. But it might take a little time and work to get
to this point.

-Jeremy

--
http://about.me/jeremy_schneider
