Re: Built-in CTYPE provider

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Built-in CTYPE provider
Date:	2023-12-18 19:45:46
Message-ID:	d4616159b36de9451edd35fa7b2f36f299005c9c.camel@j-davis.com
Lists:	pgsql-hackers

On Fri, 2023-12-15 at 16:30 -0800, Jeremy Schneider wrote:
> Looking closer, patches 3 and 4 look like an incremental extension of
> this earlier idea;

Yes, it's essentially the same thing extended to a few more files. I
don't know if "incremental" is the right word though; this is a
substantial extension of the idea.

> the perl scripts download data from unicode.org and
> we've specifically defined Unicode version 15.1 and the scripts turn
> the data tables inside-out into C data structures optimized for
> lookup. That C code is then checked in to the PostgreSQL source code
> files unicode_category.h and unicode_case_table.h - right?

Yes. The standard build process shouldn't be downloading files, so the
static tables are checked in. Also, seeing the diffs of the static
tables improves the visibility of changes in case there's some mistake
or big surprise.

> Am I reading correctly that these two patches add C functions
> pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
> don't yet reference these functions anywhere? So this is just getting
> some plumbing in place?

Correct. Perhaps I should combine these into the builtin provider
thread, but these are independently testable and reviewable.

> My prediction is that updating this built-in provider eventually
> won't be any different from ICU or glibc.

The built-in provider will have several advantages because it's tied to
a PG major version:

* A physical replica can't have different semantics than the primary.
* Easier to document and test.
* Changes are more transparent and can be documented in the release
  notes, so that administrators can understand the risks and blast
  radius at pg_upgrade time.

> Later on down the road, from a user perspective, I think we should be
> careful about confusion where providers are used inconsistently. It's
> not great if one function follows built-in Unicode 15.1 rules but
> another function uses Unicode 13 rules because it happened to call an
> ICU function or a glibc function. We could easily end up with multiple
> providers processing different parts of a single SQL statement, which
> could lead to strange results in some cases.

The whole concept of "providers" is that they aren't consistent with
each other. ICU, libc, and the builtin provider will all be based on
different versions of Unicode. That's by design.

The built-in provider will be a bit better in the sense that it's
consistent with the normalization functions, and the other providers
aren't.

Regards,
Jeff Davis

In response to

Re: Built-in CTYPE provider at 2023-12-16 00:30:39 from Jeremy Schneider

Responses

Re: Built-in CTYPE provider at 2023-12-19 20:59:03 from Robert Haas
