Re: Built-in CTYPE provider

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Built-in CTYPE provider
Date:	2023-12-28 01:26:35
Message-ID:	6b1370d5eaba5e8c42f54c05f7bc2b8e27b8db12.camel@j-davis.com
Lists:	pgsql-hackers

On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
>
> But C.UTF-8 is not available everywhere, and there's still the
> problem that Unicode updates through libc are not aligned
> with Postgres releases.

Attached is an implementation of a built-in provider for the "C.UTF-8"
locale. That way applications (and tests!) can count on C.UTF-8 always
being available on any platform; and it also aligns with the Postgres
Unicode updates. Documentation is sparse and the patch is a bit rough,
but feedback is welcome -- it does have some basic tests which can be
used as a guide.

The C.UTF-8 locale, briefly, is a UTF-8 locale that provides simple
collation semantics (code point order) but rich ctype semantics
(lower/upper/initcap and regexes). This locale is for users who want
proper Unicode semantics for character operations (upper/lower,
regexes), but don't need a specific natural-language string sort order
to apply to all queries and indexes in their system. One might use it
as the database default collation, and use COLLATE clauses (e.g.
COLLATE UNICODE) where more specific behavior is needed.
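
To make the split between "simple collation" and "rich ctype" concrete, here is a small Python sketch. It is only an analogy, not code from the patch: Python's string comparison happens to use code point order, and its str.upper() is Unicode-aware, so the two together roughly mimic the combination described above. The helper ascii_upper is hypothetical and stands in for the ASCII-only case mapping of a plain C locale. Python's case-mapping rules may differ in detail from the tables in the patch.

```python
def ascii_upper(text: str) -> str:
    """Hypothetical helper: uppercase only a-z, as an ASCII-only ctype would."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in text)


words = ["b", "a", "B", "\u00e9", "Z"]  # "\u00e9" is precomposed e-acute

# Simple collation: Python compares str values code point by code point,
# so uppercase ASCII sorts before lowercase, and e-acute sorts last.
code_point_sorted = sorted(words)

# Poor ctype semantics: only ASCII letters change case.
ascii_result = ascii_upper("caf\u00e9")

# Rich ctype semantics: non-ASCII letters are case-mapped too.
unicode_result = "caf\u00e9".upper()

print(code_point_sorted, ascii_result, unicode_result)
```

The sort order is deliberately not linguistic; a query that needs a natural-language order would still use an explicit COLLATE clause.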

The builtin C.UTF-8 locale has the following advantages over using the
libc C.UTF-8 locale:

* Collation performance: the builtin provider uses memcmp and
abbreviated keys. In libc, these advantages are only available for the
C locale.

* Unicode version is aligned with other parts of Postgres, like
normalization.

* Available on all platforms with exactly the same semantics.

* Testable and documentable.

* Avoids index corruption risks. In theory libc C.UTF-8 should also
have stable collation, but that is not 100% true. In the builtin
provider it is 100% stable.
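
The memcmp point in the first bullet relies on a property of UTF-8: comparing the encoded bytes gives the same order as comparing code points. The following Python sketch checks that property on a handful of sample strings chosen for illustration. It is not taken from the patch, which is written in C. UTF-16 is included only as a contrast, since its byte order diverges for characters outside the Basic Multilingual Plane.

```python
# Sample strings spanning 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["a", "Z", "\u00e9", "\u20ac", "\uffff", "\U0001f600", "ab", ""]

# Python orders str values by code point.
by_code_point = sorted(samples)

# Ordering by raw UTF-8 bytes is what a memcmp-style comparison would give.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

# Contrast: UTF-16 surrogate pairs (0xD800-0xDFFF) sort below U+FFFF.
by_utf16_bytes = sorted(samples, key=lambda s: s.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)   # same order
print(by_code_point == by_utf16_bytes)  # different order
```

Because the byte order and the code point order coincide, a bytewise comparison is sufficient, and the result depends only on the stored bytes rather than on any external locale data.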

Regards,
Jeff Davis

Attachment	Content-Type	Size
v14-0001-Minor-cleanup-for-unicode-update-build-and-test.patch	text/x-patch	7.4 KB
v14-0002-Add-Unicode-property-tables.patch	text/x-patch	91.4 KB
v14-0003-Add-unicode-case-mapping-tables-and-functions.patch	text/x-patch	140.3 KB
v14-0004-Catalog-changes-preparing-for-builtin-collation-.patch	text/x-patch	46.3 KB
v14-0005-Introduce-collation-provider-builtin-for-C-and-C.patch	text/x-patch	63.1 KB

In response to

Re: Built-in CTYPE provider at 2023-12-20 12:49:20 from Daniel Verite

Responses

Re: Built-in CTYPE provider at 2023-12-29 02:57:16 from Jeff Davis
