From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | Daniel Verite <daniel(at)manitou-mail(dot)org> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Built-in CTYPE provider |
Date: | 2023-12-28 01:26:35 |
Message-ID: | 6b1370d5eaba5e8c42f54c05f7bc2b8e27b8db12.camel@j-davis.com |
Lists: | pgsql-hackers |
On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
>
> But C.UTF-8 is not available everywhere, and there's still the
> problem that Unicode updates through libc are not aligned
> with Postgres releases.
Attached is an implementation of a built-in provider for the "C.UTF-8"
locale. That way applications (and tests!) can count on C.UTF-8 always
being available on any platform; and it also aligns with the Postgres
Unicode updates. Documentation is sparse and the patch is a bit rough,
but feedback is welcome -- it does have some basic tests which can be
used as a guide.
The C.UTF-8 locale, briefly, is a UTF-8 locale that provides simple
collation semantics (code point order) but rich ctype semantics
(lower/upper/initcap and regexes). This locale is for users who want
proper Unicode semantics for character operations (upper/lower,
regexes), but don't need a specific natural-language string sort order
to apply to all queries and indexes in their system. One might use it
as the database default collation, and use COLLATE clauses (e.g.
COLLATE UNICODE) where more specific behavior is needed.
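To make the split between simple collation and rich ctype semantics concrete, here is a minimal sketch in Python. It is an illustration only, not code from the patch: the helper name `codepoint_sort` is hypothetical, and Python's Unicode tables may be a different Unicode version than the one Postgres ships. It relies on the fact that comparing UTF-8 encoded bytes gives the same result as comparing code points.

```python
# Illustration only: mimics "code point order" collation combined with
# Unicode-aware case mapping. This is not the Postgres implementation.

def codepoint_sort(strings):
    """Sort by UTF-8 encoded bytes, which is equivalent to code point order."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))

# "\u00e9" is LATIN SMALL LETTER E WITH ACUTE (precomposed).
words = ["zebra", "Apple", "\u00e9clair", "apple", "Zebra"]

# Uppercase ASCII sorts before lowercase ASCII, and non-ASCII letters
# sort after all ASCII: no natural-language ordering is applied.
ordered = codepoint_sort(words)
print(ordered)

# Case mapping is still Unicode-aware rather than limited to ASCII.
print("\u00e9clair".upper())
```

The ordering here is deliberately naive: a query that needs a natural-language sort would apply an explicit COLLATE clause instead, as described above.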
The builtin C.UTF-8 locale has the following advantages over using the
libc C.UTF-8 locale:
* Collation performance: the builtin provider uses memcmp and
abbreviated keys. In libc, these advantages are only available for the
C locale.
* Unicode version is aligned with other parts of Postgres, like
normalization.
* Available on all platforms with exactly the same semantics.
* Testable and documentable.
* Avoids index corruption risks. In theory libc C.UTF-8 should also
have stable collation, but that is not 100% true. In the builtin
provider it is 100% stable.
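The collation performance point can be sketched with a toy model of byte-wise comparison plus abbreviated keys. This is a hedged illustration in Python, not the code in the attached patches: the function names, the 8-byte key width, and the zero padding are all assumptions chosen for the example. The idea is that a fixed-size prefix of the UTF-8 bytes is packed into an integer, most comparisons are settled by comparing those integers, and the full byte strings are compared only when the prefixes tie.

```python
# Illustration only: a toy model of memcmp-style comparison with
# abbreviated keys. All names and the key width are assumptions.

ABBREV_LEN = 8  # hypothetical abbreviated-key width, in bytes

def abbrev_key(s: str) -> int:
    """Pack the first ABBREV_LEN UTF-8 bytes into an integer, zero-padded."""
    prefix = s.encode("utf-8")[:ABBREV_LEN].ljust(ABBREV_LEN, b"\x00")
    return int.from_bytes(prefix, "big")

def compare(a: str, b: str) -> int:
    """Return -1, 0, or 1. Cheap integer compare first, full bytes on a tie."""
    ka, kb = abbrev_key(a), abbrev_key(b)
    if ka != kb:
        return -1 if ka < kb else 1
    # Abbreviated keys tie: fall back to a full byte-wise comparison,
    # which plays the role of memcmp plus a length check.
    ba, bb = a.encode("utf-8"), b.encode("utf-8")
    return (ba > bb) - (ba < bb)
```

Because the ordering is defined purely by bytes, the abbreviated key can never disagree with the full comparison; it can only fail to decide, in which case the full comparison settles the result.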
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v14-0001-Minor-cleanup-for-unicode-update-build-and-test.patch | text/x-patch | 7.4 KB |
v14-0002-Add-Unicode-property-tables.patch | text/x-patch | 91.4 KB |
v14-0003-Add-unicode-case-mapping-tables-and-functions.patch | text/x-patch | 140.3 KB |
v14-0004-Catalog-changes-preparing-for-builtin-collation-.patch | text/x-patch | 46.3 KB |
v14-0005-Introduce-collation-provider-builtin-for-C-and-C.patch | text/x-patch | 63.1 KB |