[17] collation provider "builtin"

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [17] collation provider "builtin"
Date: 2023-06-14 22:55:05
Message-ID: 9d63548c4d86b0f820e1ff15a83f93ed9ded4543.camel@j-davis.com
Lists: pgsql-hackers

The locale "C" (and equivalently, "POSIX") is not really a libc locale;
it's implemented internally with memcmp for collation and
pg_ascii_tolower, etc., for ctype.
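
To make those semantics concrete, here is a rough Python model of what "C" behavior amounts to. It is only an illustrative sketch, not the PostgreSQL C code: the function names c_compare and ascii_tolower are made up for this example, and Python's bytes comparison stands in for memcmp plus a length tie-break.

```python
def c_compare(a: bytes, b: bytes) -> int:
    """Model of a memcmp-style comparison.

    Bytes are compared as unsigned values from left to right; if one
    string is a prefix of the other, the shorter one sorts first.
    Returns -1, 0, or 1.
    """
    return (a > b) - (a < b)


def ascii_tolower(s: str) -> str:
    """Lowercase only A-Z; every other character is left untouched."""
    return "".join(chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in s)


words = ["apple", "Zebra", "éclair", "Banana"]

# Sorting by the encoded bytes models "C" collation: all uppercase ASCII
# letters sort before lowercase ones, and non-ASCII text sorts last.
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
print(by_bytes)

# Only the ASCII letters are folded; the accented capital is unchanged.
print(ascii_tolower("ÉCOLE"))
```

The point of the sketch is that neither function consults any locale data, so there is nothing external that could change underneath an index.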

The attached patch implements a new collation provider, "builtin",
which only supports "C" and "POSIX". It does not change the initdb
default provider, so it must be requested explicitly. The user will be
guaranteed that collations with provider "builtin" will never change
semantics; therefore they need no version and indexes are not at risk
of corruption. See previous discussion[1].

(Caveat: the "C" locale ordering may depend on the specific encoding.
For UTF-8, memcmp is equivalent to code point order, but that may not
be true of other encodings. Encodings can't change during pg_upgrade,
so indexes are not at risk; but the encoding can change during
dump/reload so results may change.)
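
The encoding dependence can be seen with a small Python sketch. It is an illustration only (not taken from the patch), and the helper names byte_order and code_point_order are invented here. It sorts the same four Cyrillic letters by their encoded bytes under UTF-8 and under KOI8-R, and compares both with plain code point order.

```python
def byte_order(words, encoding):
    """Sort strings by their encoded bytes, as a memcmp-based collation would."""
    return sorted(words, key=lambda w: w.encode(encoding))


def code_point_order(words):
    """Sort strings by their sequence of Unicode code points."""
    return sorted(words, key=lambda w: [ord(ch) for ch in w])


words = ["ц", "д", "я", "а"]

cp_sorted = code_point_order(words)
utf8_sorted = byte_order(words, "utf-8")
koi8_sorted = byte_order(words, "koi8_r")

# UTF-8 byte order agrees with code point order.
print(utf8_sorted == cp_sorted)

# KOI8-R does not lay the letters out in code point order
# (it places "ц" before "д"), so byte order differs.
print(koi8_sorted == cp_sorted)
print(koi8_sorted)
```

So the same data dumped from a KOI8-R database and reloaded into a UTF-8 database could sort differently under "C", even though nothing about the collation itself changed.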

This built-in provider is just here to support "C" and "POSIX" using
memcmp/pg_ascii_*, and no other locales. It is not intended as a
general license to take on the problem of maintaining locales. We may
support some other locale name to mean "code point order", but like
UCS_BASIC, that would just be an alias for locale "C" that also checks
that the encoding is UTF-8.

Motivation:

Why not just use the "C" locale with the libc provider?

1. It's clearer to the user what's going on: Postgres is managing
the provider; we aren't passing it on to libc at all. With the libc
provider, something like C.UTF-8 leaves room for confusion[2]; with the
built-in provider, "C.UTF-8" is not a supported locale and the user
will get an error if it's requested.

2. The libc provider conflates LC_COLLATE/LC_CTYPE with the default
collation; whereas in the icu and built-in providers, they are separate
concepts. With ICU and builtin, you can set LC_COLLATE and LC_CTYPE for
a database to whatever you want at creation time.

3. If you use libc with locale "C", then future CREATE DATABASE
commands will default to the libc provider (because that would be the
provider for template0), which is not what the user wants if the
purpose is to avoid problems with external collation providers. If you
use the built-in provider instead, then future CREATE DATABASE commands
will only succeed if the user either specifies locale C or explicitly
chooses a new provider, which gives them a chance to prepare for
any challenges.

4. It makes it easier to document the trade-offs between various
providers without confusing special cases around the C locale.

[1]
https://www.postgresql.org/message-id/87sfb4gwgv.fsf%40news-spur.riddles.org.uk
[2]
https://www.postgresql.org/message-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f@manitou-mail.org

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachment Content-Type Size
v11-0001-Introduce-collation-provider-builtin.patch text/x-patch 42.9 KB
