Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2023-12-20 00:18:01
Message-ID: 6682ce676994b18eaa904591cb6618abebe2d3e8.camel@j-davis.com
Lists: pgsql-hackers

On Tue, 2023-12-19 at 15:59 -0500, Robert Haas wrote:
> FWIW, the idea that we're going to develop a built-in provider seems
> to be solid, for the reasons Jeff mentions: it can be stable, and
> under our control. But it seems like we might need built-in providers
> for everything rather than just CTYPE to get those advantages, and I
> fear we'll get sucked into needing a lot of tailoring rather than just
> being able to get by with one "vanilla" implementation.

For the database default collation, I suspect a lot of users would jump
at the chance to have "vanilla" semantics. Tailoring is more important
for individual collation objects than for the database-level collation.

There are reasons you might select a tailored database collation, like
if most of the users accessing it are from a single locale, or if the
application connected to the database expects a particular collation.

But there are a lot of users for whom neither of those things is true,
and it makes zero sense to order all of the text indexes in the
database according to any one particular locale. I think these users
would prioritize stability and performance for the database collation,
and then use COLLATE clauses with ICU collations where necessary.

The question for me is how good the "vanilla" semantics need to be to
be useful as a database-level collation. Most of the performance and
stability problems come from collation, so it makes sense to me to
provide a fast and stable memcmp collation paired with richer ctype
semantics (as proposed here). Users who want something more probably
want the Unicode "root" collation, which can be provided by ICU today.
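
To illustrate the pairing described above, here is a minimal sketch in
Python rather than PostgreSQL code. It assumes UTF-8 text, uses a
hypothetical helper named memcmp_key, and uses Python's str.upper() as a
stand-in for Unicode-aware ctype behavior; none of this is the proposed
implementation.

```python
def memcmp_key(s: str) -> bytes:
    """Sort key that compares strings by their UTF-8 bytes, as memcmp would."""
    return s.encode("utf-8")


words = ["zebra", "Apple", "éclair", "apple", "Zoo"]

# Byte order depends only on the encoded bytes, not on the OS locale,
# so the result is the same on every machine.
by_bytes = sorted(words, key=memcmp_key)

# For valid UTF-8, byte order matches code point order, which is how
# Python compares str values by default.
assert by_bytes == sorted(words)

# Case mapping is a separate concern from ordering: it can follow Unicode
# rules even when the ordering is plain byte comparison.
upper = [w.upper() for w in by_bytes]

print(by_bytes)  # ['Apple', 'Zoo', 'apple', 'zebra', 'éclair']
print(upper)     # ['APPLE', 'ZOO', 'APPLE', 'ZEBRA', 'ÉCLAIR']
```

The ordering is not linguistically meaningful (uppercase sorts before
lowercase, accented letters sort last), but it is cheap and stable,
while the case mapping still handles non-ASCII characters.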

I am also still concerned that we have the wrong defaults. Almost
nobody thinks libc is a great provider, but that's the default, and
there were problems trying to change that default to ICU in PostgreSQL 16. If we
had a builtin provider, that might be a better basis for a default
(safe, fast, always available, and documentable). Then, at least if
someone picks a different locale at initdb time, they would be doing so
intentionally, rather than implicitly accepting index corruption risks
based on an environment variable.
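
As a rough sketch of what "based on an environment variable" means in
practice, the following Python function mimics the usual POSIX
precedence for the collation category (LC_ALL, then LC_COLLATE, then
LANG). The function name effective_collate_locale is hypothetical, and
this only approximates what the C library does when a program asks for
the environment's locale; it is not initdb's code.

```python
import os
from typing import Mapping


def effective_collate_locale(env: Mapping[str, str]) -> str:
    """Return the locale name a POSIX program would pick up for LC_COLLATE."""
    for var in ("LC_ALL", "LC_COLLATE", "LANG"):
        value = env.get(var, "")
        if value:  # unset and empty values are treated the same way
            return value
    return "C"  # POSIX fallback when nothing is set


# Whatever the shell happened to export decides the result.
print(effective_collate_locale(os.environ))
print(effective_collate_locale({"LANG": "en_US.UTF-8"}))                  # en_US.UTF-8
print(effective_collate_locale({"LANG": "en_US.UTF-8", "LC_ALL": "C"}))  # C
```

The point is that the choice is made by whatever happens to be in the
environment, not by an explicit decision of the person running initdb.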

Regards,
Jeff Davis
