Re: Order changes in PG16 since ICU introduction

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Joe Conway <mail(at)joeconway(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Order changes in PG16 since ICU introduction
Date: 2023-06-09 00:36:50
Message-ID: 73f97db9a37fd281bd5077ce16383cd8006b1f59.camel@j-davis.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, 2023-06-06 at 21:37 +0100, Andrew Gierth wrote:
> > > > >
> I like either "internal" or "builtin" because they correctly identify
> that no external resources are used. I'm not keen on "postgresql".

"builtin" seems to be the winner. New patch series attached with doc
and test updates.

This has been a long discussion (it's a messy problem), but I think
I've addressed the most important concerns raised. If you disagree with
something, please indicate whether it's an objection, or a more minor
difference of opinion that I can weigh against other opinions. Also
please indicate if you think something is out of scope for 16.

Patches 0001, 0002:

These patches implement the built-in provider and automatically change
provider=icu to provider=builtin when the locale is C. Other approaches
were considered:
* Pretend that ICU can support the C locale, and use similar checks
throughout the code like the libc provider does: This was somewhat of a
hack, and had potential issues with upgraded clusters, and several
people seemed to reject it.
* Switch to the libc provider for the C locale: would make the libc
provider even more complicated and had some potential for confusion,
and also has catalog representation problems when --locale is specified
along with --lc-ctype.

Ultimately we need to choose one approach, and the built-in provider
seems the nicest (though most invasive). It reflects the reality that
we don't actually use libc or icu for the C locale, and it's nicer to
document. The builtin provider seemed to get the most support.

Patch 0003:

Makes LOCALE apply to all providers. The overall feel after this patch
is that "locale" now means the collation locale, and
LC_COLLATE/LC_CTYPE are for the server environment. When using libc,
LC_COLLATE and LC_CTYPE still work as they did before, but their
relationship to database collation feels more like a special case of
the libc provider. I believe most people favor this patch and I haven't
seen recent objections.

I didn't find any surprising behaviors, but there are a few that I'd
like to draw attention to:

0. If you initdb with --locale-provider=libc, and don't specify ICU at
any later point, then none of these changes should affect you and
you'll remain on libc. If someone notices otherwise, please let me
know.

1. If you specify --locale-provider=builtin at initdb time, you *must*
specify --locale=C/POSIX, otherwise you get an error.

2. Patch 0004 is possibly out of scope for 16, but it felt consistent
with the other UI changes and low risk. Please try with/without before
objecting.

3. Daniel Verite felt that we should only change the provider from ICU
to "builtin" for the C locale if the provider is defaulting to ICU; not
if it's specified as ICU. I did not differentiate between specifying
ICU and defaulting to ICU because:
a. "libc" unconditionally uses the built-in memcmp() logic for C, it
never actually uses libc
b. If a user really wants the root locale or the en-US-u-va-posix
locale, they can specify those directly
c. I don't see any plausible case where it helps a user to keep
provider=icu when locale=C.

4. Joe Conway and Peter Eisentraut both felt that C.UTF-8 with
provider=icu should not be changed to use the builtin provider, and
instead passed on to ICU. I implemented a compromise where initdb will
change C.UTF-8 to the built-in provider; but CREATE DATABASE/COLLATION
will pass it along to ICU (which may support it as en-US-u-va-posix in
some versions, or may throw an error in other versions). My reasoning
is that initdb is pulling from the environment, and we should try
harder to succeed on any reasonable environmental settings (otherwise
initdb with default settings could fail); whereas we can be more strict
with CREATE DATABASE/COLLATION.

5. For the built-in provider, initdb defaults to UTF-8 rather than
SQL_ASCII. Otherwise, you would be unable to use ICU at all later,
because ICU doesn't support SQL_ASCII.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachment Content-Type Size
v10-0001-Introduce-collation-provider-builtin.patch text/x-patch 43.3 KB
v10-0002-ICU-for-locale-C-automatically-use-builtin-provi.patch text/x-patch 14.4 KB
v10-0003-CREATE-DATABASE-make-LOCALE-apply-to-all-collati.patch text/x-patch 30.5 KB
v10-0004-CREATE-COLLATION-default-provider.patch text/x-patch 7.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jaime Casanova 2023-06-09 00:37:13 Re: git.postgresql.org seems to be down
Previous Message Joe Conway 2023-06-09 00:22:49 Re: git.postgresql.org seems to be down