Re: pg_collation.collversion for C.UTF-8

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pg_collation.collversion for C.UTF-8
Date: 2023-06-22 21:22:19
Message-ID: CA+hUKGK-wJoNsCr3vU-vfCp2X0eKH3+THzEbcbJ_UKxp0kLB=g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 20, 2023 at 6:48 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote:
> > > Would it be correct to interpret LC_COLLATE=C.UTF-8 as
> > > LC_COLLATE=C,
> > > but leave LC_CTYPE=C.UTF-8 as-is?
> >
> > Yes. The basic idea, at least for these two OSes, is that every
> > category behaves as if set to C, except LC_CTYPE.
>
> If that's true, and we version C.UTF-8, then users could still get the
> behavior they want, a stable collation order, and benefit from the
> optimized code path by setting LC_COLLATE=C and LC_CTYPE=C.UTF-8.
>
> The only caveat is to be careful with things that depend on ctype in
> indexes and constraints. While still a problem, it's a smaller problem
> than unversioned collation. We should think a little more about solving
> it, because I think there's a strong case to be made that a default
> collation of C and a database ctype of something else is a good
> combination (it makes less sense for a case-insensitive collation, but
> those aren't allowed as a default collation).
>
> In any case, we're better off following the rule "version anything that
> goes to any external provider, period". And by "version", I really mean
> a best effort, because we don't always have great information, but I
> think it's better to record what we do have than not. We have just seen
> too many examples of weird behavior. On top of that, it's simply
> inconsistent to assume that C=C.UTF-8 for collation version, but not
> for the collation implementation.

Yeah, OK, you're convincing me. It's hard to decide because our model
is basically wrong, so it only warns you about potential ctype changes
by happy coincidence. But even with respect to sort order, it was
probably a mistake to start second-guessing what libc is doing, and
given that observation about the C/C.UTF-8 combination, at least an
end user has a way to opt in or out of this choice. I'll try to write a
concise commit message for Daniel's patch explaining all this, and we
can see about squeaking it into beta2.
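
To make the C/C.UTF-8 combination concrete, here is a rough sketch (not
PostgreSQL code) that asks libc to sort a few ASCII strings with
LC_COLLATE=C while requesting LC_CTYPE=C.UTF-8. The helper name
sort_with_libc is made up for illustration, and the sketch assumes that
libc's C collation is plain code-point order, which is the stable
behavior a user would be opting into.

```python
import locale


def sort_with_libc(words, collate="C", ctype="C.UTF-8"):
    """Sort words through libc's strxfrm under the given locale settings.

    Hypothetical helper for illustration only; it is not part of PostgreSQL.
    """
    # Remember the current settings so they can be restored afterwards.
    old_collate = locale.setlocale(locale.LC_COLLATE)
    old_ctype = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_COLLATE, collate)
        try:
            locale.setlocale(locale.LC_CTYPE, ctype)
        except locale.Error:
            # C.UTF-8 is not installed on every system. The ctype setting
            # is only here to mirror the combination discussed above.
            pass
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, old_collate)
        locale.setlocale(locale.LC_CTYPE, old_ctype)


words = ["b", "A", "a", "B"]
print(sort_with_libc(words))
# With LC_COLLATE=C this is expected to match a plain code-point sort.
print(sort_with_libc(words) == sorted(words))
```

The point is only that the collation category can be pinned to C
independently of the ctype category.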

> Users might get frustrated that the collation for C.UTF-8 is versioned,
> of course. But I don't think it will affect anyone for quite some time,
> because existing users will have a datcollversion=NULL; so they won't
> get the warnings until they refresh the versions (or create new
> collations/databases), and then after that upgrade libc. Right? So they
> should have time to adjust to use LC_COLLATE=C if that's what they
> want.

Yeah.
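
As a simplified model of the timeline Jeff describes (a sketch of the
idea, not the actual PostgreSQL logic), a warning only makes sense once
a version has been recorded and the provider later reports a different
one. The function name version_mismatch and the version strings below
are invented for illustration.

```python
from typing import Optional


def version_mismatch(recorded: Optional[str], actual: Optional[str]) -> bool:
    """Return True when a recorded collation version differs from the
    version the provider reports now.

    Hypothetical model: a missing value on either side means there is
    nothing to compare, so no warning is raised.
    """
    if recorded is None or actual is None:
        return False
    return recorded != actual


# Existing database: nothing recorded yet, so no warning.
print(version_mismatch(None, "2.36"))
# After refreshing the version: recorded and actual agree.
print(version_mismatch("2.36", "2.36"))
# After a later libc upgrade: the versions differ.
print(version_mismatch("2.36", "2.37"))
```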

> An alternative would be to define lc_collate_is_c("C.UTF-8") == true
> while leaving lc_ctype_is_c("C.UTF-8") == false and
> get_collation_actual_version("C.UTF-8") == NULL. In that case we would
> not be passing it to an external provider, so we don't have to version
> it. But that might be a little too magical and I'm not inclined to do
> that.

Agreed, let's not do any more of that sort of thing.

> Another alternative would be to implement C.UTF-8 internally according
> to the "true" semantics, if they are truly simple and well-defined and
> stable. But I don't think ctype=C.UTF-8 is actually stable because new
> characters can be added, right?

Correct.
