From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | Daniel Verite <daniel(at)manitou-mail(dot)org>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: pg_collation.collversion for C.UTF-8 |
Date: | 2023-06-22 21:22:19 |
Message-ID: | CA+hUKGK-wJoNsCr3vU-vfCp2X0eKH3+THzEbcbJ_UKxp0kLB=g@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, Jun 20, 2023 at 6:48 AM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Sat, 2023-06-17 at 17:54 +1200, Thomas Munro wrote:
> > > Would it be correct to interpret LC_COLLATE=C.UTF-8 as
> > > LC_COLLATE=C,
> > > but leave LC_CTYPE=C.UTF-8 as-is?
> >
> > Yes. The basic idea, at least for these two OSes, is that every
> > category behaves as if set to C, except LC_CTYPE.
>
> If that's true, and we version C.UTF-8, then users could still get the
> behavior they want, a stable collation order, and benefit from the
> optimized code path by setting LC_COLLATE=C and LC_CTYPE=C.UTF-8.
>
> The only caveat is to be careful with things that depend on ctype in
> indexes and constraints. While still a problem, it's a smaller problem
> than unversioned collation. We should think a little more about solving
> it, because I think there's a strong case to be made that a default
> collation of C and a database ctype of something else is a good
> combination (it makes less sense for a case-insensitive collation, but
> those aren't allowed as a default collation).
>
> In any case, we're better off following the rule "version anything that
> goes to any external provider, period". And by "version", I really mean
> a best effort, because we don't always have great information, but I
> think it's better to record what we do have than not. We have just seen
> too many examples of weird behavior. On top of that, it's simply
> inconsistent to assume that C=C.UTF-8 for collation version, but not
> for the collation implementation.
Yeah, OK, you're convincing me. It's hard to decide because our model
is basically wrong, so it only warns you about potential ctype changes
by happy coincidence. But even with respect to sort order, it was
probably a mistake to start second-guessing what libc is doing, and
with that observation about the C/C.UTF-8 combination, at least an
end user has a way to opt in to or out of this choice. I'll try to write a
concise commit message for Daniel's patch explaining all this, and we
can see about squeaking it into beta2.
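To illustrate why LC_COLLATE=C gives a stable order, here is a minimal Python sketch. It is not PostgreSQL code. It assumes that the C collation amounts to a memcmp-style comparison of the encoded bytes, which for UTF-8 matches code point order. The helper name c_collation_sort is made up for this example.

```python
def c_collation_sort(strings, encoding="utf-8"):
    """Sort strings by their encoded bytes, mimicking a memcmp-style C collation.

    Hypothetical helper for illustration only. No locale library is
    consulted, so the result cannot change when libc is upgraded.
    """
    return sorted(strings, key=lambda s: s.encode(encoding))


words = ["b", "a", "B", "é", "Z"]
# Uppercase ASCII sorts before lowercase ASCII, and non-ASCII sorts last.
print(c_collation_sort(words))  # ['B', 'Z', 'a', 'b', 'é']
```

Because the order depends only on the bytes, there is nothing external to version, which is the property a user opts into by choosing LC_COLLATE=C with LC_CTYPE=C.UTF-8.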
> Users might get frustrated that the collation for C.UTF-8 is versioned,
> of course. But I don't think it will affect anyone for quite some time,
> because existing users will have a datcollversion=NULL; so they won't
> get the warnings until they refresh the versions (or create new
> collations/databases), and then after that upgrade libc. Right? So they
> should have time to adjust to use LC_COLLATE=C if that's what they
> want.
Yeah.
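The timeline in the quoted paragraph can be sketched as a small decision function. This is a simplified Python model, not PostgreSQL's implementation: the function name should_warn and the version strings are invented for illustration. The one thing carried over from the discussion is that a recorded version of NULL (None here) means no warning is raised.

```python
from typing import Optional


def should_warn(recorded: Optional[str], actual: Optional[str]) -> bool:
    """Return True when a collation version warning would be expected.

    recorded: version stored in the catalog (None models datcollversion = NULL).
    actual:   version currently reported by the provider (None if unknown).
    Hypothetical model for illustration only.
    """
    if recorded is None:
        # Nothing was recorded, so there is nothing to compare against.
        return False
    return recorded != actual


# Existing database, no version recorded, libc upgraded: no warning.
print(should_warn(None, "2.37"))    # False
# Version refreshed (or new database created), then libc upgraded: warning.
print(should_warn("2.36", "2.37"))  # True
# Recorded and actual versions still match: no warning.
print(should_warn("2.36", "2.36"))  # False
```

Under this model, an existing user sees a warning only after two separate events: a version gets recorded, and libc later reports a different one.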
> An alternative would be to define lc_collate_is_c("C.UTF-8") == true
> while leaving lc_ctype_is_c("C.UTF-8") == false and
> get_collation_actual_version("C.UTF-8") == NULL. In that case we would
> not be passing it to an external provider, so we don't have to version
> it. But that might be a little too magical and I'm not inclined to do
> that.
Agreed, let's not do any more of that sort of thing.
> Another alternative would be to implement C.UTF-8 internally according
> to the "true" semantics, if they are truly simple and well-defined and
> stable. But I don't think ctype=C.UTF-8 is actually stable because new
> characters can be added, right?
Correct.