Re: Built-in CTYPE provider

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Daniel Verite <daniel(at)manitou-mail(dot)org>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2023-12-20 22:57:16
Message-ID: 90c32479a1f486e5ecad89cb6fe5508d1ae4cfd5.camel@j-davis.com
Lists: pgsql-hackers

On Wed, 2023-12-20 at 14:24 -0500, Robert Haas wrote:
> This makes sense to me, too, but it feels like it might work out
> better for speakers of English than for speakers of other languages.

There's very little in the way of locale-specific tailoring for ctype
behaviors in ICU or glibc -- only for the 'az', 'el', 'lt', and 'tr'
locales. While English speakers like us may benefit from being aligned
with the default ctype behaviors, those behaviors are not at all
specific to 'en' locales in ICU or glibc.
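
As a rough illustration (not part of the original message), the sketch
below contrasts the default Unicode case mapping with the Turkish
dotted/dotless "i" rule, which is the kind of tailoring the 'tr' and
'az' locales apply. Python's str.upper() is used only as a stand-in for
untailored ctype behavior, and upper_tr is a hypothetical helper, not an
ICU, glibc, or PostgreSQL API.

```python
def upper_default(s: str) -> str:
    # Python's str.upper() applies Unicode default case mappings,
    # independent of any locale setting.
    return s.upper()


def upper_tr(s: str) -> str:
    # Hypothetical helper: map dotted "i" to U+0130 (capital I with dot
    # above) before applying the default mapping. Dotless U+0131 already
    # uppercases to plain "I" under the default rules.
    return s.replace("i", "\u0130").upper()


default_result = upper_default("istanbul")
turkish_result = upper_tr("istanbul")
print(default_result)
print(turkish_result)
```

For text that contains no such characters, the two functions return the
same result, which is why most locales share one ctype behavior.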

Collation varies a lot more between locales. I wouldn't call memcmp
ideal for English ('Zebra' comes before 'apple', which seems wrong to
me). If memcmp sorting does favor any particular group, I would say it
favors programmers more than English speakers. But that could just be
my perspective and I certainly understand the point that memcmp
ordering is more tolerable for some languages than others.
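
A minimal sketch of the 'Zebra' versus 'apple' point, assuming UTF-8
encoded strings: sorting by raw bytes mimics memcmp order, while the
case-folded sort is only a crude stand-in for a linguistic collation
(it is not what ICU or glibc actually do). The word list is made up for
illustration.

```python
words = ["apple", "Zebra", "banana"]

# Byte-wise comparison of the UTF-8 encoding, equivalent to memcmp order.
# Uppercase ASCII letters (0x41-0x5A) sort before lowercase (0x61-0x7A).
memcmp_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Case-insensitive order, a rough approximation of dictionary-style sorting.
casefold_order = sorted(words, key=str.casefold)

print(memcmp_order)    # ['Zebra', 'apple', 'banana']
print(casefold_order)  # ['apple', 'banana', 'Zebra']
```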

> Right now, I tend to get databases that default to en_US.utf8, and if
> the default changed to C.utf8, then the case-comparison behavior
> might be different

en_US.UTF-8 and C.UTF-8 have the same ctype behavior.

> For
> someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, a
> change to C.utf8 would be a much bigger problem, I would think.

Those locales all have the same ctype behavior.

It turns out that en_US.UTF-8 and fr_FR.UTF-8 also have the same
collation order -- no tailoring beyond root collation according to CLDR
files for 'en' and 'fr' (though note that 'fr_CA' does have tailoring).
That doesn't mean the experience of switching to memcmp order is
exactly the same for a French speaker and an English speaker, but I
think it's interesting.

> That might be OK if they don't care about
> ordering for any purpose other than equality lookups, but otherwise
> it's going to force them to change the default, where today they
> don't have to do that.

To be clear, I haven't proposed changing the initdb default. This
thread is about adding a builtin provider with builtin ctype, which I
believe a lot of users would like.

It also might be the best chance we have to get to a reasonable default
behavior at some point in the future. It would be always available,
fast, and stable, with better semantics than "C" for many locales, and
we can document it. In any case, we don't need to decide that now. If
the builtin provider is useful, we should do it.

Regards,
Jeff Davis
