Re: Built-in CTYPE provider

From: Peter Eisentraut <peter(at)eisentraut(dot)org>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Daniel Verite <daniel(at)manitou-mail(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeremy Schneider <schneider(at)ardentperf(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in CTYPE provider
Date: 2024-03-26 07:14:46
Message-ID: f7a98d94-1708-4d51-b7e9-0d07f7a1e46b@eisentraut.org
Lists: pgsql-hackers

On 25.03.24 18:52, Jeff Davis wrote:
> OK, I'll propose a "title" or "titlecase" function for 18, along with
> "casefold" (which I was already planning to propose).

(Yay, casefold will be useful.)

> What do you think about UPPER/LOWER and full case mapping? Should there
> be extra arguments for full vs simple case mapping, or should it come
> from the collation?
>
> It makes sense that the "dotted vs dotless i" behavior comes from the
> collation because that depends on locale. But full-vs-simple case
> mapping is not really a locale question. For instance:
>
> select lower('0Σ' collate "en-US-x-icu") AS lower_sigma,
>        lower('ΑΣ' collate "en-US-x-icu") AS lower_final_sigma,
>        upper('ß' collate "en-US-x-icu") AS upper_eszett;
>  lower_sigma | lower_final_sigma | upper_eszett
> -------------+-------------------+--------------
>  0σ          | ας                | SS
>
> produces the same results for any ICU collation.

I think of a collation describing what language a text is in. So it
makes sense that "dotless i" depends on the locale/collation.

Full vs. simple case mapping is more of a legacy compatibility question,
in my mind. There is some expectation/precedent that C.UTF-8 uses
simple case mapping, but beyond that, I don't see why someone would
explicitly opt for simple case mapping, except perhaps to preserve
string length. And anyone who needs that is going to be in a world of
pain in Unicode anyway.
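
To make the length-preservation point concrete outside PostgreSQL, here is
a minimal sketch in Python, whose built-in str.upper() and str.lower()
apply Unicode full case mapping without consulting any locale. The inputs
mirror the quoted query; the preserves_length helper is a hypothetical name
introduced only for this illustration, and nothing here reflects how
PostgreSQL or ICU implement the mapping internally.

```python
# Inputs mirror the quoted SQL example; escapes keep the code points explicit.
CAPITAL_ALPHA = "\u0391"   # GREEK CAPITAL LETTER ALPHA
CAPITAL_SIGMA = "\u03a3"   # GREEK CAPITAL LETTER SIGMA
SHARP_S = "\u00df"         # LATIN SMALL LETTER SHARP S (eszett)

samples = {
    # Sigma after a non-letter lowercases to the ordinary small sigma.
    "lower_sigma": ("0" + CAPITAL_SIGMA).lower(),
    # Sigma at the end of a word lowercases to the final small sigma.
    "lower_final_sigma": (CAPITAL_ALPHA + CAPITAL_SIGMA).lower(),
    # Full case mapping expands eszett to two characters.
    "upper_eszett": SHARP_S.upper(),
}


def preserves_length(text: str) -> bool:
    """Hypothetical helper: True if uppercasing keeps the code point count."""
    return len(text.upper()) == len(text)


if __name__ == "__main__":
    for name, value in samples.items():
        print(f"{name}: {value}")
    print("strasse keeps length:", preserves_length("strasse"))
    print("stra\u00dfe keeps length:", preserves_length("stra\u00dfe"))
```

Under full case mapping, the word containing an eszett grows by one code
point when uppercased, so any code that assumes a fixed length across case
conversion has to deal with that.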

> There's also another reason to consider it an argument rather than a
> collation property, which is that it might be dependent on some other
> field in a row. I could imagine someone wanting to do:
>
> SELECT
>   UPPER(some_field,
>         full => true,
>         dotless_i => CASE other_field WHEN ...)
> FROM ...

Can you index this usefully? It would only work if the user's query
matched exactly this pattern.

> That makes sense for a function in the target list, because different
> customers might be from different locales and therefore want different
> treatment of the dotted-vs-dotless-i.

There is also the concept of a session collation, which we haven't
implemented, but it would address this kind of use. There again, the
problem is indexing, though maybe indexing isn't as important for case
conversion as it is for sorting.
