Re: Add CASEFOLD() function.

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Joe Conway <mail(at)joeconway(dot)com>, Ian Lawrence Barwick <barwick(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Add CASEFOLD() function.
Date: 2024-12-16 17:49:22
Message-ID: c3801da038f8f59ee63deb9104af1147d49454c8.camel@j-davis.com
Lists: pgsql-hackers

On Thu, 2024-12-12 at 13:55 -0500, Joe Conway wrote:
> > I don't have a strong opinion here so I will just go with whatever
> > seems like the popular choice.
>
>
> FWIW I prefer casefold()

Done. I just noticed that it now matches $SUBJECT. The fact that my
code didn't match the email subject before further supports the idea
that "foldcase" was never quite as natural -- so I agree that
"casefold" is the way to go.

One question I have is whether we want this function to normalize the
output.

I believe most use cases would want the output normalized, because
normalization differences (e.g. "a" U+0061 followed by "combining
acute" U+0301 vs "a with acute" U+00E1) are more minor than differences
in case.
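
As a rough illustration of that difference (sketched in Python rather
than SQL; str.casefold() and unicodedata.normalize() from the standard
library stand in for the proposed function, and the variable names are
made up for the example):

```python
import unicodedata

# "A" followed by combining acute (U+0301) vs. precomposed
# "A with acute" (U+00C1): same text, different normalization and case.
decomposed = "A\u0301"
precomposed = "\u00C1"

folded_decomposed = decomposed.casefold()
folded_precomposed = precomposed.casefold()

# Case folding alone leaves the normalization difference in place.
folding_alone_matches = folded_decomposed == folded_precomposed

# Normalizing the folded output (NFC here) makes the two compare equal.
nfc_matches = (
    unicodedata.normalize("NFC", folded_decomposed)
    == unicodedata.normalize("NFC", folded_precomposed)
)

print(folding_alone_matches, nfc_matches)
```

Without the normalization step the two folded strings still compare
unequal, which is probably not what someone reaching for a case-folding
function expects.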

Of course, a user could wrap it with the normalize() function, but
that's verbose and easy to forget. I'm also not sure that it can be
made as fast as a combined function that does both.

And a follow-up question: if it does normalize, the second parameter
would be the requested normal form. But to accept the keyword forms
(NFC, NFD in gram.y) rather than the string forms ('NFC', 'NFD'),
we'd also need to add CASEFOLD to gram.y (like NORMALIZE). Is
that a reasonable thing to do?
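
To make the shape of such a two-argument function concrete, here is a
minimal sketch in Python, not the patch itself. The name
casefold_normalized, the default of NFC, and the set of accepted forms
are all assumptions for illustration, and it takes the string forms
only, since the keyword forms are a grammar matter with no analogue
here:

```python
import unicodedata

# Hypothetical set of accepted normal forms for this sketch.
_NORMAL_FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def casefold_normalized(text: str, form: str = "NFC") -> str:
    """Case-fold text, then normalize the result to the requested form.

    Illustrative only: the name, the default form, and the order of
    operations are assumptions, not the behavior of the attached patch.
    """
    if form not in _NORMAL_FORMS:
        raise ValueError(f"unknown normal form: {form!r}")
    # Fold first, then normalize, so the output is in the requested
    # form even when folding changes the sequence of code points.
    return unicodedata.normalize(form, text.casefold())


print(casefold_normalized("A\u0301") == casefold_normalized("\u00C1"))
```

A caller who is happy with the default never has to think about the
second argument, which is the ergonomic argument for folding and
normalizing in one call.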

Regards,
Jeff Davis

Attachment Content-Type Size
v2-0001-Add-support-for-Unicode-case-folding.patch text/x-patch 569.3 KB
v2-0002-Add-SQL-function-CASEFOLD.patch text/x-patch 13.9 KB
