Re: Order changes in PG16 since ICU introduction

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Regina Obe <lr(at)pcorp(dot)us>, Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, Sandro Santilli <strk(at)kbt(dot)io>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Order changes in PG16 since ICU introduction
Date: 2023-04-21 20:33:48
Message-ID: CA+TgmobxgvonJT8vO_v0k9_rgDmn0CKvTe48Z=w95rrBqC95KA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Apr 21, 2023 at 3:25 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> I am also having second thoughts about accepting "C" or "POSIX" as an
> ICU locale and transforming it to "en-US-u-va-posix" in v16. It's not
> terribly useful (why not just use memcmp()?), it's not fast in my
> measurements (en-US is faster), so maybe it's better to just throw an
> error and tell the user to use C (or provider=none as I suggest
> above)?

I mean, to renew a complaint I've made previously, how the heck is
anyone supposed to understand what's going on here?

We have no meaningful documentation of how to select an ICU locale
that works for you. We have a couple of examples and a suggestion that
you should use BCP 47. But when I asked before for documentation
references, the ones you provided were not clear, basically
incomprehensible. In follow-up discussion, you admitted you'd had to
consult the source code to figure certain things out.

And the fact that "C" or "POSIX" gets transformed into
"en-US-u-va-posix" is also completely documented. That string appears
twice in the code, but zero times in the documentation. There's code
to do it, but users shouldn't have to read code, and it wouldn't help
much if they did, because the code comments don't really explain the
rationale behind this choice either.

I find the fact that people are having trouble here completely
predictable. Of course if people ask for "C" and the system tells them
that it's using "en-US-u-va-posix" instead they're going to be
confused and ask questions, exactly as is happening here. glibc
collations aren't particularly well-documented either, but people have
some experience with, and they can get a list of values that have a
chance of working from /usr/share/locale, and they know what "C"
means. Nobody knows what "en-US-u-va-posix" is. It's not even
Googleable, really, whereas "C locale" is.

My opinion is that the switch to using ICU by default is ill-advised
and should be reverted. The compatibility break isn't worth whatever
advantages ICU may have, the documentation to allow people to
transition to ICU with reasonable effort doesn't exist, and the fact
that within weeks of feature freeze people who know a lot about
PostgreSQL are struggling to get the behavior they want is a really
bad sign.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2023-04-21 20:33:58 Re: optimize several list functions with SIMD intrinsics
Previous Message Nathan Bossart 2023-04-21 20:33:17 Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada