Re: Windows default locale vs initdb

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Juan José Santamaría Flecha <juanjo(dot)santamaria(at)gmail(dot)com>, Ertan Küçükoglu <ertan(dot)kucukoglu(at)gmail(dot)com>
Subject: Re: Windows default locale vs initdb
Date: 2024-08-07 04:15:30
Message-ID: CA+hUKGJ=ca39Cg=y=S89EaCYvvCF8NrZRO=uog-cnz0VzC6Kfg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 23, 2024 at 11:19 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
> On Tue, Jul 23, 2024 at 1:44 AM Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> > I have an environment I can use for testing. But what exactly am I
> > testing? :-) Install a few "problem" language/region settings, switch
> > the system and ensure initdb runs ok?

I thought a bit more about what to do with the messy .UTF-8 situation
on Windows, and I think I might see a way forward that harmonises the
code and behaviour with Unix, and deletes a lot of special case code.
But it's only theories + CI so far.

0001, 0002: As before, teach initdb.exe to choose e.g. "en-US" by default.

0003: Force people to choose locales that match the database
encoding, as we do on Unix. That is, forbid contradictory
combinations like --locale="English_United States.1252"
--encoding=UTF8, which are currently allowed (and the world is full of
such database clusters because that is how the EDB installer GUI makes
them). The only allowed combinations for American English should now
be: --locale="en-US" --encoding="WIN1252", and --locale="en-US.UTF-8"
--encoding="UTF8". You can still use the old names if you like, by
explicitly writing --locale="English_United States.1252", but the
encoding then has to be WIN1252. It's crazy to mix them up; let's ban
that.
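
To make the intended rule concrete, here is a rough sketch of the kind
of check 0003 implies. It is not the code from the patch: the function
names are made up for illustration, and it only looks at a ".UTF-8"
suffix on the locale name, whereas a real check would also have to
compare the locale's actual code page against non-UTF-8 encodings.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Case-insensitive equality for plain ASCII names (hypothetical helper). */
static bool
ascii_equal_ci(const char *a, const char *b)
{
    while (*a && *b)
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return false;
        a++;
        b++;
    }
    return *a == *b;
}

/* Assumption: a locale is "UTF-8" iff its name ends in .UTF-8 or .UTF8. */
static bool
locale_name_is_utf8(const char *locale)
{
    const char *dot = strrchr(locale, '.');

    if (dot == NULL)
        return false;
    return ascii_equal_ci(dot + 1, "UTF-8") || ascii_equal_ci(dot + 1, "UTF8");
}

/* Assumption: only these two spellings of the encoding name are recognised. */
static bool
encoding_name_is_utf8(const char *encoding)
{
    return ascii_equal_ci(encoding, "UTF8") || ascii_equal_ci(encoding, "UTF-8");
}

/*
 * Simplified rule: the locale and the encoding must agree on whether
 * they are UTF-8.  Mismatches among non-UTF-8 code pages are NOT caught.
 */
static bool
locale_matches_encoding(const char *locale, const char *encoding)
{
    return locale_name_is_utf8(locale) == encoding_name_is_utf8(encoding);
}
```

With this sketch, "en-US.UTF-8" with UTF8 and "English_United
States.1252" with WIN1252 are accepted, while "English_United
States.1252" with UTF8 is rejected.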

Obviously there is a pg_upgrade case to worry about there. We'd have
to "fix" the now-illegal combinations, and I don't know exactly how
yet.

0004: Rip out the code that does extra wchar_t conversions for
collations. If I've understood correctly, we don't need them: if you
have a .UTF-8 locale then your encoding is UTF-8 and we should be able
to use strcoll_l() directly. Right?
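
For illustration, the direct approach could look roughly like the
sketch below. This is not the patch itself: the sketch_* names and
collate_in_locale() are invented here, and the Windows branch simply
assumes the MSVC runtime's underscore-prefixed equivalents
(_create_locale(), _strcoll_l(), _free_locale()) behave as wanted for
.UTF-8 locales, which is exactly the part that needs testing on a real
system.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for newlocale() and strcoll_l() */
#endif

#include <locale.h>
#include <string.h>
#ifdef __APPLE__
#include <xlocale.h>
#endif

#ifdef _WIN32
/* The MSVC runtime spells the same operations with a leading underscore. */
typedef _locale_t sketch_locale_t;
#define sketch_open_locale(name)  _create_locale(LC_COLLATE, (name))
#define sketch_close_locale(loc)  _free_locale(loc)
#define sketch_strcoll_l          _strcoll_l
#else
typedef locale_t sketch_locale_t;
#define sketch_open_locale(name)  newlocale(LC_COLLATE_MASK, (name), (locale_t) 0)
#define sketch_close_locale(loc)  freelocale(loc)
#define sketch_strcoll_l          strcoll_l
#endif

/*
 * Compare two NUL-terminated strings under the named locale, passing the
 * bytes straight to strcoll_l() with no wchar_t round trip.  Returns 0 and
 * stores the comparison result in *result, or -1 if the locale can't be
 * opened.
 */
static int
collate_in_locale(const char *locale_name, const char *a, const char *b,
                  int *result)
{
    sketch_locale_t loc = sketch_open_locale(locale_name);

    if (loc == (sketch_locale_t) 0)
        return -1;
    *result = sketch_strcoll_l(a, b, loc);
    sketch_close_locale(loc);
    return 0;
}
```

A caller would pass a name such as "en-US.UTF-8" on Windows or
"en_US.UTF-8" on Unix together with two UTF-8 encoded strings. Real
code would open the locale once and cache it, not reopen it for every
comparison.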

0005: Something similar was being done for strftime(). And we might
as well use strftime_l() instead while we're here (part of the general
movement to use _l functions and stop splattering setlocale() all over
the place, for the multithreaded future).
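
As a sketch of the _l style for time formatting (again not the patch
code; format_time_in_locale() is a hypothetical helper, and the Windows
branch assumes _strftime_l() from the MSVC runtime):

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for newlocale() and strftime_l() */
#endif

#include <locale.h>
#include <stddef.h>
#include <time.h>
#ifdef __APPLE__
#include <xlocale.h>
#endif

/*
 * Format *tm into buf using the named locale, without touching the
 * process-wide locale via setlocale().  Returns the number of bytes
 * written (excluding the terminating NUL), or 0 if the locale could not
 * be opened or the result did not fit.
 */
static size_t
format_time_in_locale(const char *locale_name, const char *format,
                      const struct tm *tm, char *buf, size_t buflen)
{
    size_t      n;

#ifdef _WIN32
    _locale_t   loc = _create_locale(LC_TIME, locale_name);

    if (loc == NULL)
        return 0;
    n = _strftime_l(buf, buflen, format, tm, loc);
    _free_locale(loc);
#else
    locale_t    loc = newlocale(LC_TIME_MASK, locale_name, (locale_t) 0);

    if (loc == (locale_t) 0)
        return 0;
    n = strftime_l(buf, buflen, format, tm, loc);
    freelocale(loc);
#endif
    return n;
}
```

Because the locale is passed explicitly, two threads can format times
in different locales at once, which is not safe with the setlocale()
save-and-restore pattern.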

These patches pass on CI. Do they give the expected results when used
on a real Windows system?

There are a few more places where we do wchar_t conversions that could
probably be stripped out too, if my assumptions are correct, and we
could dig further if the basic idea can be validated and people think
this is going in a good direction.

Attachment                                                       Content-Type  Size
v6-0001-MinGW-has-GetLocaleInfoEx.patch                          text/x-patch  1.4 KB
v6-0002-Default-to-IETF-BCP-47-locale-names-in-initdb-on-.patch  text/x-patch  4.2 KB
v6-0003-Don-t-allow-UTF-8-with-non-UTF-8-locales-on-Windo.patch  text/x-patch  4.5 KB
v6-0004-Collate-UTF-8-without-wchar_t-conversion-in-Windo.patch  text/x-patch  3.7 KB
v6-0005-Format-times-without-wchar_t-conversion-in-Window.patch  text/x-patch  9.0 KB
