Re: [EXTERNAL] Re: Windows Application Issues | PostgreSQL | REF # 48475607

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Sandeep Thakkar <sandeep(dot)thakkar(at)enterprisedb(dot)com>
Cc: "Haifang Wang (Centific Technologies Inc)" <v-haiwang(at)microsoft(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rahul Pandey <pandeyrah(at)microsoft(dot)com>, Vishwa Deepak <Vishwa(dot)Deepak(at)microsoft(dot)com>, Shawn Steele <Shawn(dot)Steele(at)microsoft(dot)com>, Amy Wishnousky <amyw(at)microsoft(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Shweta Gulati <gulatishweta(at)microsoft(dot)com>, Ashish Nawal <nawalashish(at)microsoft(dot)com>
Subject: Re: [EXTERNAL] Re: Windows Application Issues | PostgreSQL | REF # 48475607
Date: 2024-09-19 23:37:37
Message-ID: CA+hUKGLfrK33XpFXsRcc97a1Qa5Vz1YFEn4GC1vie7yse=ffPA@mail.gmail.com
Lists: pgsql-bugs

On Thu, Sep 5, 2024 at 11:46 PM Sandeep Thakkar
<sandeep(dot)thakkar(at)enterprisedb(dot)com> wrote:
> On Thu, Sep 5, 2024 at 5:46 AM Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote:
>> Really what I'm looking for is (1) feedback on the approach, code and
>> comments, and thoughts about more complex scenarios I may have failed
>> to think about, including say, pg_dump, pg_upgrade etc operational
>> issues, which probably involves lots of previous experience with
>> PostgreSQL, (2) opinions on whether we should add a test for these
>> cases and how to put the UTF-8 into a script (I'm confused about the
>> encoding of command line arguments), and (3) a nod from the EDB people
>> involved in distributing this software on Windows.

If I don't hear any objections to this plan soon, I'm going to commit
this and back-patch it into PostgreSQL 16 and PostgreSQL 17 after the
upcoming code freeze for the PostgreSQL 17 release ends. So it'll
probably be in 16.5 and 17.1.

> We can help with producing the builds with the patches provided. You had also
> mentioned about the changes required in the installer script, will it still be required?

If you don't change the installer script, then it will still fail if
someone selects "Türkiye" in your GUI, but now it will fail with an
ERROR rejecting non-ASCII characters, instead of crashing. So people
in Türkiye, Côte d'Ivoire, Curaçao etc will still have no way to
initialise a cluster with your GUI in PostgreSQL 16.5 and 17.1 unless
they follow the instructions on the web to create a "Turkey" (or
whatever ASCII string they want). Of course they could always use
initdb.exe directly from the command line with a BCP47 name. Maybe
that's OK, but I think you should consider changing the installer. A
conservative way to do it would be to show all the existing options
that you have now (so that someone who is happy using the old style
names when they don't contain non-ASCII can keep doing so), but also
have a second entry for each country that shows "Turkish, Türkiye
(tr-TR)" and/or perhaps "Turkish, Türkiye (tr-TR.UTF-8)" or perhaps
both, and passes just that part in parentheses to initdb, to give
users all the options. Or perhaps you could have a checkbox "BCP 47
locales" that changes the list to show them.
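
To make the "passes just that part in parentheses to initdb" idea
concrete, here is a minimal sketch in Python. It is not taken from the
EDB installer, whose code is not shown in this thread: the label list
and the helper names locale_from_label() and initdb_args() are
hypothetical. Only the initdb options -D, --locale and --encoding are
real. The ASCII test merely mirrors the spirit of the new check
described above.

```python
import re

# Hypothetical GUI labels in the "Language, Country (tag)" shape suggested
# above; the real installer's list is an assumption here.
LABELS = [
    "Turkish, Türkiye (tr-TR)",
    "Turkish, Türkiye (tr-TR.UTF-8)",
]


def locale_from_label(label: str) -> str:
    """Return the text inside the trailing parentheses, e.g. 'tr-TR.UTF-8'."""
    match = re.search(r"\(([^()]+)\)\s*$", label)
    if match is None:
        raise ValueError(f"no locale name in parentheses: {label!r}")
    name = match.group(1)
    # Refuse non-ASCII names up front instead of handing them to initdb.
    if not name.isascii():
        raise ValueError(f"locale name is not pure ASCII: {name!r}")
    return name


def initdb_args(label: str, datadir: str) -> list:
    """Build an initdb argument list (built only, never executed here)."""
    return [
        "initdb",
        "-D", datadir,
        f"--locale={locale_from_label(label)}",
        "--encoding=UTF8",
    ]
```

The display text can then keep "Türkiye" for the user, while only the
ASCII BCP 47 tag ever reaches the command line.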

No one has really reported any real world experience choosing between
the tr-TR vs tr-TR.UTF-8 alternatives, and you might like to
experiment with that. The second option makes Windows' system
libraries use UTF-8 encoding instead of the traditional encoding
associated with the language. As far as I can tell, it doesn't make
any difference at all to PostgreSQL yet, because your installer always
uses --encoding="UTF8" and, on Windows only, that makes PostgreSQL
ignore the locale's encoding and do a whole lot of internal
conversion to wchar_t because PostgreSQL doesn't yet know that
Windows 10+ can work with UTF-8 directly.

The reason that I am interested in this .UTF-8-or-not question is that
I'd like to consider *disallowing* non-matching encodings (see
commitfest entry #3772, reviewers wanted!), and teaching PostgreSQL
that Windows does in fact have UTF-8, just so we can delete a lot of
slow special case code, harmonise with Unix, and generally catch up
with reality. So I figure we might as well start encouraging the
"xx-XX.UTF-8" names when using --encoding="UTF8" if we can't find any
downside. Under that plan, if it goes in, it would eventually become
illegal to use --locale="tr-TR" (no .UTF-8) with --encoding="UTF8", so
it seems sensible to stop creating new clusters that way ASAP so that
users have a better time upgrading in the future. For example, a
pg_upgrade from a PostgreSQL 17 cluster
initialised with --locale="tr-TR" --encoding="UTF8" to PostgreSQL 18
would probably require some extra step to rename "tr-TR" to
"tr-TR.UTF-8" at some point (not sure exactly where), if PostgreSQL 18
starts rejecting the non-matching combination. I don't know where
that'll go, though -- it's not high priority work, it's just
incremental cleanup and modernisation that practically suggests itself
whenever looking at rejigging locale code for thread-safety and
reading all those comments about wchar_t that are not true.
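
As a rough illustration of what a "non-matching" rule and the rename
step might look like, here is a small Python sketch. It is an
assumption for discussion only, not the logic in the commitfest entry
or in pg_upgrade: the function names are hypothetical, and a name
without a codeset suffix is simply treated as non-matching, following
the "tr-TR" (no .UTF-8) example above.

```python
from typing import Optional


def normalise(name: str) -> str:
    """Compare codeset and encoding names ignoring case and punctuation."""
    return "".join(ch for ch in name if ch.isalnum()).upper()


def locale_codeset(locale_name: str) -> Optional[str]:
    """Return the part after '.', or None if there is no codeset suffix."""
    _, dot, codeset = locale_name.partition(".")
    return codeset if dot else None


def matches_encoding(locale_name: str, encoding: str) -> bool:
    """True only if the locale name carries a suffix equal to the encoding."""
    codeset = locale_codeset(locale_name)
    return codeset is not None and normalise(codeset) == normalise(encoding)


def with_utf8_suffix(locale_name: str) -> str:
    """Rename 'tr-TR' to 'tr-TR.UTF-8'; leave already-suffixed names alone."""
    return locale_name if "." in locale_name else locale_name + ".UTF-8"
```

Under these assumptions, matches_encoding("tr-TR", "UTF8") is false and
matches_encoding(with_utf8_suffix("tr-TR"), "UTF8") is true, which is
the kind of rename an upgrade step would need to perform.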
