RE: [EXTERNAL] Re: Windows Application Issues | PostgreSQL | REF # 48475607

From: "Haifang Wang (Centific Technologies Inc)" <v-haiwang(at)microsoft(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Vishwa Deepak <Vishwa(dot)Deepak(at)microsoft(dot)com>
Cc: "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: [EXTERNAL] Re: Windows Application Issues | PostgreSQL | REF # 48475607
Date: 2024-05-15 23:46:33
Message-ID: PH8PR21MB3902402C3C3C20DD8CB40AFFE5EC2@PH8PR21MB3902.namprd21.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Thanks for your questions, Thomas and Tom. + @Vishwa to help with technical questions.

-----Original Message-----
From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Sent: Monday, May 13, 2024 11:38 PM
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Haifang Wang (Centific Technologies Inc) <v-haiwang(at)microsoft(dot)com>; pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: [EXTERNAL] Re: Windows Application Issues | PostgreSQL | REF # 48475607

On Tue, May 14, 2024 at 11:07 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> +1 for the long-term solution being more-stable locale identifiers.
> However, we should try to build something that will let users get out
> of these situations with the existing identifiers, so I like your idea
> of a plain-text mapping file for Windows locale names.
> I don't think an environment variable is necessary; just define a
> fixed name "$PGDATA/locale_map.txt" or such. If that file exists,
> just read it and map the pg_database field values with it.

OK, I tried that, first draft attached (with my standard proviso that I don't do Windows, I just know that this passes CI and that the code works the way I intended on my local Unix system if extracted into a little harness). With this, you could in theory create a file PGDATA/win32setlocale.map containing:

c Turkish_Turkey.1254=Turkish_Türkiye.1254

... or perhaps more likely:

c Turkish_Turkey.1254=tr-TR.1254

I also absorbed the pre-existing kludge table into the new system by default (though they got a bit shorter 'cause I invented some wildcards). Some problems came up while wondering how to fit Türkiye into the defaults, and how to back-patch:

1. In the back-branches, we claim to support ancient Windows releases as far back as "Windows 2000 SP4" (!), which obviously aren't getting the Windows updates, so I guess "Turkish_Türkiye.1254" will fail there and generally before Windows 10. And even if you exclude the extremities of our support window somehow (how?), modern systems might not have applied the update yet (IIUC they *have* to at some point under the new world order, so there is a defined window of version skew these days).

2. It's generally a terrible idea to be using "ü" in a locale name.
FWIW I assume setlocale() actually accepts and returns names encoded in the current ACP ("active codepage", system-wide changeable setting that controls char↔wchar_t conversion in system APIs), so the encoding of that file (and the built-in default table) would need to match that to work, as coded. Perhaps it would be possible to make the mapping file UTF-8 and transform that to ACP! But it feels a bit too loopy for me, and on the PostgreSQL side it is undefined/illegal whatever you choose in PostgreSQL due to being accessed from different databases which are using potentially different encodings that are only required to be a superset of ASCII. Avoid.

3. Therefore you'd probably want to prefer "tr-TR.1254" as the replacement string. But what is the oldest Windows release that can understand a BCP47 code like that?

4. Conversely, on modern systems, I'm still not entirely sure that "tr-TR.1254" is exactly the same thing as "Turkish-XXX.1254" and that it's OK to put ".1254" on the end like that. Is it, and is it? I don't mean just "does it mean Turkish?", I mean "does it give exactly the same answer for every conceivable pair of strings when compared with strcoll_l(), and likewise for the ctype-based functions like
towlower() et al".

If the answers are not in our favour, I guess we could leave the default behaviour unchanged, and let people set up a text file as shown above to fix their database if they want, but that's also not very nice and kinda weird (helping hypothetical users of museum-grade systems by leaving real users' systems broken).

If the answer to 4 is yes, yes then we could also push ahead with the plan to make initdb pick BCP47 names by default in PG18 (or even 17).

> Maybe this shouldn't even be Windows-specific? Are there any cases
> where it'd save people's bacon on other platforms?

Good question. Sometimes ISO code go away or countries split etc, so it's no like POSIX locale names are set in stone under all circumstances. But on Unixen it's all just files in practice, you can always just symlink them, move them around, compile them yourself from sources, etc, if you really have to, so I think I'd rather contain the crazy in win32*.c.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Bowen Shi 2024-05-16 02:21:59 Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Previous Message WU Yan 2024-05-15 23:03:24 Re: BUG #18466: Wrong row estimate for nested loop