Re: Windows installation problem at post-install step

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Windows installation problem at post-install step
Date: 2024-08-06 21:08:44
Message-ID: CA+hUKGLqXSbk9FrfWVvA-fXP7p1+r6WLnY9Xr_EpYasOaBf9dA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, Aug 6, 2024 at 11:44 PM Peter J. Holzer <hjp-pgsql(at)hjp(dot)at> wrote:
> I assume that "1254" here is the code page.
> But you specified --encoding=UTF-8 above, so your default locale uses a
> different encoding than the template databases. I would expect that to
> cause problems if the template databases contain any charecters where
> the encodings differ (such as "ü" in the locale name).

It's weird, but on Windows, PostgreSQL allows UTF-8 encoding with any
locale, and thus apparent contradictions:

/* See notes in createdb() to understand these tests */
if (!(locale_enc == user_enc ||
locale_enc == PG_SQL_ASCII ||
locale_enc == -1 ||
#ifdef WIN32
user_enc == PG_UTF8 ||
#endif
user_enc == PG_SQL_ASCII))
{
pg_log_error("encoding mismatch");

... and createdb's comments say that is acceptable because:

* 3. selected encoding is UTF8 and platform is win32. This is because
* UTF8 is a pseudo codepage that is supported in all locales since it's
* converted to UTF16 before being used.

At the time PostgreSQL was ported to Windows, UTF-8 was not a
supported encoding in "char"-based system interfaces like strcoll_l(),
and the port had to convert to "wchar_t" interfaces and call (in that
example) wcscoll_l(). On modern Windows it is, and there are two
locale names, with and without ".UTF-8" suffix (cf. glibc systems that
have "en_US" and "en_US.UTF-8" where the suffix-less version uses
whatever traditional encoding was used for that language before UTF-8
ate the world).

If we were doing the Windows port today, we'd probably not have that
special case for Windows, and we wouldn't have the wchar_t
conversions. Then I think we'd allow only:

--locale=tr-TR (defaults to --encoding=WIN1254)
--locale=tr-TR --encoding=WIN1254
--locale-tr-TR.UTF-8
--locale=tr-TR.UTF-8 --encoding=UTF-8

If we come up with an automated (or even manual but documented) way to
perform the "Turkish_Türkiye.1254" -> "tr-TR" upgrade as Dave was
suggesting upthread, we'll probably want to be careful to tidy up
these contradictory settings. For example I guess that American
databases initialised by EDB's installer must be using
--locale="English_United States.1252" and --encoding=UTF-8, and should
be changed to "en-US.UTF-8", while those initialised by letting
initdb.exe pick the encoding must be using --locale="English_United
States.1252" and --encoding=WIN1252 (implicit) and should be changed
to "en-US" to match the WIN1252 encoding.

Only if we did that update would we be able to consider removing the
extra UTF-16 conversions that are happening very frequently inside
PostgreSQL code, which is a waste of CPU cycles and programmer sanity.
(But that's all just speculation from studying the locale code -- I've
never really used Windows.)

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Ron Johnson 2024-08-06 21:43:13 Re: Standard of data storage and transformation
Previous Message yudhi s 2024-08-06 21:07:24 Standard of data storage and transformation