Windows UTF8 system locale

From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Windows UTF8 system locale
Date: 2024-12-15 02:32:21
Message-ID: 20241215023221.4d.nmisch@google.com
Lists: pgsql-hackers

Since ~2019, Windows has had the option "Beta: Use Unicode UTF-8 for worldwide
language support". That option breaks the appendShellString() assumption that
it can escape every byte except '\0', '\r', and '\n'. Instead, process
creation injects U+FFFD REPLACEMENT CHARACTER (UTF-8: ef bf bd) for each byte
of the command line not forming valid UTF-8. Here's the Windows Server 2025
output from a test program that sends bytes 0x80..0xFF in a CreateProcessA()
command line:

argv[1] = 58 ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
GetCommandLineA() = 61 20 58 ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
GetCommandLineW() = 61 20 58 fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd fffd
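
The output above has the shape one would expect if the command line were decoded as UTF-8 with every invalid byte replaced by U+FFFD. The Python sketch below models that behavior for this particular input. It is not the test program used above, it does not call CreateProcessA(), and the function name model_utf8_acp_argument is made up for illustration. Treating Python's "replace" error handler as a stand-in for the Windows conversion is an assumption.

```python
def model_utf8_acp_argument(raw: bytes) -> bytes:
    """Model (assumption, not the Windows API): decode as UTF-8, substitute
    U+FFFD for each byte that does not form valid UTF-8, re-encode as UTF-8."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


# Same shape as the test above: 'X' followed by bytes 0x80..0xFF.
sent = b"X" + bytes(range(0x80, 0x100))
received = model_utf8_acp_argument(sent)

# Print in the same hex style as the report above.
print("argv[1] =", " ".join(f"{b:02x}" for b in received))
```

Under this model, arguments that are already valid UTF-8 pass through unchanged, so only non-UTF8 bytes are lost.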

For PostgreSQL, I expect the most obvious problems will arise for rolname and
datname containing non-UTF8 bytes. For example, pg_dumpall relies on
appendShellString() to call pg_dump for arbitrary datname. pg_dumpall would
get "database ... does not exist". Some ways we might react:

1. Instead of arbitrary bytes in argv[], use a temporary PGSERVICEFILE. For
other kinds of appendShellString() input (mostly file paths), we could
provide other ways to pass them outside argv, or we could just not support
the full character repertoire in those. Windows "8.3 filenames" are a fair
workaround.

2. Just fail if the system option is enabled and we would appendShellString()
a non-UTF8 value.

3. Fail if we find U+FFFD in arguments. U+FFFD is itself valid Unicode,
though, so this would also reject legitimate input.
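
Options 2 and 3 can be sketched as two small checks. This is an illustration in Python, not PostgreSQL code: the function names are hypothetical, the caller is assumed to already know the active code page, and Python's strict UTF-8 decoder is assumed to agree with Windows about which byte sequences are valid.

```python
UTF8_CODE_PAGE = 65001  # Windows code page identifier for UTF-8


def is_valid_utf8(value: bytes) -> bool:
    """Return True if value decodes as strict UTF-8."""
    try:
        value.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def check_outgoing_argument(value: bytes, active_code_page: int) -> None:
    """Option 2 (sketch): refuse to build a command line that would be altered."""
    if active_code_page == UTF8_CODE_PAGE and not is_valid_utf8(value):
        raise ValueError("argument is not valid UTF-8 under a UTF-8 system locale")


def check_incoming_argument(value: str) -> None:
    """Option 3 (sketch): refuse arguments that contain U+FFFD.

    This also rejects input where the sender really meant U+FFFD.
    """
    if "\ufffd" in value:
        raise ValueError("argument contains U+FFFD REPLACEMENT CHARACTER")


# A Latin-1 encoded name is fine under code page 1252 ...
check_outgoing_argument(b"caf\xe9", 1252)
# ... but would be rejected when the UTF-8 option is enabled.
try:
    check_outgoing_argument(b"caf\xe9", UTF8_CODE_PAGE)
except ValueError as err:
    print("rejected:", err)
```

The outgoing check fails early on the sending side, while the incoming check can only guess, since the receiver cannot tell an injected U+FFFD from a genuine one.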

I plan not to work on this myself, and I'm not advocating this as a priority
to anyone else. I'm just sending this to record what I learned, in case it
helps someone for whom it does become a priority.

https://stackoverflow.com/a/57134096/16371536 shows how to enable the option.
I'd be interested to hear test results with that enabled. My hypothesis is
that 010_dump_connstr.pl and 200_connstr.pl would fail. (My Windows
development environments are all too old, and I stopped short of building a
new one for this.) It should also be possible to test this in CI by building
an image with the following https://github.com/anarazel/pg-vm-images.git
modification:

--- a/scripts/windows_install_dbg.ps1
+++ b/scripts/windows_install_dbg.ps1
@@ -9,6 +9,15 @@ mkdir c:\t
 cd c:\t
 
 
+echo "enabling UTF8"
+Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' `
+ -Name 'ACP' -Value '65001'
+Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' `
+ -Name 'OEMCP' -Value '65001'
+Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage' `
+ -Name 'MACCP' -Value '65001'
+
+
 echo "configuring windows error reporting"
 
 # prevent windows error handling dialog from causing hangs
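
Before running the tests, it may help to confirm that the option is actually in effect on the machine. The sketch below queries the active code pages through the Win32 functions GetACP() and GetOEMCP() via ctypes. The helper name active_code_pages is made up for illustration, and the registry change presumably takes effect only after a restart.

```python
import sys

UTF8_CODE_PAGE = 65001  # Windows code page identifier for UTF-8


def active_code_pages():
    """Return (ACP, OEMCP) on Windows, or None on other platforms."""
    if sys.platform != "win32":
        return None
    import ctypes  # ctypes.windll exists only on Windows

    kernel32 = ctypes.windll.kernel32
    return kernel32.GetACP(), kernel32.GetOEMCP()


pages = active_code_pages()
if pages is None:
    print("not Windows; nothing to check")
else:
    acp, oemcp = pages
    print(f"ACP={acp} OEMCP={oemcp} UTF-8 option active: {acp == UTF8_CODE_PAGE}")
```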
