Re: Reject Windows command lines not convertible to system locale

From: Noah Misch <noah(at)leadboat(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reject Windows command lines not convertible to system locale
Date: 2024-12-15 05:05:53
Message-ID: 20241215050553.45.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Dec 15, 2024 at 04:50:57PM +1300, Thomas Munro wrote:
> On Sun, Dec 15, 2024 at 3:27 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > yielding mojibake
>
> Thank you for this magnificent terminology.

Thank you for reviewing.

> > What we can do safely is exit if we could not convert the command line to the
> > Windows ANSI code page (losslessly). Patch attached.
>
> This seems to make sense generally, since (based only on googling, I
> am not a Windows person and I don't even play one on TV), it doesn't
> sound like there is a way for the system to tell you that the
> conversion was lossy.
>
> For the all-ASCII test, why not us pg_is_ascii()?

pg_is_ascii() takes a char[], but this has a wchar_t[]. Maybe I should wrap
that part in a pg_wcs_is_ascii()? (We can't use GetCommandLineA(), because
"best fit" may have converted non-ASCII to ASCII.)

> I wonder if there are any legitimate uses that would fail spuriously
> due to the minutiae of normalisation, surrogates, etc. Or perhaps
> that type of thing is part of the problem you're talking about.

I wouldn't rule it out. My original plan was just to reject any wchar_t above
0xFF. That breaks innocent command lines. If GetACP()==1252, then
CreateProcessA(... 0x80 ...) puts 0x20ac (UTF-16 EURO SIGN) in
GetCommandLineW(). The conversion to GetCommandLineA() arrives back at 0x80,
so there's no reason to reject it.

> > As I mention in a code comment, I decided not to do anything about the similar
> > hazard in GetEnvironmentStringsW(). A possible alternative would be to check
> > each var referenced in PostgreSQL code or each var where the name begins with
> > "PG". Should we use one of those or another alternative?
>
> I would vote for having a table of variables we know about and
> checking them too, just on principle (it's not the characters the user
> intended). Checking every variable might reject legitimate UTF-16
> communication between parent and sub-processes other than ours, and
> even assuming that we own everything beginning "PG" seems an
> unnecessary overreach, which seems to leave only the known list idea.

Works for me. It's a little weird for psql to complain about a PGDATA value,
but the condition should be rare and easy enough for the user to fix. If
nobody else dislikes that plan, I'll give it a try. Perhaps I'd add a
src/tools/RELEASE_CHANGES task to check for new vars. Alternatively, some
check-world test could verify names appearing in getenv calls or
PQconninfoOptions.envvar fields. I don't think we have a source-tree-scanning
test like that today, but I wouldn't object to starting. We've talked about a
pgindent test, so visiting source code from a test is plausible.

> > The Perl testing calls powershell, which is new for our source tree. The
> > Internet thinks Windows Server 2008 R1 is the newest Windows where powershell
> > was optional. The buildfarm's oldest Windows is Windows Server 2008 R2. If
> > powershell is bad, there's likely a hairy way to write the test in any of
> > cmd.exe, cscript, Perl Win32::API, or C. My first choice was Perl
> > Win32::Process or IPC::Run. This test needs CreateProcessW(), but those two
> > use CreateProcessA().
>
> Sounds OK from here if it's the only sane way, and you could always skip?

A C program would be another sane way, but I would indeed opt to skip on
16-year-old Windows before rewriting in C.

> > I didn't test a CP_UTF8 Windows system. The new test coverage has a condition
> > intended to avoid failure there. (My Windows development environments are all
> > too old, and I stopped short of building a new one for this.) CP_UTF8 Windows
> > raises broader issues. I'll start a separate thread about them.
>
> Is it really still in beta?

The word "beta" appears next to its checkbox. That's all I know!

> Yeah, I can't immediately think of a good
> reason not to treat it the same as everything else and create more
> code paths, even if we think it should always succeed (but I'll look
> out for your thread, I'm quite interested in this general topic of
> "can we make a lot of Windows encoding troubles go away that way").
> It's a few extra CPU cycles but I guess it doesn't matter much.

postgr.es/m/flat/20241215023221(dot)4d(dot)nmisch(at)google(dot)com is that thread. So far,
it's only about new breakage.

> FWIW, coincidentally, I assume, I wrote about environ and argv
> encoding problems last week, including a test program showing wchar_t
> <-> ACP weirdness before main() among other interesting related
> effects[1].

Nice. I had made a note to look at that thread eventually, but I didn't know
Windows had come up there. My observations have been consistent with your
Windows findings in [1].

> (My focus has been on how we sort out ACP vs database
> encoding and block stuff that is b0rked in *our* code, ie below
> main(), because we don't understand the platform's encoding model, not
> so much the stuff that is already b0rked in all non-UTF8, non-wmain()
> Windows programs in general before even laying hands on the program
> counter. I had got as far as wondering out loud if forcing or
> strongly recommending UTF-8 ACP should be on the agenda to head off
> various related problems and generally make Windows not so different

My initial impression is not great, since it's painful with any database
having encoding!=UTF8. The nice thing about every other ACP is that you can
treat it as a POSIX-style, encoding-oblivious sequence of bytes. (Side note:
I don't understand how that is the case for code page 932 and other variable
length, non-UTF* code pages. It seems to work, but maybe I just haven't found
the byte sequences that break.) My hypothesis is that GetACP()==CP_UTF8 would
make PostgreSQL cooperate better with rest of the Windows world, but it would
make PostgreSQL cooperate worse with other POSIX transplants. Again, that's
just an initial impression.

> [1] https://www.postgresql.org/message-id/CA%2BhUKGL%3DF0pSLLf3nLpA_-sBwYsLg7s%3DFD6YFo_PDvS84FE_hw%40mail.gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dilip Kumar 2024-12-15 05:15:17 Re: Skip collecting decoded changes of already-aborted transactions
Previous Message Thomas Munro 2024-12-15 03:50:57 Re: Reject Windows command lines not convertible to system locale