Robocopy might be not robust enough for never-ending testing on Windows

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Robocopy might be not robust enough for never-ending testing on Windows
Date: 2024-09-14 13:00:00
Message-ID: ae26fb03-30f6-790d-fbc4-240b30af79de@gmail.com
Lists: pgsql-hackers

Hello hackers,

While trying to reproduce the inexplicable drongo failures (e.g., [1])
seemingly related to BackgroundPsql, I stumbled upon a similar, but not
identical, issue. After many (6-8 thousand) iterations of the
015_stream.pl TAP test, psql failed to start with a STATUS_DLL_INIT_FAILED
error and a corresponding Windows popup that prevented the test from
exiting (see ss-1 in the attached archive).

Once the system reaches that state, subsequent test runs fail with one
or another memory-related error (not only in psql, but also in the
server processes):
testrun/subscription_13/015_stream/log/015_stream_publisher.log:2024-09-11 20:01:51.777 PDT [8812] LOG:  server process (PID 11532) was terminated by exception 0xC00000FD
testrun/subscription_14/015_stream/log/015_stream_subscriber.log:2024-09-11 20:01:19.806 PDT [2036] LOG:  server process (PID 10548) was terminated by exception 0xC00000FD
testrun/subscription_16/015_stream/log/015_stream_publisher.log:2024-09-11 19:59:41.513 PDT [9128] LOG:  server process (PID 14476) was terminated by exception 0xC0000409
testrun/subscription_19/015_stream/log/015_stream_publisher.log:2024-09-11 20:03:27.801 PDT [10156] LOG:  server process (PID 2236) was terminated by exception 0xC0000409
testrun/subscription_20/015_stream/log/015_stream_publisher.log:2024-09-11 19:59:41.359 PDT [10656] LOG:  server process (PID 14712) was terminated by exception 0xC000012D
testrun/subscription_3/015_stream/log/015_stream_publisher.log:2024-09-11 20:02:23.815 PDT [13704] LOG:  server process (PID 13992) was terminated by exception 0xC00000FD
testrun/subscription_9/015_stream/log/015_stream_publisher.log:2024-09-11 19:59:41.360 PDT [9760] LOG:  server process (PID 11608) was terminated by exception 0xC0000142
...
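
For readers unfamiliar with the hexadecimal exception codes in these log lines, the following is a minimal Python sketch that maps them to their documented NTSTATUS names. The helper name describe_exception and the table are illustrative only and not part of any PostgreSQL tooling; the table covers just the four codes that appear above.

```python
# NTSTATUS values seen in the log excerpts above, with their symbolic names
# as documented by Microsoft. Only the four codes from the logs are listed.
NTSTATUS_NAMES = {
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
    0xC000012D: "STATUS_COMMITMENT_LIMIT",
    0xC0000142: "STATUS_DLL_INIT_FAILED",
}


def describe_exception(code_text: str) -> str:
    """Translate a hex code such as '0xC00000FD' into its NTSTATUS name.

    Hypothetical helper for illustration; unknown codes are echoed back.
    """
    code = int(code_text, 16)  # int() accepts the '0x' prefix with base 16
    return NTSTATUS_NAMES.get(code, f"unknown NTSTATUS 0x{code:08X}")


if __name__ == "__main__":
    for text in ("0xC00000FD", "0xC0000409", "0xC000012D", "0xC0000142"):
        print(text, "=", describe_exception(text))
```

0xC000012D (STATUS_COMMITMENT_LIMIT) in particular points at exhausted commit charge rather than a problem in the crashing process itself.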

When the tests fail, I see Commit Charge reaching 100% (see ss-2 in the
attachment), while Physical Memory isn't all used up. To get the OS back
to a working state, I had to reboot it; killing processes and logging off
and on didn't help.

I reproduced this issue again, investigated it, and found that it is
caused by robocopy (used by PostgreSQL::Test::Cluster->init), which
leaks kernel objects, namely "Event objects" in the Non-Paged pool, on
each run.

This can be seen with Kernel Pool Monitor, or just with this simple
PowerShell script:
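for ($i = 1; $i -le 100; $i++)
{
    echo "iteration $i"
    # Start each run from an empty target directory
    rm -r c:\temp\target
    # Copy the template tree; /NJH /NFL /NDL /NP trim the output
    robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\initdb-template c:\temp\target
    # Print the current size of the non-paged pool, in bytes
    Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | % PoolNonpagedBytes
}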

Here is what it shows for me:
iteration 1
               Total    Copied   Skipped  Mismatch    FAILED    Extras
    Dirs :        27        27         0         0         0         0
   Files :       968       968         0         0         0         0
   Bytes :   38.29 m   38.29 m         0         0         0         0
   Times :   0:00:00   0:00:00                       0:00:00   0:00:00
...
1226063872
...
iteration 100
               Total    Copied   Skipped  Mismatch    FAILED    Extras
    Dirs :        27        27         0         0         0         0
   Files :       968       968         0         0         0         0
   Bytes :   38.29 m   38.29 m         0         0         0         0
   Times :   0:00:00   0:00:00                       0:00:00   0:00:00
...
1245220864

(That is, roughly 0.1-0.2 MB is leaked per robocopy run.)
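
The per-run figure can be checked from the two PoolNonpagedBytes samples shown above. The sketch below assumes the first value was printed after iteration 1 and the second after iteration 100, so 99 robocopy runs separate them; the variable and function names are illustrative only.

```python
# PoolNonpagedBytes samples quoted in the output above (bytes).
first_sample = 1226063872  # assumed: printed after iteration 1
last_sample = 1245220864   # assumed: printed after iteration 100
runs_between = 100 - 1     # robocopy runs separating the two samples


def leak_per_run(first: int, last: int, runs: int) -> float:
    """Average growth of the non-paged pool per run, in bytes."""
    return (last - first) / runs


per_run = leak_per_run(first_sample, last_sample, runs_between)

if __name__ == "__main__":
    print(f"total growth: {last_sample - first_sample} bytes")
    print(f"per run: {per_run:.0f} bytes (~{per_run / 1e6:.2f} MB)")
```

This is only an average over the whole loop; other activity on the system also changes the non-paged pool, so individual runs may differ.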

I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
robocopy v14393 on Windows 10 doesn't make the leak go away.

Perhaps this information can be helpful for someone who is running
buildfarm/CI tests on Windows animals...

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-11%2007%3A24%3A53

Best regards,
Alexander

Attachment Content-Type Size
ss.zip application/zip 188.1 KB
