Re: Robocopy might be not robust enough for never-ending testing on Windows

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Robocopy might be not robust enough for never-ending testing on Windows
Date: 2024-09-14 14:22:11
Message-ID: 7e063c60-3fa8-4a63-8601-346788ed2233@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 2024-09-14 Sa 9:00 AM, Alexander Lakhin wrote:
> Hello hackers,
>
> While trying to reproduce inexplicable drongo failures (e. g., [1])
> seemingly related to BackgroundPsql, I stumbled upon close, but not
> the same issue. After many (6-8 thousands) iterations of the
> 015_stream.pl TAP test, psql failed to start with a
> STATUS_DLL_INIT_FAILED
> error, and a corresponding Windows popup preventing a test exit (see ss-1
> in the archive attached).
>
> Upon reaching that state of the system, following test runs fail with one
> or another error related to memory (not only with psql, but also with the
> server processes):
> testrun/subscription_13/015_stream/log/015_stream_publisher.log:2024-09-11
> 20:01:51.777 PDT [8812] LOG:  server process (PID 11532) was
> terminated by exception 0xC00000FD
> testrun/subscription_14/015_stream/log/015_stream_subscriber.log:2024-09-11
> 20:01:19.806 PDT [2036] LOG:  server process (PID 10548) was
> terminated by exception 0xC00000FD
> testrun/subscription_16/015_stream/log/015_stream_publisher.log:2024-09-11
> 19:59:41.513 PDT [9128] LOG:  server process (PID 14476) was
> terminated by exception 0xC0000409
> testrun/subscription_19/015_stream/log/015_stream_publisher.log:2024-09-11
> 20:03:27.801 PDT [10156] LOG:  server process (PID 2236) was
> terminated by exception 0xC0000409
> testrun/subscription_20/015_stream/log/015_stream_publisher.log:2024-09-11
> 19:59:41.359 PDT [10656] LOG:  server process (PID 14712) was
> terminated by exception 0xC000012D
> testrun/subscription_3/015_stream/log/015_stream_publisher.log:2024-09-11
> 20:02:23.815 PDT [13704] LOG:  server process (PID 13992) was
> terminated by exception 0xC00000FD
> testrun/subscription_9/015_stream/log/015_stream_publisher.log:2024-09-11
> 19:59:41.360 PDT [9760] LOG:  server process (PID 11608) was
> terminated by exception 0xC0000142
> ...
>
> When tests fail, I see Commit Charge reaching 100% (see ss-2 in the
> attachment), while Physical Memory isn't all used up. To get OS to a
> working state, I had to reboot it — killing processes, logoff/logon
> didn't
> help.
>
> I reproduced this issue again, investigated it and found out that it is
> caused by robocopy (used by PostgreSQL::Test::Cluster->init), which is
> leaking kernel objects, namely "Event objects" within Non-Paged pool on
> each run.
>
> This can be seen with Kernel Pool Monitor, or just with this simple PS
> script:
> for ($i = 1; $i -le 100; $i++)
> {
> echo "iteration $i"
> rm -r c:\temp\target
> robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\initdb-template c:\temp\target
> Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | %
> PoolNonpagedBytes
> }
>
> It shows to me:
> iteration 1
>                Total    Copied   Skipped  Mismatch    FAILED Extras
>     Dirs :        27        27         0         0         0 0
>    Files :       968       968         0         0         0 0
>    Bytes :   38.29 m   38.29 m         0         0         0 0
>    Times :   0:00:00   0:00:00                       0:00:00 0:00:00
> ...
> 1226063872
> ...
> iteration 100
>                Total    Copied   Skipped  Mismatch    FAILED Extras
>     Dirs :        27        27         0         0         0 0
>    Files :       968       968         0         0         0 0
>    Bytes :   38.29 m   38.29 m         0         0         0 0
>    Times :   0:00:00   0:00:00                       0:00:00 0:00:00
> ...
> 1245220864
>
> (That is, 0.1-0.2 MB leaks per one robocopy run.)
>
> I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
> installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
> robocopy v14393 on Windows 10 doesn't affect the issue.
>
> Perhaps this information can be helpful for someone who is running
> buildfarm/CI tests on Windows animals...
>
> [1]
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-11%2007%3A24%3A53
>
>

Interesting, good detective work. Still, wouldn't this mean drongo would
fail consistently?

I wonder why we're using robocopy instead of our own RecursiveCopy module?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2024-09-14 14:39:11 Re: meson vs windows perl
Previous Message Bruce Momjian 2024-09-14 13:44:14 Re: Detailed release notes