From: | Andrew Dunstan <andrew(at)dunslane(dot)net> |
---|---|
To: | Alexander Lakhin <exclusion(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Robocopy might be not robust enough for never-ending testing on Windows |
Date: | 2024-09-14 14:22:11 |
Message-ID: | 7e063c60-3fa8-4a63-8601-346788ed2233@dunslane.net |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 2024-09-14 Sa 9:00 AM, Alexander Lakhin wrote:
> Hello hackers,
>
> While trying to reproduce inexplicable drongo failures (e. g., [1])
> seemingly related to BackgroundPsql, I stumbled upon close, but not
> the same issue. After many (6-8 thousands) iterations of the
> 015_stream.pl TAP test, psql failed to start with a
> STATUS_DLL_INIT_FAILED
> error, and a corresponding Windows popup preventing a test exit (see ss-1
> in the archive attached).
>
> Upon reaching that state of the system, following test runs fail with one
> or another error related to memory (not only with psql, but also with the
> server processes):
> testrun/subscription_13/015_stream/log/015_stream_publisher.log:2024-09-11
> 20:01:51.777 PDT [8812] LOG: server process (PID 11532) was
> terminated by exception 0xC00000FD
> testrun/subscription_14/015_stream/log/015_stream_subscriber.log:2024-09-11
> 20:01:19.806 PDT [2036] LOG: server process (PID 10548) was
> terminated by exception 0xC00000FD
> testrun/subscription_16/015_stream/log/015_stream_publisher.log:2024-09-11
> 19:59:41.513 PDT [9128] LOG: server process (PID 14476) was
> terminated by exception 0xC0000409
> testrun/subscription_19/015_stream/log/015_stream_publisher.log:2024-09-11
> 20:03:27.801 PDT [10156] LOG: server process (PID 2236) was
> terminated by exception 0xC0000409
> testrun/subscription_20/015_stream/log/015_stream_publisher.log:2024-09-11
> 19:59:41.359 PDT [10656] LOG: server process (PID 14712) was
> terminated by exception 0xC000012D
> testrun/subscription_3/015_stream/log/015_stream_publisher.log:2024-09-11
> 20:02:23.815 PDT [13704] LOG: server process (PID 13992) was
> terminated by exception 0xC00000FD
> testrun/subscription_9/015_stream/log/015_stream_publisher.log:2024-09-11
> 19:59:41.360 PDT [9760] LOG: server process (PID 11608) was
> terminated by exception 0xC0000142
> ...
>
> When tests fail, I see Commit Charge reaching 100% (see ss-2 in the
> attachment), while Physical Memory isn't all used up. To get OS to a
> working state, I had to reboot it — killing processes, logoff/logon
> didn't
> help.
>
> I reproduced this issue again, investigated it and found out that it is
> caused by robocopy (used by PostgreSQL::Test::Cluster->init), which is
> leaking kernel objects, namely "Event objects" within Non-Paged pool on
> each run.
>
> This can be seen with Kernel Pool Monitor, or just with this simple PS
> script:
> for ($i = 1; $i -le 100; $i++)
> {
> echo "iteration $i"
> rm -r c:\temp\target
> robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\initdb-template c:\temp\target
> Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | %
> PoolNonpagedBytes
> }
>
> It shows to me:
> iteration 1
> Total Copied Skipped Mismatch FAILED Extras
> Dirs : 27 27 0 0 0 0
> Files : 968 968 0 0 0 0
> Bytes : 38.29 m 38.29 m 0 0 0 0
> Times : 0:00:00 0:00:00 0:00:00 0:00:00
> ...
> 1226063872
> ...
> iteration 100
> Total Copied Skipped Mismatch FAILED Extras
> Dirs : 27 27 0 0 0 0
> Files : 968 968 0 0 0 0
> Bytes : 38.29 m 38.29 m 0 0 0 0
> Times : 0:00:00 0:00:00 0:00:00 0:00:00
> ...
> 1245220864
>
> (That is, 0.1-0.2 MB leaks per one robocopy run.)
>
> I observed this on Windows 10 (Version 10.0.19045.4780), with all updates
> installed, but not on Windows Server 2016 (10.0.14393.0). Moreover, using
> robocopy v14393 on Windows 10 doesn't affect the issue.
>
> Perhaps this information can be helpful for someone who is running
> buildfarm/CI tests on Windows animals...
>
> [1]
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-09-11%2007%3A24%3A53
>
>
Interesting, good detective work. Still, wouldn't this mean drongo would
fail consistently?
I wonder why we're using robocopy instead of our own RecursiveCopy module?
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Andrew Dunstan | 2024-09-14 14:39:11 | Re: meson vs windows perl |
Previous Message | Bruce Momjian | 2024-09-14 13:44:14 | Re: Detailed release notes |