From: | r(dot)zharkov(at)postgrespro(dot)ru |
---|---|
To: | Badrul Chowdhury <bachow(at)microsoft(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Intermittent pg_ctl failures on Windows |
Date: | 2019-07-17 10:40:36 |
Message-ID: | 0057a0ce5a8893e4713a9a9d7e352162@postgrespro.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello!
We reproduced these errors on out buildfarm and my windows workstation.
We used small TAP test that restarts PostgresNode in loop. Additionally,
constant WAITS_PER_SEC in the pg_ctl.c file has been increased 1000
times.
There are two different problems with pg_ctl:
1 - share access to postmaster.pid;
2 - share access to logfile. Postmaster runs in the shell (CMD.EXE).
When the server stops cmd.exe may blocks logfile from been opened by new
cmd.exe.
> waiting for server to shut down.... done
> server stopped
> waiting for server to start....The process cannot access the file
> because it is being used by another process.
> stopped waiting
> pg_ctl: could not start server
> Examine the log output.
To avoid the first error we made two changes:
pg_ctl now opens the postmaster.pid file using pgwin32_open() function
to correctly handle share locks.
On Windows systems we cannot handle ERROR_DELETE_PENDING because
GetLastError() returns ERROR_ACCESS_DENIED instead.
So we rename the lock files before delete them.
To avoid the second error we added the wait for cmd.exe finish in the
do_stop() and do_restart() functions.
Git patch is in the attachment.
On 2018-03-12 18:55, Badrul Chowdhury wrote:
> Hi Tom,
>
> This is a great catch. I am looking into it: I will start by
> reproducing the error as you suggested.
>
> Thanks,
> Badrul
>
> -----Original Message-----
> From: Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> Sent: Saturday, March 10, 2018 2:48 PM
> To: pgsql-hackers(at)lists(dot)postgresql(dot)org
> Subject: Intermittent pg_ctl failures on Windows
>
> The buildfarm's Windows members occasionally show weird pg_ctl
> failures, for instance this recent case:
>
> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2018-03-10%252020%253A30%253A20&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=qBtMsJ0EJFs4DVtkA6TZJhCDNlj392uNxsB6MHnu7po%3D&reserved=0
>
> ### Restarting node "master"
> # Running: pg_ctl -D
> G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata
> -l
> G:/prog/bf/root/HEAD/pgsql.build/src/test/recovery/tmp_check/log/006_logical_decoding_master.log
> restart waiting for server to shut down.... done server stopped
> waiting for server to start....The process cannot access the file
> because it is being used by another process.
> stopped waiting
> pg_ctl: could not start server
> Examine the log output.
> Bail out! system pg_ctl failed
>
> or this one:
>
> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbuildfarm.postgresql.org%2Fcgi-bin%2Fshow_log.pl%3Fnm%3Dbowerbird%26dt%3D2017-12-29%252023%253A30%253A24&data=04%7C01%7Cbachow%40microsoft.com%7C28a8094e84c74c26ecb108d586d91a9d%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C636563189370087651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwifQ%3D%3D%7C-1&sdata=NdoDkZxBagXpiPDjNmhN6znHh%2BITyjEv2StPpLaabaw%3D&reserved=0
>
> ### Stopping node "subscriber" using mode fast # Running: pg_ctl -D
> c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata
> -m fast stop waiting for server to shut down....pg_ctl: could not open
> PID file
> "c:/prog/bf/root/HEAD/pgsql.build/src/test/subscription/tmp_check/t_001_rep_changes_subscriber_data/pgdata/postmaster.pid":
> Permission denied Bail out! system pg_ctl failed
>
> I'd been writing these off as Microsoft randomness and/or antivirus
> interference, but it suddenly occurred to me that there might be a
> consistent explanation: since commit f13ea95f9, when pg_ctl is waiting
> for server start/stop, it is trying to read postmaster.pid
> more-or-less concurrently with the postmaster writing to that file.
> On Unix that's not much of a problem, but I believe that on Windows
> you have to specifically open the file with sharing enabled, or you
> get error messages like these.
> The postmaster should be enabling sharing, because port.h redirects
> open/fopen to pgwin32_open/pgwin32_fopen which enable the sharing
> flags.
> But it only does that #ifndef FRONTEND. So pg_ctl is just using naked
> open(), which could explain these failures.
>
> If this theory is accurate, it should be pretty easy to replicate the
> problem if you modify the postmaster to hold postmaster.pid open
> longer when rewriting it, e.g. stick fractional-second sleeps into
> CreateLockFile and AddToDataDirLockFile.
>
> I'm not in a position to investigate this in detail nor test a fix,
> but I think somebody should.
>
> regards, tom lane
--
regars, Roman Zharkov
Attachment | Content-Type | Size |
---|---|---|
pg_ctl.patch | text/x-diff | 7.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | David Rowley | 2019-07-17 11:10:52 | Re: Tid scan improvements |
Previous Message | Amit Kapila | 2019-07-17 10:40:07 | Re: SegFault on 9.6.14 |